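Building a File Analysis Dataset with Python - DEV365 Developer Community

Last year I devised several ways to visually analyze the history and structure of code, but I documented very little of it for the community. This article walks through the process I used to build a CSV file containing the path, extension, project, and line count of every source file in a project.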

Note: If you are interested in getting data out of a git repository, see my article on Using PyDriller to Extract git Information.

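This code is useful if you want to quickly analyze code files on your hard drive and build a dataset for further analysis and data visualization.

Much of the Python code I share here is borrowed from various Stack Overflow answers, with my own tweaks and changes. I have tried to credit every source I used, but this code was written a year ago and I may have forgotten a source or two.

Dependencies

I wrote this code using Python 3.8 running in a Jupyter notebook. However, the code does not require a Jupyter notebook.

The code relies on Python's standard os library as well as the pandas library for analyzing tabular data.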

import pandas as pd
import os

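The Code

Below is a sequential breakdown of the functions needed to extract data from code.

Identifying Files

You can't analyze files if you can't determine which files to analyze.

I use the os library's listdir method, along with several other path and directory methods, to recursively traverse the directory tree starting from a given directory.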

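I also intentionally ignore directories known to contain large numbers of irrelevant files, such as reporting directories, the .git directory, and directories used by the development environment.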

# Some directories should be ignored as their output is not helpful
ignored_directories = ['.git', '.vs', 'obj', 'ndependout', 'bin', 'debug']

def get_file_list(dir_path, paths=None):
    """
    Gets a list of files in this directory and all contained directories
    """

    files = list()
    contents = os.listdir(dir_path)

    for entry in contents:
        path = os.path.join(dir_path, entry)
        if os.path.isdir(path):
            # Ignore build and reporting directories
            if entry.lower() in ignored_directories:
                continue

            # Maintain an accurate, but separate, hierarchy array
            if paths is None:
                p = [entry]
            else:
                p = paths[:]
                p.append(entry)

            files = files + get_file_list(path, p)
        else:
            files.append((path, paths))

    return files
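
For comparison, the same traversal can be sketched with the standard library's os.walk. This is an alternative under stated assumptions, not the approach used in this article: the name walk_file_list is hypothetical, and the order of the returned entries may differ from get_file_list. It returns the same (path, folders) tuples, with folders set to None for files that sit directly in the root.

```python
import os

# Assumption: the same ignore list as above, compared case-insensitively
IGNORED_DIRECTORIES = {'.git', '.vs', 'obj', 'ndependout', 'bin', 'debug'}

def walk_file_list(root):
    """
    Returns (path, folders) tuples shaped like the output of get_file_list.
    folders is None for files directly in root, otherwise a list of folder names.
    """
    results = []
    for current_dir, dir_names, file_names in os.walk(root):
        # Prune ignored directories in place so os.walk never descends into them
        dir_names[:] = [d for d in dir_names if d.lower() not in IGNORED_DIRECTORIES]

        # Rebuild the folder hierarchy relative to the starting directory
        relative = os.path.relpath(current_dir, root)
        folders = None if relative == os.curdir else relative.split(os.sep)

        for name in file_names:
            results.append((os.path.join(current_dir, name), folders))
    return results
```

Pruning dir_names in place is what keeps os.walk out of the ignored directories; filtering the results afterwards would still pay the cost of walking them.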

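Determining Whether a File Contains Source Code

To determine whether a file is a source file that should be included in the results, I took a naive approach and only pay attention to files with certain extensions.

This code could benefit from a more comprehensive list, or even from some more advanced analysis of file contents. However, simply looking at the extension was enough for my purposes and should serve most readers well.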

# Common source file extensions. This list is very incomplete. Intentionally not including JSON / XML
source_extensions = [
    '.cs', 
    '.vb', 
    '.java', 
    '.r', 
    '.agc', 
    '.fs', 
    '.js', 
    '.cpp', 
    '.go', 
    '.aspx', 
    '.jsp', 
    '.do', 
    '.php', 
    '.ipynb', 
    '.sh', 
    '.html', 
    '.lua', 
    '.css'
    ]

def is_source_file(file_label):
    """
    Defines what a source file is.
    """
    file, _ = file_label
    _, ext = os.path.splitext(file)

    return ext.lower() in source_extensions
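
As one possible form of the content analysis mentioned above, the sketch below checks whether a file looks like text by sampling its first bytes. It is a rough heuristic and not part of the original code: looks_like_text and sample_size are hypothetical names, and text encoded as UTF-16 contains NUL bytes, so it would be misclassified as binary.

```python
def looks_like_text(path, sample_size=8192):
    """
    Heuristic: treat a file as text if its first bytes contain no NUL byte.
    """
    # Read only a small sample so large files stay cheap to check
    with open(path, 'rb') as f:
        sample = f.read(sample_size)
    return b'\x00' not in sample
```

A check like this could be combined with is_source_file to skip files that carry a source extension but hold binary content.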

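Counting Lines of Code

Next, I needed to be able to quickly determine the length of a file.

I used a Stack Overflow answer to work out how to efficiently count the number of lines in a file with the following code: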

def count_lines(path):
    """
    Reads the file at the specified path and returns the number of lines in that file
    """

    def _make_gen(reader):
        b = reader(2 ** 16)
        while b:
            yield b
            b = reader(2 ** 16)

    with open(path, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))

    return count

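This code reads the file in 64 KiB chunks through the raw reader and returns the total number of newline characters found. A final line that does not end with a newline is therefore not counted.

Geeks for Geeks offers a slower but easier-to-understand implementation for those who are interested: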

def count_lines_simple(path):
    with open(path, 'r') as fp:
        return sum(1 for line in fp)
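
The two approaches can disagree by one when a file's last line has no trailing newline. The sketch below illustrates this with simplified stand-ins for the two functions above; count_newlines and count_iterated_lines are hypothetical names, and both open the file in binary mode so no text decoding is involved.

```python
import os
import tempfile

def count_newlines(path):
    """Counts newline bytes in 64 KiB chunks, as count_lines does."""
    count = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(2 ** 16)
            if not chunk:
                break
            count += chunk.count(b'\n')
    return count

def count_iterated_lines(path):
    """Counts lines by iterating over the file, as count_lines_simple does."""
    with open(path, 'rb') as f:
        return sum(1 for _ in f)

# A three-line file whose last line has no trailing newline
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'first\nsecond\nthird')
    sample_path = tmp.name

newline_count = count_newlines(sample_path)
iterated_count = count_iterated_lines(sample_path)
os.remove(sample_path)

print(newline_count, iterated_count)  # 2 3
```

For a dataset meant to compare relative file sizes, an off-by-one difference per file is unlikely to matter, but it is worth knowing which definition of "line" the numbers use.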

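Getting File Metrics

Next, I built a function that takes a list of files and generates a list of file detail objects, one for each file in the list.

Each file details object contains:

The root directory the analysis started in
The full path of the file
The project the file belongs to (the base directory within the root directory)
The path of the file relative to the project directory
The name of the file
The extension of the file
The number of lines in the file

This is done by looping over all of the files and folders in the incoming files parameter, using the count_lines function to count the lines in each file, enumerating any folders, and building the path information for each file.

Once all of this information is known, a file details object is created and appended to the results list, which is returned once all files have been analyzed.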

def get_file_metrics(files, root):
    """
    This function gets all metrics for the files and returns them as a list of file detail objects
    """
    results = []

    for file, folders in files:
        lines = count_lines(file) # Slow as it actually reads the file
        _, filename = os.path.split(file)
        _, ext = os.path.splitext(filename)

        # The first folder is the project; the remaining folders form the relative path
        if folders:
            project = folders[0]
            fullpath = '/'.join(folders[1:])
        else:
            project = ''
            fullpath = ''

        # Files directly inside the project (or root) get '.' as their path
        if not fullpath:
            fullpath = '.'

        file_id = root + '/' + project + '/' + fullpath + '/' + filename

        file_details = {
                        'fullpath': file_id,
                        'root': root,
                        'project': project,
                        'path': fullpath,
                        'filename': filename,
                        'ext': ext,
                        'lines': lines,
                        }
        results.append(file_details)

    return results
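
To make the project and path columns more concrete, the sketch below isolates the folder-splitting step of get_file_metrics in a small helper. The name split_hierarchy and the folder names are hypothetical, chosen only for illustration.

```python
def split_hierarchy(folders):
    """
    Splits a folder list into (project, relative path), mirroring get_file_metrics.
    """
    # Files directly in the root have no project and use '.' as their path
    if not folders:
        return '', '.'

    project = folders[0]
    # Folders below the project are joined with '/'; '.' means "in the project itself"
    relative = '/'.join(folders[1:]) or '.'
    return project, relative

print(split_hierarchy(['ProjectA', 'Models', 'Dto']))  # ('ProjectA', 'Models/Dto')
print(split_hierarchy(['ProjectA']))                   # ('ProjectA', '.')
print(split_hierarchy(None))                           # ('', '.')
```

Treating the first folder under the root as the project works well when each immediate subdirectory of the root is one project, which is the layout this article assumes.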

Putting It All Together

Finally, I built a central function to kick off source analysis for a given path:

def get_source_file_metrics(path):
    """
    This function gets all source files and metrics associated with them from a given path
    """
    source_files = filter(is_source_file, get_file_list(path))
    return get_file_metrics(list(source_files), path)

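This function calls the other functions to build the resulting list of file details.

I can invoke it by declaring a list of the directories I am interested in and then calling the function on each of those directories.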

# Paths should be one or more directories of interest
paths = ['C:/Dev/MachineLearning/src', 'C:/Dev/MachineLearning/test']

# Pull Source file metrics
files = []
for path in paths:
    files.extend(get_source_file_metrics(path))

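Here I am analyzing all of the projects in the src and test directories of the ML.NET repository. I chose to include these as separate paths because they represent two different groupings of projects in this repository.

Saving the Results

Once the files list is populated, it can easily be used to create a pandas DataFrame for tabular analysis. DataFrames also offer an easy way to serialize data to a CSV file, as shown below: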

# Grab the file metrics for source files and put them into a data frame
df = pd.DataFrame(files)
df = df.sort_values('lines', ascending=False)

# Write to a file for other analysis processes
df.to_csv('filesizes.csv')
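
As a rough illustration of the follow-up analysis this CSV enables, the sketch below round-trips a few records through CSV and totals lines per extension. The records are hypothetical stand-ins for real get_file_metrics output, an in-memory buffer takes the place of filesizes.csv, and index=False is an optional choice that leaves the DataFrame index out of the CSV.

```python
import io
import pandas as pd

# Hypothetical records shaped like the output of get_file_metrics
records = [
    {'project': 'ProjectA', 'filename': 'Program.cs', 'ext': '.cs', 'lines': 120},
    {'project': 'ProjectA', 'filename': 'site.js', 'ext': '.js', 'lines': 40},
    {'project': 'ProjectB', 'filename': 'Model.cs', 'ext': '.cs', 'lines': 300},
]

df = pd.DataFrame(records).sort_values('lines', ascending=False)

# Write to an in-memory buffer instead of a file, without the index column
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Read the CSV back and total the line counts for each extension
loaded = pd.read_csv(buffer)
lines_by_ext = loaded.groupby('ext')['lines'].sum().to_dict()
print(lines_by_ext)
```

The same groupby call with 'project' in place of 'ext' would give the total line count per project.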

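Finally, we can preview the first 5 rows of this dataset using the df.head() method.

In my sample dataset, this yields the following results:

Next Steps

Now that you have hierarchical data saved to a CSV file, and I have illustrated a separate way to get information out of a git repository, the next steps will involve merging this data together for analysis and visualization.

Stay tuned for future updates in this series on visualizing code.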