A Beginner's Guide to Exploratory Data Analysis with Python

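Exploratory data analysis (EDA) is an approach to analyzing and summarizing a dataset in order to understand its main characteristics and to detect patterns, relationships, and anomalies. EDA is usually carried out at the start of a data analysis project, where it is used to gain insight into the data and to identify potential problems with data quality or with the planned analysis method.

EDA can involve a range of statistical and visualization techniques, such as summary statistics, histograms, box plots, scatter plots, and correlation matrices. Its purpose is to uncover the main features of the data, such as its distribution, range, central tendency, and variability, as well as any outliers or missing values. This information can then guide the development of more sophisticated statistical models or machine learning algorithms.

Common EDA tasks include checking the data for missing values, exploring relationships between variables, identifying outliers or anomalies, and detecting patterns or trends. EDA can also involve data transformations, such as standardization or scaling, that make the data better suited to analysis. Ultimately, the goal of EDA is to better understand the data and its underlying structure so that you can make more informed decisions about how to analyze it.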
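
To make the transformations mentioned above a little more concrete, here is a minimal sketch of standardization (z-scores) and min-max scaling in pandas. The salary values are made up for illustration and do not come from the survey.

```python
import pandas as pd

# Toy salary values in EUR (made up for illustration)
salaries = pd.Series([50_000.0, 60_000.0, 70_000.0, 80_000.0, 90_000.0])

# Standardization (z-score): subtract the mean, divide by the standard deviation
z_scores = (salaries - salaries.mean()) / salaries.std()

# Min-max scaling: map the values onto the range [0, 1]
min_max = (salaries - salaries.min()) / (salaries.max() - salaries.min())

print(z_scores.round(3).tolist())
print(min_max.tolist())
```

After standardization the values have mean 0 and standard deviation 1, which makes columns measured in different units easier to compare.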

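In this blog post, we introduce some basic concepts and techniques to help you get started with EDA. We will perform EDA on the IT Salary Survey dataset available at https://raw.githubusercontent.com/junn-hope/LuxAcademyBootcamp/main/IT_SalarySurvey_EU2020.csv. The dataset contains information about the salaries of IT professionals in Europe, along with their job titles, years of experience, and other demographic information.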

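Importing the Data

The first step in any data analysis project is to import the data into your programming environment. The IT Salary Survey dataset is available in CSV format and can be imported with the pandas library in Python: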

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport
import ipywidgets as widgets


url = 'https://raw.githubusercontent.com/junn-hope/LuxAcademyBootcamp/main/IT_SalarySurvey_EU2020.csv'
df = pd.read_csv(url)

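The read_csv function reads the CSV file from the given URL and creates a pandas DataFrame.
Once the data is imported, we can start exploring it with a variety of techniques.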
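
The survey's column names are long, and some later snippets in this post use short names such as 'Current Salary', 'Salary one year ago', and 'Experience'. The post does not show how those names were introduced, so the following is only a sketch of one way to do it. The mapping in short_names is an assumption, and the toy frame is a made-up stand-in for the real data.

```python
import pandas as pd

# Hypothetical mapping from the survey's long column names to short aliases.
# The pairing is an assumption chosen to match the short names used later on.
short_names = {
    'Yearly brutto salary (without bonus and stocks) in EUR': 'Current Salary',
    'Annual brutto salary (without bonus and stocks) one year ago. '
    'Only answer if staying in the same country': 'Salary one year ago',
    'Total years of experience': 'Experience',
}

# Tiny stand-in frame with the same column names as the survey (values are made up)
toy = pd.DataFrame({
    'Yearly brutto salary (without bonus and stocks) in EUR': [70000.0, 65000.0],
    'Annual brutto salary (without bonus and stocks) one year ago. '
    'Only answer if staying in the same country': [65000.0, None],
    'Total years of experience': ['5', '3'],
})

# rename() returns a new DataFrame; columns not in the mapping are left untouched
renamed = toy.rename(columns=short_names)
print(renamed.columns.tolist())
```

On the real data the same call would be df = df.rename(columns=short_names).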

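Understanding the Dataset

Once you have imported the dataset, it is important to get a basic understanding of its structure and properties. Here are some questions to ask about the dataset:

How many rows and columns does the dataset have?
What are the names of the columns?
What is the data type of each column?
Are there any missing values?

We can use shape, columns, info(), and describe(), together with isnull().sum(), to answer the questions posed above: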

# print the shape of the data (number of rows and columns)
print(df.shape)

# print the names of the columns
print(df.columns)

# print information about the data, including data types and number of non-null values
# (info() prints its report itself and returns None, so it needs no print())
df.info()

# print summary statistics for the numeric columns
print(df.describe())

# print the number of null values in each column
print(df.isnull().sum())

The output of df.info() is shown below:

RangeIndex: 1253 entries, 0 to 1252
Data columns (total 23 columns):
 #   Column                                                                                                                   Non-Null Count  Dtype  
---  ------                                                                                                                   --------------  -----  
 0   Timestamp                                                                                                                1253 non-null   object 
 1   Age                                                                                                                      1226 non-null   float64
 2   Gender                                                                                                                   1243 non-null   object 
 3   City                                                                                                                     1253 non-null   object 
 4   Position                                                                                                                 1247 non-null   object 
 5   Total years of experience                                                                                                1237 non-null   object 
 6   Years of experience in Germany                                                                                           1221 non-null   object 
 7   Seniority level                                                                                                          1241 non-null   object 
 8   Your main technology / programming language                                                                              1126 non-null   object 
 9   Other technologies/programming languages you use often                                                                   1096 non-null   object 
 10  Yearly brutto salary (without bonus and stocks) in EUR                                                                   1253 non-null   float64
 11  Yearly bonus + stocks in EUR                                                                                             829 non-null    object 
 12  Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country                 885 non-null    float64
 13  Annual bonus+stocks one year ago. Only answer if staying in same country                                                 614 non-null    object 
 14  Number of vacation days                                                                                                  1185 non-null   object 
 15  Employment status                                                                                                        1236 non-null   object 
 16  Сontract duration                                                                                                        1224 non-null   object 
 17  Main language at work                                                                                                    1237 non-null   object 
 18  Company size                                                                                                             1235 non-null   object 
 19  Company type                                                                                                             1228 non-null   object 
...
 21  Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week                        373 non-null    float64
 22  Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR  462 non-null    object 
dtypes: float64(4), object(19)
memory usage: 225.3+ KB
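
The summary above lists several numeric-looking columns, such as "Total years of experience" and "Number of vacation days", with dtype object, which usually means that some entries are not plain numbers. A minimal sketch of one common way to convert such a column uses pd.to_numeric. The sample values here are made up, and the real column may contain other formats that need their own handling.

```python
import pandas as pd

# Made-up values standing in for a numeric-looking column stored as object
raw = pd.Series(['5', '10', '1.5', 'less than a year', None],
                dtype=object, name='Total years of experience')

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric.dtype)
print(numeric.isna().sum())  # unparseable and missing entries both show up as NaN
```

Comparing the number of NaN values before and after the conversion shows how many entries were lost to coercion and may deserve a closer look.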

From this output, we can see that the dataset contains 1,253 rows and 23 columns. The column names can be listed by running df.columns, as shown below:

print(df.columns)

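The output lists the same 23 column names that appear in the info() summary above.

The dataset contains both categorical and numerical data, and some columns have missing values. The share of missing values per column can be computed as follows: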

df_length = len(df)
# fraction (0 to 1) of missing values per column, sorted from most to least missing
missing_percentages = df.isna().sum().sort_values(ascending=False) / df_length

missing_percentages

A simple way to get an overview of the data is to run a ProfileReport from the ydata_profiling package.

profile = ProfileReport(df, title ="IT Survey Profile Report", html={'style':{'full_width':True}})
profile.to_notebook_iframe()

# to_notebook_iframe() renders the HTML report inside an iframe in the Jupyter notebook

The resulting report appears inline in the notebook.

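Cleaning the Data

Before we start analyzing the data, we need to clean it. This involves handling missing values, removing duplicates, and correcting any errors in the data.

Handling Missing Values

Missing values are a common problem in real-world datasets, and we need to deal with them before analyzing the data. There are several ways to do this, including:

Dropping the rows or columns that contain missing values
Imputing missing values with a statistical measure (such as the mean or median)
Imputing missing values with a machine learning algorithm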
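
The first two options above can be sketched in a few lines of pandas. The frame below is a made-up sample rather than the survey data, and model-based imputation is left out because it needs an additional library.

```python
import pandas as pd

# Made-up sample with gaps in both columns
toy = pd.DataFrame({
    'Age': [25.0, None, 35.0, 40.0],
    'Gender': ['Male', 'Female', None, 'Female'],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: impute a numeric column with its median
imputed = toy.copy()
imputed['Age'] = imputed['Age'].fillna(imputed['Age'].median())

print(len(dropped))
print(imputed['Age'].tolist())
```

Dropping rows is the simplest choice but discards information, while median imputation keeps every row at the cost of slightly narrowing the distribution.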

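Let's start by handling the missing values. We can use the fillna() method in pandas to replace missing values with a specified value.
For example, to replace the missing values in the "Gender" column with "Unknown", and the missing values in the previous year's salary column with 0, we can do the following: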

df['Gender'] = df['Gender'].fillna('Unknown')

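# 'Salary one year ago' is a shortened column name; this assumes the long survey column was renamed beforehand
df['Salary one year ago'] = df['Salary one year ago'].fillna(0)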

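Correcting Errors

Next, we need to correct any errors in the data. In this dataset, some values in the "Experience" column are negative, which is clearly an error. We can correct this by taking the absolute value of the column: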

# assumes 'Experience' is a renamed, numeric version of the survey's experience column
df['Experience'] = df['Experience'].abs()
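
Before overwriting values, it can help to look at the rows that will change. The following is a small sketch of that check on made-up numbers. It assumes the column is already numeric, which the real experience column may not be without conversion.

```python
import pandas as pd

# Made-up experience values, including two negative entries
experience = pd.Series([3.0, -2.0, 10.0, -0.5, 7.0])

# Count and inspect the suspicious entries before changing anything
negative_mask = experience < 0
print(negative_mask.sum())
print(experience[negative_mask].tolist())

# Apply the correction once the negatives look like simple sign errors
corrected = experience.abs()
print(corrected.min())
```

If the negative entries turn out to be something other than sign errors, treating them as missing values may be a safer choice than taking the absolute value.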

Dropping Columns

We can use the drop function in pandas to remove the Timestamp column:

data = df.drop(columns=['Timestamp'])
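
The cleaning steps listed earlier also include removing duplicates, which this post does not show. A minimal sketch with pandas follows, using a made-up sample. Whether the survey actually contains duplicate rows is not established here.

```python
import pandas as pd

# Made-up sample in which the last row repeats the first one exactly
toy = pd.DataFrame({
    'City': ['Berlin', 'Munich', 'Berlin'],
    'Position': ['Data Scientist', 'Backend Developer', 'Data Scientist'],
})

# duplicated() flags rows that repeat an earlier row
print(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(len(deduplicated))
```

In survey data, identical rows can be legitimate (two respondents with the same answers), so it is worth checking duplicates after dropping identifying columns such as the timestamp rather than removing them blindly.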

Exploring the Dataset

Now that we have a clean dataset, we can start exploring it. Let's look at some basic statistics for the dataset:

# compute summary statistics for the dataset
print(df.describe())

       Age          Yearly_brutto_salary  Annual_brutto_salary  Shorter_working week
count  1,226.00000  1,253.00000           885.00000             373.00000
mean   32.50979     80,279,042.57872      632,245.87232         12.96783
std    5.66380      2,825,061,107.59049   16,805,081.75171      15.27517
min    20.00000     10,001.00000          11,000.00000          0.00000
25%    29.00000     58,800.00000          55,000.00000          0.00000
50%    32.00000     70,000.00000          65,000.00000          0.00000
75%    35.00000     80,000.00000          75,000.00000          30.00000
max    69.00000     99,999,999,999.00000  500,000,000.00000     40.00000

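The output shows that the mean yearly gross salary is about 80,279,043 EUR, while the median (the 50% row) is only 70,000 EUR. The gap comes from a few extreme entries, such as the maximum of 99,999,999,999 EUR, so the median is the more representative figure here. The previous year's salary column shows the same pattern, with a mean of 632,245.87 EUR against a median of 65,000 EUR.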
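
A maximum of 99,999,999,999 in the summary above looks more like a placeholder than a real salary. One common rule of thumb for flagging such values is the interquartile range (IQR) rule, sketched below on made-up numbers. This is a generic technique, not a step taken from the original analysis, and the 1.5 multiplier is a convention rather than a fixed rule.

```python
import pandas as pd

# Made-up salaries in EUR with one placeholder-like extreme value
salaries = pd.Series([55_000.0, 60_000.0, 70_000.0, 75_000.0, 80_000.0,
                      99_999_999_999.0])

# Interquartile range: the spread of the middle 50% of the values
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Flag values that lie more than 1.5 * IQR outside the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (salaries < lower) | (salaries > upper)

print(salaries[is_outlier].tolist())
print(salaries[~is_outlier].mean())
```

Flagged values are best inspected before being removed, since a very high salary can also be genuine.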

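Visualizing the Data

Visualizing the data is a good way to get a first sense of the patterns and relationships in a dataset. We can create visualizations with Python libraries such as matplotlib or seaborn. Here are some examples of how to create visualizations for the IT Salary Survey dataset: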

import matplotlib.pyplot as plt
import seaborn as sns

# create a histogram of the salaries
# ('Current Salary' is assumed to be a shortened name for the yearly gross salary column)
sns.histplot(df['Current Salary'], kde=False, bins=20)
plt.title('Histogram of Salaries')
plt.xlabel('Salary (EUR)')
plt.ylabel('Count')
plt.show()

# create a box plot of the salaries by gender
sns.boxplot(x='Gender', y='Current Salary', data=data)
plt.title('Box Plot of Salaries by Gender')
plt.xlabel('Gender')
plt.ylabel('Salary (EUR)')
plt.show()

# create a scatter plot of the salaries by years of experience
sns.scatterplot