Basic Text Analysis with Python and MongoDB
#python #mongodb #matplotlib #analysis

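Introduction

Have you ever needed to analyze unstructured text data? Python can help you do that. In this article, we walk through a basic example of using Python to parse unstructured text data stored in MongoDB.

Let's write some code

First, let's connect to the MongoDB client and select the database and collection whose documents we want to analyze: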

import pymongo
import re
from matplotlib import pyplot as plt

# Connect to a MongoDB client
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["your_db"]
col = db["your_collection"]
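The analysis only needs one field per document, so it can help to ask MongoDB for just that field. Below is a minimal sketch, assuming the text lives in `offer_details` as in this article. `iter_offer_texts` is a hypothetical helper name, not part of pymongo; it only relies on the collection's `find(filter, projection)` method.

```python
def iter_offer_texts(collection, field="offer_details"):
    """Yield the text stored in `field` for every document that has one.

    `collection` is assumed to behave like a pymongo Collection,
    i.e. it offers find(filter, projection).
    """
    # The projection asks MongoDB to return only the field we need.
    projection = {field: 1, "_id": 0}
    for doc in collection.find({}, projection):
        text = doc.get(field)
        # Skip documents where the field is missing or not a string.
        if isinstance(text, str):
            yield text

# Usage with the collection from this article (needs a running MongoDB):
# texts = list(iter_offer_texts(col))
```

Skipping non-string values up front means the later `.upper()` call cannot fail on a document that lacks the field.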

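Next, we create an empty string and append the offer details from every document to it. We then split that string into words, count them, and define the list of technologies we want to look for: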

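# Start with an empty string that will hold the text of every document
all_details_string = ''

# Iterate over all documents in the collection
for doc in col.find():
    # Fall back to '' when the field is missing, and add a trailing space
    # so the last word of one document does not merge with the next one
    all_details_string += (doc.get('offer_details') or '').upper() + ' '

# Create a list containing all words, dropping empty strings left by repeated separators
doc_general = [word for word in re.split(r"[ /\n]+", all_details_string) if word]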

# Get all words count
all_words_count = len(doc_general)
print('Total Occurrences:', all_words_count)

# Set technologies to analyze
tec_list = ['Java', 'C#', 'Angular', 'React']
count_list = []
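One thing to keep in mind: splitting only on spaces, slashes, and newlines leaves punctuation attached to words, so "REACT," would not be counted as "REACT". The following is a minimal sketch of a slightly more forgiving tokenizer. The `tokenize` helper and the sample sentence are made up for illustration, and the set of stripped characters is an assumption you may want to adjust for your data.

```python
import re

def tokenize(text):
    """Split text into upper-case tokens and drop surrounding punctuation."""
    # Split on any run of whitespace or slashes.
    raw_tokens = re.split(r"[\s/]+", text.upper())
    # Strip common punctuation from both ends of each token.
    # '#' and '+' are deliberately not stripped, so "C#" and "C++" survive.
    tokens = [tok.strip(".,;:!?()[]\"'") for tok in raw_tokens]
    # Remove tokens that became empty after stripping.
    return [tok for tok in tokens if tok]

sample = "Java/Angular developer (React, C#).\nJava is required!"
print(tokenize(sample))
# ['JAVA', 'ANGULAR', 'DEVELOPER', 'REACT', 'C#', 'JAVA', 'IS', 'REQUIRED']
```

Note that stripping leading dots would turn a name like ".NET" into "NET", so the character set is a trade-off rather than a universal rule.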

For each technology in our list, we count how many times it occurs in the word list and append that count to count_list:

for tec in tec_list:
    # The words were upper-cased earlier, so compare against the upper-cased name
    tec_count = doc_general.count(tec.upper())
    count_list.append(tec_count)
    print(tec, 'Total Occurrences:', tec_count)
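Calling `list.count` once per technology scans the whole word list each time. As an alternative, here is a small sketch that uses `collections.Counter` from the standard library to count every word in a single pass. The `count_technologies` helper and the sample tokens are hypothetical and only serve as illustration.

```python
from collections import Counter

def count_technologies(tokens, technologies):
    """Return a dict mapping each technology name to its number of occurrences."""
    # Count every token once; a Counter returns 0 for words it has never seen.
    word_counts = Counter(tokens)
    return {tec: word_counts[tec.upper()] for tec in technologies}

tokens = ["JAVA", "ANGULAR", "JAVA", "C#", "PYTHON"]
counts = count_technologies(tokens, ["Java", "C#", "Angular", "React"])
print(counts)
# {'Java': 2, 'C#': 1, 'Angular': 1, 'React': 0}
```

For four technologies the difference is negligible, but the single pass pays off once the list of terms grows.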

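In my environment, I got the following results:

Finally, we use the matplotlib library to create a bar chart with the technologies on the x-axis and their counts on the y-axis: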

# Create a bar chart
labels = tec_list
values = count_list
fig, ax = plt.subplots()
ax.bar(labels, values)

# Add labels and title
ax.set_ylabel('Occurrence')
ax.set_xlabel('Technology')
ax.set_title('Occurrences of the chosen technologies')

# Show the chart
plt.show()
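Bar charts are often easier to read when the bars are ordered by height. A minimal sketch of sorting the labels and values together before plotting follows; `sort_for_chart` is a hypothetical helper and the numbers are made-up sample data, not results from this article.

```python
def sort_for_chart(labels, values):
    """Return labels and values reordered from the highest to the lowest value."""
    # Pair each label with its value so they stay together while sorting.
    pairs = sorted(zip(labels, values), key=lambda pair: pair[1], reverse=True)
    sorted_labels = [label for label, _ in pairs]
    sorted_values = [value for _, value in pairs]
    return sorted_labels, sorted_values

# Made-up sample counts, for illustration only
sample_labels, sample_values = sort_for_chart(
    ["Java", "C#", "Angular", "React"], [5, 2, 9, 0]
)
print(sample_labels, sample_values)
# ['Angular', 'Java', 'C#', 'React'] [9, 5, 2, 0]
```

The two returned lists can be passed to `ax.bar` in place of `labels` and `values`.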

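Full code and final thoughts

By analyzing unstructured text data, we can gain insight into the most common words and topics. This is useful for a wide range of applications, such as sentiment analysis and topic modeling.

Here is the full code: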
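To look for the most common words without a predefined list, `Counter.most_common` from the standard library is one option. This is a rough sketch under simple assumptions: `top_words` is a hypothetical helper, the tokens are sample data, and the minimum-length filter is only a crude stand-in for a real stop-word list.

```python
from collections import Counter

def top_words(tokens, n=3, min_length=3):
    """Return the n most frequent tokens with at least min_length characters."""
    # Very short tokens ("IS", "A") are usually filler, so leave them out.
    filtered = [tok for tok in tokens if len(tok) >= min_length]
    return Counter(filtered).most_common(n)

tokens = ["JAVA", "IS", "REQUIRED", "JAVA", "AND", "REACT",
          "IS", "A", "PLUS", "REACT", "JAVA"]
print(top_words(tokens, n=2))
# [('JAVA', 3), ('REACT', 2)]
```

Be aware that the length filter would also drop short names such as "C#", so lower `min_length` if those matter to you.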

import pymongo
import re
from matplotlib import pyplot as plt

# Connect to a MongoDB client
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["your_db"]
col = db["your_collection"]

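# Start with an empty string that will hold the text of every document
all_details_string = ''

# Iterate over all documents in the collection
for doc in col.find():
    # Fall back to '' when the field is missing, and add a trailing space
    # so the last word of one document does not merge with the next one
    all_details_string += (doc.get('offer_details') or '').upper() + ' '

# Create a list containing all words, dropping empty strings left by repeated separators
doc_general = [word for word in re.split(r"[ /\n]+", all_details_string) if word]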

# Get all words count
all_words_count = len(doc_general)
print('Total Occurrences:', all_words_count)

# Set technologies to analyze
tec_list = ['Java', 'C#', 'Angular', 'React']
count_list = []
for tec in tec_list:
    # The words were upper-cased earlier, so compare against the upper-cased name
    tec_count = doc_general.count(tec.upper())
    count_list.append(tec_count)
    print(tec, 'Total Occurrences:', tec_count)

# Create a bar chart
labels = tec_list
values = count_list
fig, ax = plt.subplots()
ax.bar(labels, values)

# Add labels and title
ax.set_ylabel('Occurrence')
ax.set_xlabel('Technology')
ax.set_title('Occurrences of the chosen technologies')

# Show the chart
plt.show()

Thanks!