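Introduction
Have you ever needed to analyze unstructured text data? Python can help. In this article, we walk through a basic example of using Python to parse unstructured text data stored in MongoDB.
Let's write the code
First, let's connect to the MongoDB client and select the collection in the database that holds the documents we want to retrieve: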
import pymongo
import re
from matplotlib import pyplot as plt
# Connect to a MongoDB client
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["your_db"]
col = db["your_collection"]
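Since only one field is read from each document, the query can ask MongoDB to return just that field by passing a projection as the second argument to `find`. The sketch below is an illustration, not part of the original script: `iter_offer_texts` and `FakeCollection` are hypothetical names, and the stand-in collection exists only so the example runs without a database. With a real `pymongo` collection you would pass `col` instead.

```python
def iter_offer_texts(collection, field="offer_details"):
    """Yield the text stored in `field` for every document that has it."""
    # The second argument to find() is a projection: return `field`, omit `_id`.
    for doc in collection.find({}, {field: 1, "_id": 0}):
        text = doc.get(field)
        if isinstance(text, str):  # skip documents where the field is missing or not text
            yield text


class FakeCollection:
    """Hypothetical stand-in that mimics only find(filter, projection)."""

    def __init__(self, docs):
        self._docs = docs

    def find(self, filter=None, projection=None):
        # Only handles inclusion projections, which is all this sketch needs.
        wanted = [key for key, keep in (projection or {}).items() if keep]
        for doc in self._docs:
            if wanted:
                yield {key: doc[key] for key in wanted if key in doc}
            else:
                yield dict(doc)


# Made-up sample documents for illustration
sample = FakeCollection([
    {"_id": 1, "offer_details": "Java developer"},
    {"_id": 2, "title": "no details here"},
    {"_id": 3, "offer_details": "React/Angular"},
])
texts = list(iter_offer_texts(sample))
print(texts)
```

Documents without the field are skipped, so a later call to `.upper()` never sees `None`.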
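Next, we create an empty string and fill it with the offer details from every document, then split it into words and define the technologies we want to count: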
# Create an empty string that will hold all texts
all_details_string = ''
# Iterate over all documents in the collection
for doc in col.find():
    # Skip missing fields, and add a trailing space so the last word of one
    # document does not merge with the first word of the next
    all_details_string += (doc.get('offer_details') or '').upper() + ' '
# Create a list containing all words, dropping empty strings left by repeated delimiters
doc_general = [word for word in re.split(r" |/|\n", all_details_string) if word]
# Get the total word count
all_words_count = len(doc_general)
print('Total Occurrences:', all_words_count)
# Set technologies to analyze
tec_list = ['Java', 'C#', 'Angular', 'React']
count_list = []
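Matching on whole tokens means `JAVASCRIPT` is not counted as `JAVA`, but it also means a word with punctuation attached, such as `JAVA,`, is missed when splitting only on spaces, slashes, and newlines. A minimal sketch of a slightly broader split follows. The delimiter set and the name `tokenize` are assumptions chosen for illustration, and the sample text is made up.

```python
import re

# Assumed delimiter set: whitespace, slash, comma, semicolon, parentheses.
# '#', '+' and '.' are deliberately kept so tokens like C# or C++ survive.
_DELIMITERS = re.compile(r"[\s/,;()]+")


def tokenize(text):
    """Uppercase `text` and split it into non-empty tokens."""
    return [tok for tok in _DELIMITERS.split(text.upper()) if tok]


sample = "Java, JavaScript and C# (React/Angular)\nJava;"
tokens = tokenize(sample)
print(tokens)
print(tokens.count("JAVA"))
```

A trailing period would still stick to a word. Adding `.` to the delimiter set would fix that, at the cost of splitting names that contain a dot.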
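For each technology in the list, count how many times it occurs in the word list and append that count to count_list:
for tec in tec_list:
    # Words were uppercased earlier, so compare against the uppercased name
    tec_count = doc_general.count(tec.upper())
    count_list.append(tec_count)
    print(tec, 'Total Occurrences:', tec_count)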
In my environment, this printed the number of occurrences of each technology.
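Calling `list.count` once per technology scans the whole word list each time. As an alternative sketch (not the article's original approach), `collections.Counter` from the standard library tallies every word in a single pass. The word list here is invented sample data, and `count_technologies` is a hypothetical helper name.

```python
from collections import Counter


def count_technologies(words, technologies):
    """Return {technology: occurrences}, matching case-insensitively on whole words."""
    tally = Counter(word.upper() for word in words)
    # Counter returns 0 for missing keys, so absent technologies need no special case.
    return {tec: tally[tec.upper()] for tec in technologies}


# Invented sample data for illustration
words = ["JAVA", "DEVELOPER", "REACT", "JAVA", "C#", "JAVASCRIPT"]
counts = count_technologies(words, ["Java", "C#", "Angular", "React"])
print(counts)

# Share of all words, using the total word count as the denominator
shares = {tec: n / len(words) for tec, n in counts.items()}
print(shares)
```

The second step shows one possible use for the total word count: turning raw counts into a share of all words, which makes collections of different sizes comparable.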
Finally, we use the matplotlib library to create a bar chart with the technologies on the x-axis and their respective counts on the y-axis:
# Create a bar chart
labels = tec_list
values = count_list
fig, ax = plt.subplots()
ax.bar(labels, values)
# Add labels and title
ax.set_ylabel('Occurrence')
ax.set_xlabel('Technology')
ax.set_title('Occurrences of the chosen technologies')
# Show the chart
plt.show()
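Bars are often easier to compare when they are sorted by height. The following sketch reorders the labels and values before plotting. `sort_for_chart` and `plot_counts` are hypothetical helper names, the numbers are placeholder data rather than real results, and `plot_counts` is only defined here, not called, so the snippet runs without opening a window.

```python
def sort_for_chart(labels, values):
    """Return labels and values reordered from highest to lowest value."""
    pairs = sorted(zip(labels, values), key=lambda pair: pair[1], reverse=True)
    sorted_labels = [label for label, _ in pairs]
    sorted_values = [value for _, value in pairs]
    return sorted_labels, sorted_values


def plot_counts(labels, values):
    """Draw the bar chart with bars sorted by count."""
    from matplotlib import pyplot as plt  # imported here so sorting works without matplotlib

    sorted_labels, sorted_values = sort_for_chart(labels, values)
    fig, ax = plt.subplots()
    ax.bar(sorted_labels, sorted_values)
    ax.set_ylabel('Occurrence')
    ax.set_xlabel('Technology')
    ax.set_title('Occurrences of the chosen technologies')
    plt.show()


# Placeholder counts for illustration only
labels, values = sort_for_chart(['Java', 'C#', 'Angular', 'React'], [3, 1, 4, 2])
print(labels, values)
```

To draw the chart from the script's own data, one would call `plot_counts(tec_list, count_list)`.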
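Full code and final thoughts
By analyzing unstructured text data, we can gain insight into the most common words and topics. This can be useful for a wide range of applications, such as sentiment analysis and topic modeling.
Here is the full code: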
import pymongo
import re
from matplotlib import pyplot as plt
# Connect to a MongoDB client
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["your_db"]
col = db["your_collection"]
# Create an empty string that will hold all texts
all_details_string = ''
# Iterate over all documents in the collection
for doc in col.find():
    # Skip missing fields, and add a trailing space so the last word of one
    # document does not merge with the first word of the next
    all_details_string += (doc.get('offer_details') or '').upper() + ' '
# Create a list containing all words, dropping empty strings left by repeated delimiters
doc_general = [word for word in re.split(r" |/|\n", all_details_string) if word]
# Get the total word count
all_words_count = len(doc_general)
print('Total Occurrences:', all_words_count)
# Set technologies to analyze
tec_list = ['Java', 'C#', 'Angular', 'React']
count_list = []
for tec in tec_list:
    # Words were uppercased earlier, so compare against the uppercased name
    tec_count = doc_general.count(tec.upper())
    count_list.append(tec_count)
    print(tec, 'Total Occurrences:', tec_count)
# Create a bar chart
labels = tec_list
values = count_list
fig, ax = plt.subplots()
ax.bar(labels, values)
# Add labels and title
ax.set_ylabel('Occurrence')
ax.set_xlabel('Technology')
ax.set_title('Occurrences of the chosen technologies')
# Show the chart
plt.show()
Thank you!