Basic Text Analysis with Python and MongoDB

Introduction

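Have you ever needed to analyze unstructured text data? Python can help. In this article, we walk through a basic example of using Python to parse unstructured text data stored in MongoDB.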

Let's Write Some Code

First, let's connect to the MongoDB client and get a handle on the collection whose documents we will read:

import pymongo
import re
from matplotlib import pyplot as plt

# Connect to a MongoDB client
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["your_db"]
col = db["your_collection"]
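Only one field of each document is used in the rest of this article, so a projection can keep the fetched documents small. The sketch below is one possible way to do that, assuming the text lives under the offer_details key. The helper name iter_offer_details is hypothetical, and it accepts any object with a pymongo-style find(filter, projection) method.

```python
def iter_offer_details(collection, field="offer_details"):
    """Yield the text stored under `field`, skipping missing or non-string values.

    `collection` is assumed to behave like a pymongo Collection.
    """
    # The second argument to find() is a projection: return only `field`
    # and leave out the default `_id` key.
    cursor = collection.find({}, {field: 1, "_id": 0})
    for doc in cursor:
        text = doc.get(field)
        if isinstance(text, str) and text:
            yield text
```

With the connection above, list(iter_offer_details(col)) would give one string per document that actually has text.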

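Next, we create an empty string and append the offer details from every document to it. We then split the combined text into words, count them, and define the list of technologies to look for: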

# Create an empty string that contains all texts
all_details_string = ''

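# Iterate over all documents in the collection
for doc in col.find():
    # Fall back to an empty string if a document has no 'offer_details'
    text = (doc.get('offer_details') or '').upper()
    # Separate documents with a space so their boundary words do not merge
    all_details_string = (all_details_string + ' ' + text).strip()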

# Create a list containing all words
doc_general = re.split(" |/|\n", all_details_string)

# Get all words count
all_words_count = len(doc_general)
print('Total Occurrences:', all_words_count)

# Set technologies to analyze
tec_list = ['Java', 'C#', 'Angular', 'React']
count_list = []
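Splitting only on spaces, slashes, and newlines leaves punctuation attached to words, so a token such as "JAVA," would not match "JAVA", and consecutive delimiters produce empty strings. A minimal sketch of a stricter tokenizer follows. The delimiter set and the tokenize name are assumptions chosen for illustration, picked so that "#" and "+" survive inside names like C#.

```python
import re

# Whitespace, slashes, and common punctuation act as delimiters.
# '#', '+', and inner dots are kept so "C#", "C++", and "NODE.JS" stay intact.
_DELIMITERS = re.compile(r"[\s/,;:()\[\]!?\"']+")


def tokenize(text):
    """Upper-case `text` and split it into non-empty tokens."""
    parts = _DELIMITERS.split(text.upper())
    # Drop sentence-ending periods, then discard tokens that became empty.
    tokens = [part.rstrip(".") for part in parts]
    return [token for token in tokens if token]


print(tokenize("Java, C#/Angular.\nReact (and Node.js)."))
```

The result could stand in for doc_general, at the cost of a slightly different total word count than the simple split gives.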

For each technology in the list, we count how many times it occurs in the word list and append that count to count_list:

for tec in tec_list:
    count_list.append(doc_general.count(tec.upper()))
    print(tec, 'Total Occurrences:', doc_general.count(tec.upper()))
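Calling list.count once per technology rescans the whole word list each time. As a sketch of an alternative, collections.Counter tallies every token in a single pass. The sample word list here is made up for illustration and stands in for doc_general.

```python
from collections import Counter

# Made-up stand-in for doc_general (upper-cased tokens)
sample_words = ["JAVA", "REACT", "JAVA", "C#", "SQL", "REACT", "JAVA"]
tec_list = ["Java", "C#", "Angular", "React"]

# One pass over all tokens builds a token -> count mapping
word_counts = Counter(sample_words)

# A Counter returns 0 for keys it has never seen, so no KeyError here
count_list = [word_counts[tec.upper()] for tec in tec_list]

for tec, count in zip(tec_list, count_list):
    print(tec, "Total Occurrences:", count)
```

For four technologies the difference is negligible, but it grows with the length of the technology list.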

In my environment, this printed one occurrence count per technology (output not shown).

Finally, we use the matplotlib library to create a bar chart with the technologies on the x-axis and their respective counts on the y-axis:

# Create a bar chart
labels = tec_list
values = count_list
fig, ax = plt.subplots()
ax.bar(labels, values)

# Add labels and title
ax.set_ylabel('Occurrence')
ax.set_xlabel('Technology')
ax.set_title('Occurrences of the chosen technologies')

# Show the chart
plt.show()
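A bar chart is often easier to read when the bars are ordered by height. One possible sketch, using made-up counts in place of the real count_list, sorts the label/value pairs before they are handed to ax.bar:

```python
labels = ["Java", "C#", "Angular", "React"]
values = [12, 5, 7, 9]  # made-up counts, for illustration only

# Pair each label with its count and sort by count, largest first
pairs = sorted(zip(labels, values), key=lambda pair: pair[1], reverse=True)

sorted_labels = [label for label, _ in pairs]
sorted_values = [value for _, value in pairs]

# These lists can replace labels and values in ax.bar(labels, values)
print(sorted_labels, sorted_values)
```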

Full Code and Final Thoughts

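By analyzing unstructured text data, we can see which words and topics appear most often. The same counting approach is a starting point for broader applications such as sentiment analysis and topic modeling.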

Here is the full code:

import pymongo
import re
from matplotlib import pyplot as plt

# Connect to a MongoDB client
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["your_db"]
col = db["your_collection"]

# Create an empty string that contains all texts
all_details_string = ''

# Iterate over all documents in the collection
for doc in col.find():
    # Fall back to an empty string if a document has no 'offer_details'
    text = (doc.get('offer_details') or '').upper()
    # Separate documents with a space so their boundary words do not merge
    all_details_string = (all_details_string + ' ' + text).strip()

# Create a list containing all words
doc_general = re.split(" |/|\n", all_details_string)

# Get all words count
all_words_count = len(doc_general)
print('Total Occurrences:', all_words_count)

# Set technologies to analyze
tec_list = ['Java', 'C#', 'Angular', 'React']
count_list = []
for tec in tec_list:
    count_list.append(doc_general.count(tec.upper()))
    print(tec, 'Total Occurrences:', doc_general.count(tec.upper()))

# Create a bar chart
labels = tec_list
values = count_list
fig, ax = plt.subplots()
ax.bar(labels, values)

# Add labels and title
ax.set_ylabel('Occurrence')
ax.set_xlabel('Technology')
ax.set_title('Occurrences of the chosen technologies')

# Show the chart
plt.show()

Thanks!