Medical Disease Information Retrieval System: Code Documentation and Implementation Guide
#python #datascience #data #nlp

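The Medical Disease Information Retrieval System is a Python program that retrieves disease information in response to user queries. This document explains the code and how to set the system up. The system uses natural language processing (NLP) techniques, namely text preprocessing and TF-IDF vectorization, to compute the cosine similarity between a user query and a collection of disease-related documents.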

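Code overview:
The code is divided into several parts, each responsible for a specific task. Let's go through each part in detail:

Importing the required libraries:
The first lines import the necessary libraries: os for file handling, NLTK for natural language processing, and scikit-learn (sklearn) for TF-IDF vectorization.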

import os
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# nltk.download('stopwords')

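Preprocessing the dataset:
The code reads the text files in the "Data Set" folder and preprocesses each document. On every line it drops the field label before the colon, removes stop words, and stems the remaining words with the Snowball stemmer. The result is a corpus of preprocessed documents.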

# set up stop words and stemmer
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

# read all txt files from Data Set folder
folder_path = 'Data Set'
docs = []
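for filename in sorted(os.listdir(folder_path)):
    # skip anything that is not a .txt file
    if not filename.endswith('.txt'):
        continue
    file_path = os.path.join(folder_path, filename)
    # assumes the files are UTF-8 encoded
    with open(file_path, 'r', encoding='utf-8') as file:
        doc = file.read()
        docs.append(doc)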

# pre-process the documents
preprocessed_docs = []
for doc in docs:
    preprocessed_doc = []
    for line in doc.split('\n'):
        if ':' in line:
            # split on the first colon only so values containing ':' stay intact
            line = line.split(':', 1)[1].strip()
        words = line.split()
        words = [word for word in words if word.lower() not in stop_words]
        words = [stemmer.stem(word) for word in words]
        preprocessed_doc.extend(words)
    preprocessed_doc = ' '.join(preprocessed_doc)
    preprocessed_docs.append(preprocessed_doc)
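
To see what this preprocessing does to a single line, here is a minimal sketch. It assumes a tiny hand-written stop word set (DEMO_STOP_WORDS, a stand-in for NLTK's English list so the example runs without downloading the corpus) and a hypothetical helper named preprocess_line. The sample sentence is made up.

```python
from nltk.stem import SnowballStemmer

# Assumption: a tiny stand-in for stopwords.words('english'),
# used here so the sketch runs without the NLTK corpus download.
DEMO_STOP_WORDS = {"the", "is", "a", "an", "of", "and"}

stemmer = SnowballStemmer("english")


def preprocess_line(line, stop_words=DEMO_STOP_WORDS):
    """Drop the field label, remove stop words, and stem the rest."""
    if ":" in line:
        # keep everything after the first colon (the field value)
        line = line.split(":", 1)[1].strip()
    words = [w for w in line.split() if w.lower() not in stop_words]
    return [stemmer.stem(w) for w in words]


# Made-up sample line in the "Label: value" format the system expects.
tokens = preprocess_line("Symptoms: the patient is running a fever")
print(tokens)
```

Words are compared with the stop list while punctuation is still attached, as in the original code, so a token such as "the," would not be filtered out.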

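Generating the TF-IDF matrix:
Using the preprocessed corpus, the code builds a TF-IDF matrix with scikit-learn's TfidfVectorizer. Each row is a document, each column is a term, and each entry weights how important that term is to that document.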

# generate tf-idf matrix of the terms
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_docs)
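
The shape and scaling of that matrix are easier to see on a toy corpus. The sketch below uses three made-up, already-stemmed documents (toy_docs is an illustration, not the real dataset) and relies on TfidfVectorizer's default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-preprocessed documents (illustration only).
toy_docs = [
    "fever cough fatigu",
    "fever rash itch",
    "thirst fatigu blur vision",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(toy_docs)

# One row per document, one column per distinct term.
print(tfidf_matrix.shape)
print(sorted(vectorizer.vocabulary_))

# With the default norm='l2', every non-empty row has unit length.
row_norms = np.asarray(
    np.sqrt(tfidf_matrix.multiply(tfidf_matrix).sum(axis=1))
).ravel()
print(row_norms)

# A term found in fewer documents gets a larger weight:
# in document 0, 'cough' (1 document) outweighs 'fever' (2 documents).
vocab = vectorizer.vocabulary_
row0 = tfidf_matrix[0].toarray().ravel()
print(row0[vocab["cough"]] > row0[vocab["fever"]])
```

The unit-length rows matter for the next step, because they let a plain dot product stand in for cosine similarity.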

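Processing the user query:
The code prompts the user for a query and preprocesses it the same way as the documents, by removing stop words and stemming the remaining words. The processed query is then used to compute the cosine similarity between the query and every document.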

while True:
    # prompt the user to enter a query
    query = input('\nEnter your query: ').strip()

    # leave the loop on 'quit' instead of running a search for it
    if query == 'quit':
        break

    # pre-process the query
    preprocessed_query = []
    for word in query.split():
        if word.lower() not in stop_words:
            word = stemmer.stem(word)
            preprocessed_query.append(word)
    preprocessed_query = ' '.join(preprocessed_query)

    # calculate the cosine similarity between the query and each document
    cosine_similarities = tfidf_matrix.dot(
        vectorizer.transform([preprocessed_query]).T).toarray().flatten()
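
The code above computes a dot product rather than calling a cosine function. A small check shows why the two agree: with TfidfVectorizer's default norm='l2', document and query vectors already have unit length. The toy documents and the query below are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up, already-preprocessed documents (illustration only).
toy_docs = [
    "fever cough fatigu",
    "fever rash itch",
    "thirst fatigu blur vision",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(toy_docs)
query_vec = vectorizer.transform(["fever rash"])

# Same expression as in the system's query loop.
dot_scores = tfidf_matrix.dot(query_vec.T).toarray().flatten()

# Reference value from scikit-learn's cosine_similarity helper.
cos_scores = cosine_similarity(tfidf_matrix, query_vec).flatten()

print(np.allclose(dot_scores, cos_scores))
print(int(dot_scores.argmax()))  # index of the best-matching document
```

If the vectorizer were created with norm=None, the dot product would no longer equal the cosine similarity and the scores would favor longer documents.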

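Retrieving the disease information:
Based on the cosine similarity scores, the code picks the most similar document and extracts the disease information from it: name, prevalence, risk factors, symptoms, treatments, and preventive measures.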

    # find the index of the most similar document
    most_similar_doc_index = cosine_similarities.argsort()[::-1][0]

    # retrieve the disease information from the most similar document
    most_similar_doc = docs[most_similar_doc_index]
    disease_name = ''
    prevalence = ''
    risk_factors = ''
    symptoms = ''
    treatments = ''
    preventive_measures = ''

    # split on the first colon only so values containing ':' are not cut off
    for line in most_similar_doc.split('\n'):
        if line.startswith('Disease Name:'):
            disease_name = line.split(':', 1)[1].strip()
        elif line.startswith('Prevalence:'):
            prevalence = line.split(':', 1)[1].strip()
        elif line.startswith('Risk Factors:'):
            risk_factors = line.split(':', 1)[1].strip()
        elif line.startswith('Symptoms:'):
            symptoms = line.split(':', 1)[1].strip()
        elif line.startswith('Treatments:'):
            treatments = line.split(':', 1)[1].strip()
        elif line.startswith('Preventive Measures:'):
            preventive_measures = line.split(':', 1)[1].strip()
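
The same extraction can be written more compactly with a dictionary. This is only a sketch of an alternative, not the author's code: parse_disease_doc and FIELD_LABELS are hypothetical names, and the sample text is a placeholder rather than real medical data.

```python
# Labels taken from the field names the system looks for.
FIELD_LABELS = (
    "Disease Name", "Prevalence", "Risk Factors",
    "Symptoms", "Treatments", "Preventive Measures",
)


def parse_disease_doc(text):
    """Return a dict mapping each known label to its value ('' if absent)."""
    fields = {label: "" for label in FIELD_LABELS}
    for line in text.split("\n"):
        # partition splits on the first colon only, so a value such as
        # "1:1000" is kept whole
        label, sep, value = line.partition(":")
        if sep and label.strip() in fields:
            fields[label.strip()] = value.strip()
    return fields


# Placeholder document (not real medical data).
sample = (
    "Disease Name: Example Condition\n"
    "Prevalence: roughly 1:1000 (made-up figure)\n"
    "Symptoms: placeholder symptom text"
)
info = parse_disease_doc(sample)
print(info["Disease Name"])
print(info["Prevalence"])
```

Unlike the startswith checks above, this version also accepts labels with leading whitespace, and fields missing from the file simply stay empty.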

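Displaying the disease information:
Finally, the code prints the disease information retrieved from the document most similar to the user's query.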

    # print the disease information
    print(f"\nDisease Name: {disease_name}\n")
    print(f"Prevalence: {prevalence}\n")
    print(f"Risk Factors: {risk_factors}\n")
    print(f"Symptoms: {symptoms}\n")
    print(f"Treatments: {treatments}\n")
    print(f"Preventive Measures: {preventive_measures}\n\n")

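Implementation and usage:
To set up and use the disease information retrieval system, follow these steps:

Data preparation:

Create a folder named "Data Set" and place the text files containing disease information in it.
Make sure the text files follow the expected format: each field starts on its own line with a label such as "Disease Name:" or "Prevalence:".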
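
As an illustration of that format, the sketch below writes one placeholder file into a "Data Set" folder. The labels are the ones the code looks for; the file name, the temporary base directory, and all field values are assumptions made up for this example.

```python
import os
import tempfile

# Placeholder content (not real medical data); one labeled field per line.
SAMPLE_DOC = """Disease Name: Example Condition
Prevalence: placeholder prevalence text
Risk Factors: placeholder risk factor text
Symptoms: placeholder symptom text
Treatments: placeholder treatment text
Preventive Measures: placeholder prevention text
"""

# A temporary directory keeps the sketch from touching a real project folder.
base_dir = tempfile.mkdtemp()
folder_path = os.path.join(base_dir, "Data Set")
os.makedirs(folder_path, exist_ok=True)

file_path = os.path.join(folder_path, "example_condition.txt")
with open(file_path, "w", encoding="utf-8") as f:
    f.write(SAMPLE_DOC)

print(sorted(os.listdir(folder_path)))
```

In a real setup the folder would sit next to the script, since the code refers to it by the relative path 'Data Set'.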

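Installing the required libraries:

Install the required libraries by running pip install nltk scikit-learn.

Preprocessing and TF-IDF matrix generation:

Run the provided code, making sure the NLTK stop words package has been downloaded (uncomment the nltk.download('stopwords') line if necessary).

The code will preprocess the documents and generate the TF-IDF matrix.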
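
Instead of commenting the download line in and out by hand, the check can be automated. The helper below is a sketch with a hypothetical name (ensure_stopwords); it relies on nltk.data.find raising LookupError when a resource is missing.

```python
import nltk


def ensure_stopwords():
    """Download the NLTK stop word corpus only if it is not installed yet."""
    try:
        nltk.data.find("corpora/stopwords")
    except LookupError:
        # needs network access the first time it runs
        nltk.download("stopwords")
```

Calling ensure_stopwords() once, before stopwords.words('english') is used, would be enough.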

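User queries and disease information retrieval:

Enter a query in the console when prompted, or type quit to exit.
The system processes the query, computes the cosine similarities, and retrieves the most relevant disease information.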
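
The query step can be sketched end to end on a toy corpus. This is a simplified variant under stated assumptions, not the author's code: it skips stemming, uses made-up condition names and texts, and the hypothetical search function returns the top k matches while dropping documents with a score of zero.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus (illustration only): name -> symptom text.
toy_docs = {
    "Condition A": "fever cough fatigue",
    "Condition B": "fever rash itching",
    "Condition C": "thirst fatigue blurred vision",
}
names = list(toy_docs)

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(list(toy_docs.values()))


def search(query, k=2):
    """Return up to k (name, score) pairs, best first, skipping zero scores."""
    query_vec = vectorizer.transform([query])
    scores = tfidf_matrix.dot(query_vec.T).toarray().flatten()
    ranked = np.argsort(scores)[::-1][:k]
    return [(names[i], float(scores[i])) for i in ranked if scores[i] > 0]


print(search("rash and fever"))
print(search("xyzzy"))  # no known terms, so nothing is returned
```

The zero-score filter is a design choice worth considering for the real system: taking the first index of a reversed argsort always yields some document, even when the query shares no terms with the corpus.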

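The disease information retrieval system retrieves disease-related information based on user queries. By combining NLP preprocessing with TF-IDF vectorization, it can surface a disease's prevalence, risk factors, symptoms, treatments, and preventive measures. By following the implementation guide in this document, you can set up the system and query your own collection of disease documents.

Note: The disease documents in the "Data Set" folder must be correctly formatted and organized for the results to be accurate.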

Complete code:

import os
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# nltk.download('stopwords')

# set up stop words and stemmer
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

# read all txt files from Data Set folder
folder_path = 'Data Set'
docs = []
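for filename in sorted(os.listdir(folder_path)):
    # skip anything that is not a .txt file
    if not filename.endswith('.txt'):
        continue
    file_path = os.path.join(folder_path, filename)
    # assumes the files are UTF-8 encoded
    with open(file_path, 'r', encoding='utf-8') as file:
        doc = file.read()
        docs.append(doc)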

# pre-process the documents
preprocessed_docs = []
for doc in docs:
    preprocessed_doc = []
    for line in doc.split('\n'):
        if ':' in line:
            # split on the first colon only so values containing ':' stay intact
            line = line.split(':', 1)[1].strip()
        words = line.split()
        words = [word for word in words if word.lower() not in stop_words]
        words = [stemmer.stem(word) for word in words]
        preprocessed_doc.extend(words)
    preprocessed_doc = ' '.join(preprocessed_doc)
    preprocessed_docs.append(preprocessed_doc)

# generate tf-idf matrix of the terms
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_docs)

while True:
    # prompt the user to enter a query
    query = input('\nEnter your query: ').strip()

    # leave the loop on 'quit' instead of running a search for it
    if query == 'quit':
        break

    # pre-process the query
    preprocessed_query = []
    for word in query.split():
        if word.lower() not in stop_words:
            word = stemmer.stem(word)
            preprocessed_query.append(word)
    preprocessed_query = ' '.join(preprocessed_query)

    # calculate the cosine similarity between the query and each document
    cosine_similarities = tfidf_matrix.dot(
        vectorizer.transform([preprocessed_query]).T).toarray().flatten()

    # find the index of the most similar document
    most_similar_doc_index = cosine_similarities.argsort()[::-1][0]

    # retrieve the disease information from the most similar document
    most_similar_doc = docs[most_similar_doc_index]
    disease_name = ''
    prevalence = ''
    risk_factors = ''
    symptoms = ''
    treatments = ''
    preventive_measures = ''

    # split on the first colon only so values containing ':' are not cut off
    for line in most_similar_doc.split('\n'):
        if line.startswith('Disease Name:'):
            disease_name = line.split(':', 1)[1].strip()
        elif line.startswith('Prevalence:'):
            prevalence = line.split(':', 1)[1].strip()
        elif line.startswith('Risk Factors:'):
            risk_factors = line.split(':', 1)[1].strip()
        elif line.startswith('Symptoms:'):
            symptoms = line.split(':', 1)[1].strip()
        elif line.startswith('Treatments:'):
            treatments = line.split(':', 1)[1].strip()
        elif line.startswith('Preventive Measures:'):
            preventive_measures = line.split(':', 1)[1].strip()

    # print the disease information
    print(f"\nDisease Name: {disease_name}\n")
    print(f"Prevalence: {prevalence}\n")
    print(f"Risk Factors: {risk_factors}\n")
    print(f"Symptoms: {symptoms}\n")
    print(f"Treatments: {treatments}\n")
    print(f"Preventive Measures: {preventive_measures}\n\n")

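Note: This is not a perfect system, so relying on it alone or following it blindly is not recommended. It can still be a useful aid for finding preventive measures that may help people avoid further harm to their health. These preventive measures are not harmful.

For the dataset and more details, see the GitHub repository: https://github.com/muneebkhan4/Medical-Disease-Information-Retrival-System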