Python enjoys rich library support for natural language processing. From text processing (tokenizing a text and determining its lemmas), to syntactic analysis (parsing a text and assigning syntactic roles), to semantic processing (for example named entity recognition, sentiment analysis, and document classification), everything is covered by at least one library. So, where do you start?
The goal of this article is to give an overview of relevant Python libraries for each core NLP task. Each library is introduced with a short description and a concrete code snippet for the NLP task at hand. Continuing my introduction to NLP blog article, this article only shows libraries for the core NLP tasks of text processing, syntactic and semantic analysis, and document semantics. Additionally, in the NLP utilities category, libraries for corpus management and datasets are presented.
The following libraries are covered: NLTK, TextBlob, Spacy, SciKit Learn, and Gensim.
This article originally appeared on my blog admantium.com.
Core NLP Tasks
Text Processing
Tasks: tokenization, lemmatization, stemming
The NLTK library provides a complete toolkit for text processing, including tokenization, stemming, and lemmatization.
from nltk.tokenize import sent_tokenize, word_tokenize
paragraph = '''Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.'''
sentences = []
for sent in sent_tokenize(paragraph):
    sentences.append(word_tokenize(sent))
sentences[0]
# ['Artificial', 'intelligence', 'was', 'founded', 'as', 'an', 'academic', 'discipline'
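Stemming and lemmatization follow the same pattern. Here is a minimal sketch, assuming the wordnet data (and, depending on your NLTK version, omw-1.4) has been downloaded:
from nltk import download as nltk_download
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk_download('wordnet')
stemmer = PorterStemmer()
print(stemmer.stem('running'))
# run
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('approaches'))
# approach
print(lemmatizer.lemmatize('running', pos='v'))
# run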
With TextBlob, the same text processing tasks are supported. It distinguishes itself from NLTK by delivering higher-level semantic results with easy-to-use data structures: parsing a sentence already produces rich semantic information.
from textblob import TextBlob
text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''
blob = TextBlob(text)
blob.ngrams()
#[WordList(['Artificial', 'intelligence', 'was']),
# WordList(['intelligence', 'was', 'founded']),
# WordList(['was', 'founded', 'as']),
blob.tokens
# WordList(['Artificial', 'intelligence', 'was', 'founded', 'as', 'an', 'academic', 'discipline', 'in', '1956', ',', 'and', 'in',
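Lemmatization is available on TextBlob's Word objects, which delegate to NLTK's WordNet data under the hood; a minimal sketch:
from textblob import Word
print(Word('approaches').lemmatize())
# approach
print(Word('running').lemmatize('v'))
# run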
With the modern NLP library Spacy, text processing is just the first step in a rich pipeline of mostly semantic tasks. Unlike the other libraries, a model for the target language needs to be loaded first. The recent models are not heuristics but artificial neural networks, especially transformers, which provide richer abstractions and combine better with other models.
import spacy
nlp = spacy.load('en_core_web_lg')
text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''
doc = nlp(text)
tokens = [token for token in doc]
print(tokens)
# [Artificial, intelligence, was, founded, as, an, academic, discipline
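Each Spacy token already carries its lemma, so lemmatization comes for free with the pipeline; a minimal sketch (the shown lemmas are illustrative):
# print each non-whitespace token next to its lemma
for token in doc[:5]:
    if not token.is_space:
        print(f'{token.text:<15}{token.lemma_}')
# Artificial     artificial
# intelligence   intelligence
# was            be
# founded        found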
Syntactic Analysis
Tasks: parsing, part-of-speech tagging, noun phrase extraction
Starting with NLTK again, all syntactic tasks are supported. Their output is provided as Python-native data structures, and it can always be shown as simple text output.
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser
text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''
pos_tag(word_tokenize(text))
# [('Artificial', 'JJ'),
# ('intelligence', 'NN'),
# ('was', 'VBD'),
# ('founded', 'VBN'),
# ('as', 'IN'),
# ('an', 'DT'),
# ('academic', 'JJ'),
# ('discipline', 'NN'),
# noun chunk parser
# source: https://www.nltk.org/book_1ed/ch07.html
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(grammar)
parser.parse(pos_tag(word_tokenize(text)))
#(S
# (NP Artificial/JJ intelligence/NN)
# was/VBD
# founded/VBN
# as/IN
# (NP an/DT academic/JJ discipline/NN)
# in/IN
# 1956/CD
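The parse result is an nltk.Tree object, so the matched noun phrases can also be extracted programmatically by filtering for NP subtrees:
tree = parser.parse(pos_tag(word_tokenize(text)))
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    print(' '.join(word for word, tag in subtree.leaves()))
# Artificial intelligence
# an academic discipline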
TextBlob delivers POS tags immediately when processing a text. With another method, a parse tree is created that contains rich syntactic information.
from textblob import TextBlob
text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''
blob = TextBlob(text)
blob.tags
#[('Artificial', 'JJ'),
# ('intelligence', 'NN'),
# ('was', 'VBD'),
# ('founded', 'VBN'),
blob.parse()
# Artificial/JJ/B-NP/O
# intelligence/NN/I-NP/O
# was/VBD/B-VP/O
# founded/VBN/I-VP/O
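Noun phrase extraction, the third task of this section, is a one-liner in TextBlob; the extractor returns lowercased phrases (output shortened):
blob.noun_phrases
# e.g. WordList(['artificial intelligence', 'academic discipline', ...])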
The Spacy library uses transformer neural networks to power its syntactic tasks.
import spacy
nlp = spacy.load('en_core_web_lg')
for token in nlp(text):
    print(f'{token.text:<20}{token.pos_:>5}{token.tag_:>5}')
#Artificial ADJ JJ
#intelligence NOUN NN
#was AUX VBD
#founded VERB VBN
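Spacy covers noun phrase extraction as well, exposing chunks computed from the dependency parse directly on the document:
# noun chunks are derived from the dependency parse
for chunk in nlp(text).noun_chunks:
    print(chunk.text)
# Artificial intelligence
# an academic discipline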
Semantic Analysis
Tasks: named entity recognition, word sense disambiguation, semantic role labeling
Semantic analysis is the area where NLP approaches start to differ. When using NLTK, the generated syntactic information is looked up in dictionaries to identify, for example, named entities. Consequently, when working with newer texts, entities might not be recognized.
from nltk import download as nltk_download
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
nltk_download('maxent_ne_chunker')
nltk_download('words')
text = '''
As of 2016, only three nations have flown crewed spacecraft: USSR/Russia, USA, and China. The first crewed spacecraft was Vostok 1, which carried Soviet cosmonaut Yuri Gagarin into space in 1961, and completed a full Earth orbit. There were five other crewed missions which used a Vostok spacecraft. The second crewed spacecraft was named Freedom 7, and it performed a sub-orbital spaceflight in 1961 carrying American astronaut Alan Shepard to an altitude of just over 187 kilometers (116 mi). There were five other crewed missions using Mercury spacecraft.
'''
print(ne_chunk(pos_tag(word_tokenize(text))))
# (S
# As/IN
# of/IN
# [...]
# (ORGANIZATION USA/NNP)
# [...]
# which/WDT
# carried/VBD
# (GPE Soviet/JJ)
# cosmonaut/NN
# (PERSON Yuri/NNP Gagarin/NNP)
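For word sense disambiguation, the second task listed above, NLTK implements the classic Lesk algorithm. A minimal sketch, again assuming the wordnet data is downloaded; the returned synset depends on the WordNet version:
from nltk import download as nltk_download
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk
nltk_download('wordnet')
sent = 'The spacecraft completed a full Earth orbit'
print(lesk(word_tokenize(sent), 'orbit'))
# e.g. Synset('orbit.n.01')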
The transformer models used by the Spacy library contain an implicit timestamp: their training time. This determines which texts the model has consumed, and therefore which entities the model is capable of recognizing.
import spacy
nlp = spacy.load('en_core_web_lg')
text = '''
As of 2016, only three nations have flown crewed spacecraft: USSR/Russia, USA, and China. The first crewed spacecraft was Vostok 1, which carried Soviet cosmonaut Yuri Gagarin into space in 1961, and completed a full Earth orbit. There were five other crewed missions which used a Vostok spacecraft. The second crewed spacecraft was named Freedom 7, and it performed a sub-orbital spaceflight in 1961 carrying American astronaut Alan Shepard to an altitude of just over 187 kilometers (116 mi). There were five other crewed missions using Mercury spacecraft.
'''
doc = nlp(text)
for token in doc.ents:
    print(f'{token.text:<25}{token.label_:<15}')
# 2016 DATE
# only three CARDINAL
# USSR GPE
# Russia GPE
# USA GPE
# China GPE
# first ORDINAL
# Vostok 1 PRODUCT
# Soviet NORP
# Yuri Gagarin PERSON
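Spacy also ships with the displacy visualizer, which renders the recognized entities as highlighted markup; a minimal sketch:
from spacy import displacy
# returns HTML markup highlighting the entities;
# with jupyter=True it renders inline in a notebook instead
html = displacy.render(doc, style='ent', jupyter=False)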
Document Semantics
Tasks: text classification, topic modeling, sentiment analysis, toxicity recognition
Sentiment analysis is another task where NLP approaches differ: looking up word meanings in a dictionary versus word similarities encoded on word or document vectors.
TextBlob has built-in sentiment analysis that returns the polarity (the overall positive or negative connotation) and the subjectivity (the degree of personal opinion) of a text.
from textblob import TextBlob
text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''
blob = TextBlob(text)
blob.sentiment
#Sentiment(polarity=0.16180290297937355, subjectivity=0.42155589508530683)
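The sentiment can also be inspected per sentence, since blob.sentences yields Sentence objects that each carry their own score:
for sentence in blob.sentences:
    print(f'{sentence.sentiment.polarity:+.2f} {sentence.raw[:60]}')
# prints each sentence's polarity next to its first 60 characters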
Spacy does not include text classification out of the box, but it can be added as a separate pipeline step. The following code is long and contains several Spacy-internal objects and data structures; a future article will explain it in more detail.
import spacy
from spacy.tokens import DocBin

## train single label categorization from multi-label dataset
## note: `dataset` is a corpus reader and `get_text` a helper function
## from the surrounding project; they are not shown here
def convert_single_label(dataset, filename):
    db = DocBin()
    nlp = spacy.load('en_core_web_lg')
    for index, fileid in enumerate(dataset):
        cat_dict = {cat: 0 for cat in dataset.categories()}
        cat_dict[dataset.categories(fileid).pop()] = 1
        doc = nlp(get_text(fileid))
        doc.cats = cat_dict
        db.add(doc)
    db.to_disk(filename)
## load trained model and apply to text
nlp = spacy.load('textcat_multilabel_model/model-best')
text = dataset.raw(42)
doc = nlp(text)
estimated_cats = sorted(doc.cats.items(), key=lambda i:float(i[1]), reverse=True)
print(dataset.categories(42))
# ['orange']
print(estimated_cats)
# [('nzdlr', 0.998894989490509), ('money-supply', 0.9969857335090637), ... ('orange', 0.7344251871109009),
SciKit Learn is a general-purpose machine learning library that provides many clustering and classification algorithms. It operates on numerical input only, and therefore requires the text to be vectorized, for example with Gensim's pre-trained word vectors or with the built-in feature vectorizers. To give just one example, here is a snippet that converts raw text to word vectors and then applies the KMeans clustering algorithm to them.
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

# note: `dataset` is assumed to hold one feature dictionary per document
# (e.g. token counts), prepared by the surrounding project
vectorizer = DictVectorizer(sparse=False)
x_train = vectorizer.fit_transform(dataset['train'])
kmeans = KMeans(n_clusters=8, random_state=0, n_init="auto").fit(x_train)
print(kmeans.labels_.shape)
# (8551, )
print(kmeans.labels_)
# [4 4 4 ... 6 6 6]
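The built-in feature vectorizers mentioned above work directly on raw strings. Here is a minimal sketch with TfidfVectorizer on a purely illustrative toy corpus:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
# toy corpus; replace with the raw documents of a real dataset
documents = [
    'Artificial intelligence was founded as an academic discipline.',
    'Machine learning has dominated the field in the 21st century.',
    'The first crewed spacecraft was Vostok 1.',
    'Mercury spacecraft carried American astronauts into space.',
]
x_train = TfidfVectorizer(stop_words='english').fit_transform(documents)
kmeans = KMeans(n_clusters=2, random_state=0, n_init='auto').fit(x_train)
print(kmeans.labels_)
# e.g. [0 0 1 1]; the assignment depends on the initialization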
Finally, Gensim is a library specialized in topic classification on large-scale corpora. The following snippet loads a built-in dataset, vectorizes the tokens of each document, and runs the LDA clustering algorithm. When running on a CPU only, this can take up to 15 minutes.
# source: https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html, https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html
import logging
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import LdaModel
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
docs = api.load('text8')
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
_ = dictionary[0]
id2word = dictionary.id2token
# Define and train the model
model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=2000,
    alpha='auto',
    eta='auto',
    iterations=400,
    num_topics=10,
    passes=20,
    eval_every=None
)
print(model.num_topics)
# 10
print(model.top_topics(corpus)[6])
# ([(4.201401e-06, 'done'),
# (4.1998064e-06, 'zero'),
# (4.1478743e-06, 'eight'),
# (4.1257395e-06, 'one'),
# (4.1166854e-06, 'two'),
# (4.085097e-06, 'six'),
# (4.080696e-06, 'language'),
# (4.050306e-06, 'system'),
# (4.041121e-06, 'network'),
# (4.0385708e-06, 'internet'),
# (4.0379923e-06, 'protocol'),
# (4.035399e-06, 'open'),
# (4.033435e-06, 'three'),
# (4.0334166e-06, 'interface'),
# (4.030141e-06, 'four'),
# (4.0283044e-06, 'seven'),
# (4.0163245e-06, 'no'),
# (4.0149207e-06, 'i'),
# (4.0072555e-06, 'object'),
# (4.007036e-06, 'programming')],
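Once trained, the model can estimate the topic mixture of unseen documents through the same dictionary; a minimal sketch with a hypothetical token list:
# infer the topic distribution of a new (tokenized) document;
# the topic ids and probabilities depend on the trained model
bow = dictionary.doc2bow(['computer', 'network', 'protocol'])
print(model.get_document_topics(bow))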
Utilities
Corpus Management
NLTK provides corpus readers for plain text, markdown, and even Twitter feeds in JSON format. A reader is created by passing a file path, and then provides basic statistics as well as iterators to work through all contained files.
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpus = PlaintextCorpusReader('wikipedia_articles', r'.*\.txt')
print(corpus.fileids())
# ['AI_alignment.txt', 'AI_safety.txt', 'Artificial_intelligence.txt', 'Machine_learning.txt', ...]
print(len(corpus.sents()))
# 47289
print(len(corpus.words()))
# 1146248
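The iterators also work per file, for example to compute per-document statistics:
# word count per document in the corpus
for fileid in corpus.fileids():
    print(f'{fileid:<35}{len(corpus.words(fileid)):>8} words')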
Gensim processes text files to form a word vector representation of each document, which can then be used for its main use case, topic classification. The documents need to be handled by an iterator that wraps the traversal of a directory, and the corpus is then built as a collection of word vectors. However, this corpus representation is hard to externalize and reuse with other libraries. The following snippet is an excerpt from above: it loads the dataset included in Gensim, then creates a word-vector-based representation.
import gensim.downloader as api
from gensim.corpora import Dictionary
docs = api.load('text8')
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
print('Number of unique tokens: %d' % len(dictionary))
# Number of unique tokens: 253854
print('Number of documents: %d' % len(corpus))
# Number of documents: 1701
Datasets
NLTK provides several ready-to-use datasets, for example Reuters news excerpts, European Parliament proceedings, and open books from Project Gutenberg. See the complete dataset and model list.
from nltk.corpus import reuters
print(len(reuters.fileids()))
#10788
print(reuters.categories()[:43])
# ['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil']
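Individual documents are accessed by their file id, together with their assigned categories; the shown category is illustrative:
fileid = reuters.fileids()[0]
print(reuters.categories(fileid))
# e.g. ['trade']
print(reuters.raw(fileid)[:90])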
SciKit Learn includes datasets for newsgroups, real estate, and even IT intrusion detection, see the complete list. Here is a quick example for the newsgroups dataset.
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups()
dataset.data[1]
# "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll.
Conclusion
For Python NLP projects, an abundance of library choices exists. To help you get started, this article provided an NLP task-driven overview with compact library explanations and code snippets. Starting with text processing, you saw how to create tokens and lemmas from a text. Continuing with syntactic analysis, you learned how to generate part-of-speech tags and the grammatical structure of sentences. And arriving at semantics, recognizing named entities in a text, as well as text sentiment, can also be solved in a few lines of code. For the additional tasks of corpus management and accessing ready-made datasets, you also saw library examples. All in all, this article should give you a good start into your next NLP project when working on core NLP tasks.
The evolution of NLP approaches towards using neural networks, and especially large language models, has triggered the creation and adaptation of a whole new set of libraries, covering text vectorization, neural network definition and training, language generation applications, and much more. These models address all of the advanced NLP tasks and will be covered in future articles.