Python NLP Library: NLTK
#python #nlp

NLTK is a sophisticated library. Under continuous development since 2009, it supports all the classic NLP tasks, from tokenization, stemming, and part-of-speech tagging to semantic indexing and dependency parsing. It also offers a rich set of additional features, such as built-in corpora, different models for its NLP tasks, and integration with scikit-learn and other Python libraries.

This article is a concise introduction to NLTK. You will see NLTK in action through short code snippets that you can use for various NLP tasks.

This article originally appeared on my blog admantium.com.

The technical context of this article is Python v3.11 and NLTK v3.8.1. All examples should work with newer versions as well.

NLTK Library Installation

NLTK can be installed via Python pip:

python3 -m pip install nltk

Several NLTK features require additional data to be downloaded first, such as stopwords or integrated corpora. For this, the built-in downloader is used. Here is an example:

import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('reuters')

Other parts, such as specialized tokenizers or stopwords, require Java libraries to be installed. See this GitHub Gist to get started.

NLP Tasks

NLTK supports several NLP tasks. Here is a short overview; the next sections provide more details:

  • Text processing
    • Tokenization
    • Stemming and lemmatization
  • Text syntax
    • Part-of-speech tagging
  • Text semantics
    • Named entity recognition
  • Document semantics
    • Clustering
    • Classification

In addition, NLTK supports the following features:

  • Datasets
  • Corpus management
  • Machine learning clustering and classification models

Text Processing

Tokenization

Tokenization is an essential first step in text processing. In general, the tokenization method should be chosen based on the project requirements and the subsequent NLP tasks. For example, when a text contains multi-word names that represent entities or persons, but the tokenizer simply splits on whitespace, named entity recognition becomes difficult.

NLTK provides a simple whitespace tokenizer, several built-in tokenizers such as NIST or Stanford, and the option to define custom tokenizers based on regular expressions.

Here is an example of the built-in sentence and word tokenizers:

from nltk.tokenize import sent_tokenize, word_tokenize

# Source: Wikipedia, Artificial Intelligence, https://en.wikipedia.org/wiki/Artificial_intelligence
paragraph = '''Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.'''

sentences = []
for sent in sent_tokenize(paragraph):
  sentences.append(word_tokenize(sent))

sentences[0]
# ['Artificial', 'intelligence', 'was', 'founded', 'as', 'an', 'academic', 'discipline'
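
The custom-tokenizer option mentioned above can be covered with NLTK's RegexpTokenizer. A minimal sketch, assuming we only want alphanumeric tokens (punctuation dropped) and reusing the paragraph variable from the previous snippet:

from nltk.tokenize import RegexpTokenizer

# keep runs of word characters, dropping punctuation
tokenizer = RegexpTokenizer(r'\w+')

print(tokenizer.tokenize(paragraph)[:8])
# ['Artificial', 'intelligence', 'was', 'founded', 'as', 'an', 'academic', 'discipline']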

Stemming and Lemmatization

As with tokenization, choosing a suitable stemming method (replacing inflected words with their word stem, e.g. cooking with cook) and lemmatization method (replacing word groups with their lemma) depends on the subsequent NLP tasks. Lemmatization plays a special role because it requires part-of-speech tags or word sense disambiguation to correctly identify the word groups.

NLTK provides several stemmer modules, such as Porter and Lancaster. For lemmatization, only WordNet is provided.

Let's compare stemming and lemmatization of the first sentence from the Wikipedia article about artificial intelligence.

from nltk.stem import LancasterStemmer

sent = 'Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding.'

stemmer = LancasterStemmer()

stemmed_sent = [stemmer.stem(word) for word in word_tokenize(sent)]
print(stemmed_sent)
# ['art', 'intellig', 'was', 'found', 'as', 'an', 'academ', 'disciplin',

And the same sentence processed with the WordNet lemmatizer:

from nltk.stem import WordNetLemmatizer

sent = 'Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding.'

lemmatizer = WordNetLemmatizer()

lemmas = [lemmatizer.lemmatize(word) for word in word_tokenize(sent)]
print(lemmas)
# ['Artificial', 'intelligence', 'wa', 'founded', 'a', 'an', 'academic', 'discipline'
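
As noted above, lemmatization benefits from part-of-speech information. WordNetLemmatizer.lemmatize treats every token as a noun by default; passing a POS hint changes the result. A minimal sketch:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# without a POS hint, the token is treated as a noun and stays unchanged
print(lemmatizer.lemmatize('founded'))
# founded

# with a verb hint ('v'), the correct lemma is resolved
print(lemmatizer.lemmatize('founded', pos='v'))
# found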

Text Syntax

Part-of-Speech Tagging

NLTK also provides different part-of-speech (POS) taggers. With the built-in tagger, the following annotations are produced:

Tag     Meaning
ADJ     adjective
ADP     adposition
ADV     adverb
CONJ    conjunction
DET     determiner, article
NOUN    noun
NUM     numeral
PRT     particle
PRON    pronoun
VERB    verb
.       punctuation marks
X       other

Applying part-of-speech tagging to the first sentence from the Wikipedia article about artificial intelligence yields the following result.

from nltk import pos_tag
from nltk.tokenize import word_tokenize

sent = 'Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding.'

pos_tag(word_tokenize(sent))

# [('Artificial', 'JJ'),
#  ('intelligence', 'NN'),
#  ('was', 'VBD'),
#  ('founded', 'VBN'),
#  ('as', 'IN'),
#  ('an', 'DT'),
#  ('academic', 'JJ'),
#  ('discipline', 'NN'),

To use other NLTK POS taggers, such as Stanford or Brill, external Java libraries need to be downloaded.

Text Semantics

Named Entity Recognition

NLTK includes a pre-trained NER tagger, but several additional packages need to be downloaded first.

import nltk
nltk.download('maxent_ne_chunker')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('words')

The NER tagger consumes a POS-tagged sentence and adds classification labels to the representation. Using it on the sample paragraph yields no results, so the following example takes another sentence from the Wikipedia article in which persons are mentioned.

import nltk
from nltk.tokenize import word_tokenize

# Source: Wikipedia, Artificial Intelligence, https://en.wikipedia.org/wiki/Artificial_intelligence
sentence= '''
In 2011, in a Jeopardy! quiz show exhibition match, IBM's question answering system, Watson, defeated the two greatest Jeopardy! champions, Brad Rutter and Ken Jennings, by a significant margin.
'''

tagged_sentence = nltk.pos_tag(word_tokenize(sentence))
tagged_sentence
# [('In', 'IN'),
#  ('2011', 'CD'),
#  (',', ','),
#  ('in', 'IN'),
#  ('a', 'DT'),
#  ('Jeopardy', 'NN'),

print(nltk.ne_chunk(tagged_sentence))
# (S
#   In/IN
#   2011/CD
#   ,/,
#   in/IN
#   a/DT
#   Jeopardy/NN
#   !/.
#   quiz/NN
#   show/NN
#   exhibition/NN
#   match/NN
#   ,/,
#   (ORGANIZATION IBM/NNP)
#   's/POS
#   question/NN
#   answering/NN
#   system/NN
#   ,/,
#   (PERSON Watson/NNP)

As you can see, both the person Watson and the organization IBM are recognized.
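
ne_chunk returns an nltk.Tree object. If you just want the recognized entities as a flat list, you can walk its subtrees. A minimal sketch, assuming the tagged_sentence from the previous snippet:

import nltk

tree = nltk.ne_chunk(tagged_sentence)

# subtrees carry a label (PERSON, ORGANIZATION, ...), plain (token, tag) tuples do not
entities = [(' '.join(token for token, pos in subtree.leaves()), subtree.label())
            for subtree in tree
            if hasattr(subtree, 'label')]

print(entities)
# e.g. [('IBM', 'ORGANIZATION'), ('Watson', 'PERSON'), ...]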

Document Semantics

Clustering

Three clustering algorithms are supported, see the complete documentation:

  • K-means
  • EM clusterer
  • Group average agglomerative clusterer (GAAC)
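
These clusterers work on numeric feature vectors that you have to build yourself, for example from word counts or TF-IDF values. A minimal K-means sketch on toy two-dimensional vectors (the vectors and the cluster count are illustrative assumptions):

import numpy
from nltk.cluster import KMeansClusterer
from nltk.cluster.util import euclidean_distance

# toy vectors; in practice these would be document feature vectors
vectors = [numpy.array(v) for v in [[1, 1], [1, 2], [8, 8], [9, 8]]]

clusterer = KMeansClusterer(2, euclidean_distance, repeats=5)
assignments = clusterer.cluster(vectors, assign_clusters=True)

print(assignments)
# e.g. [0, 0, 1, 1]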

Classification

The following classifiers are implemented in NLTK, also see the complete documentation:

  • Decision tree
  • Maximum entropy modeling
  • MEGAM maxent optimization
  • Naive Bayes (and variants)

External packages are supported as well, such as TextCat for language identification, the Java library Weka, or the scikit-learn classifiers.
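
All NLTK classifiers share the same input format: a list of (feature_dictionary, label) tuples. The scikit-learn integration mentioned above wraps any scikit-learn estimator behind this interface. A minimal sketch with made-up toy features and labels:

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB

# toy featuresets in NLTK's (feature_dict, label) format
train = [
    ({'contains(ai)': True,  'contains(energy)': False}, 'tech'),
    ({'contains(ai)': False, 'contains(energy)': True},  'energy'),
]

classifier = SklearnClassifier(BernoulliNB()).train(train)

print(classifier.classify({'contains(ai)': True, 'contains(energy)': False}))
# tech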

Additional Features

Datasets

NLTK provides more than 100 built-in corpora, see the complete list. Some examples: the Reuters news articles, the Treebank 2 Wall Street Journal corpus, Twitter messages, or the WordNet lexical database.

Here is an example of how to access the Reuters corpus.

from nltk.corpus import reuters

print(reuters.categories()[:10])
#['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee']

print(reuters.fileids()[:10])
# ['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843']

sample = 'test/14829'
categories = reuters.categories(sample)

print(categories)
# ['crude', 'nat-gas']

content = ""
with reuters.open(sample) as stream:
    content = stream.read()

print(f"Categories #{categories} / file #{sample}")
# Categories #['crude', 'nat-gas'] / file #test/14829

print(f"Content:\n{content}")
# Content:
# JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWARDS
# The Ministry of International Trade and
# Industry (MITI) will revise its long-term energy supply/demand
# outlook by August to meet a forecast downtrend in Japanese
# energy demand, ministry officials said.
#     MITI is expected to lower the projection for primary energy
# supplies in the year 2000 to 550 mln kilolitres (kl) from 600
# mln, they said.
#     The decision follows the emergence of structural changes in
# Japanese industry following the rise in the value of the yen
# and a decline in domestic electric power demand.
#     MITI is planning to work out a revised energy supply/demand
# outlook through deliberations of committee meetings of the
# Agency of Natural Resources and Energy, the officials said.
#     They said MITI will also review the breakdown of energy
# supply sources, including oil, nuclear, coal and natural gas.
#     Nuclear energy provided the bulk of Japan's electric power
# in the fiscal year ended March 31, supplying an estimated 27
# pct on a kilowatt/hour basis, followed by oil (23 pct) and
# liquefied natural gas (21 pct), they noted.
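
Since Reuters is a categorized corpus, you can also go the other way and list all documents for a given category, or access the tokenized content directly. A minimal sketch, assuming the reuters import from above:

# all documents labelled with the 'barley' category
barley_docs = reuters.fileids('barley')
print(len(barley_docs))

# tokenized access to a single document
print(reuters.words('test/14829')[:8])
# e.g. ['JAPAN', 'TO', 'REVISE', 'LONG', '-', 'TERM', 'ENERGY', 'DEMAND']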

Corpus Management

Corpus Readers

NLTK's corpus reader objects provide reading, filtering, decoding, and preprocessing of structured file lists or zip files.

Many different corpus reader objects exist, see the full list. The most common readers are:

  • PlaintextCorpusReader: reads text documents in which paragraphs are separated by blank lines
  • Markdown: processes Markdown files in which the categories are encoded in the file names
  • Tagged: special corpus reader objects that expect an already tagged corpus, such as CoNLL; note that tagged versions already exist for several built-in corpora (see the sketch after this list)
  • Twitter: processes tweets in JSON format
  • XML: processes XML files
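
As an example of the tagged readers, the Penn Treebank sample ships as an already tagged built-in corpus; a minimal sketch:

import nltk
nltk.download('treebank')

from nltk.corpus import treebank

# word/POS-tag pairs are available without running a tagger
print(treebank.tagged_words()[:5])
# e.g. [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS')]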

As a short example, here is a PlaintextCorpusReader that reads *.txt files.

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpus = PlaintextCorpusReader('wikipedia_articles', r'.*\.txt')

print(corpus.fileids())
# ['AI_alignment.txt', 'AI_safety.txt', 'Artificial_intelligence.txt', 'Machine_learning.txt']

print(corpus.sents())
# [['In', 'the', 'field', 'of', 'artificial', 'intelligence', '(', 'AI', '),', 'AI', 'alignment', 'research', 'aims', 'to', 'steer', 'AI', 'systems', 'towards', 'humans', '’', 'intended', 'goals', ',', 'preferences', ',', 'or', 'ethical', 'principles', '.'], ['An', 'AI', 'system', 'is', 'considered', 'aligned', 'if', 'it', 'advances', 'the', 'intended', 'objectives', '.'], ...]

Text Collection

Another utility for accessing structured information from a corpus is the TextCollection class. Instantiated on tokenized text, it provides the following functions:

  • collocations(num, window_size): prints up to num collocations of words that appear together within a window of window_size
  • collocation_list(num, window_size): returns the collocated words as a list of tuples
  • common_contexts(words, num): prints the contexts in which the given words appear
  • concordance(word, width, lines): prints a concordance view for the given word (a single word or a sentence)
  • concordance_list(word, width, lines): returns the concordance as a list of tuples
  • count(word): the absolute number of occurrences of the word
  • tf, idf, tf_idf: term frequency metrics
  • generate: creates random text based on a trigram language model
  • vocab: the frequency distribution of all tokens
  • plot: plots the frequency distribution

Here is an example:

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.text import TextCollection

corpus = PlaintextCorpusReader('wikipedia_articles', r'.*\.txt')
col = TextCollection(corpus.sents())

print(col.count('the'))
# 973

print(col.common_contexts(['intelligence']))
# artificial_( general_( artificial_. artificial_is general_,
# artificial_, artificial_in artificial_". artificial_and "_"
# artificial_was general_and general_. artificial_; artificial_" of_or
# artificial_– artificial_to artificial_: and_.
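
Two more of the functions listed above in action; a minimal sketch, assuming the corpus and col objects from the previous snippet:

# concordance view: each occurrence of the word with its surrounding context
col.concordance('intelligence', width=80, lines=3)

# tf-idf of a term relative to the tokens of one document
doc_tokens = corpus.words('Artificial_intelligence.txt')
print(col.tf_idf('intelligence', doc_tokens))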

Machine Learning Clustering and Classification Models

NLTK provides several clustering and classification algorithms. But before using any algorithm, the features need to be manually designed and extracted from the text.

On the API documentation page about classification, the steps are defined as follows:

  • define the features that are relevant to the ML task
  • implement methods that extract the features from the corpora (for example, word frequencies in a document)
  • create a Python dictionary object containing the individual (feature_name, label) tuples and pass it to the training algorithm

Let's illustrate this with the example from the NLTK handbook for building a text classifier.

First, we build a vocabulary of all words:

import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpus = PlaintextCorpusReader('wikipedia_articles', r'.*\.txt')

vocab = nltk.FreqDist(w.lower() for w in corpus.words())
#  FreqDist({'the': 65590, ',': 63310, '.': 52247, 'of': 39000, 'and': 30868, 'a': 30130, 'to': 27881, 'in': 24501, '-': 19867, '(': 18243, ...})

word_features = list(vocab)
# ['the', ':', ...]

Second, we define a method that returns a one-hot encoded word vector expressing whether a word is present in the document or not. The resulting feature vector must contain boolean values so that it can be used for the classification task.

def document_features(document):
    document_words = set(corpus.words(document))
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

f = document_features('Artificial_intelligence.txt')
# {'contains(the)': True,
#  'contains(,)': True,
#  'contains(.)': True,

Third, we select a classification algorithm and pass the featurized documents to it.

featuresets = [(document_features(d), d) for d in corpus.fileids()]

# with only four documents in this corpus, a large held-out split is not possible;
# use most files for training and keep one for testing
train_set, test_set = featuresets[1:], featuresets[:1]
classifier = nltk.NaiveBayesClassifier.train(train_set)
# <nltk.classify.naivebayes.NaiveBayesClassifier at 0x185ec5dd0>
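
To use the trained model, the classifier can predict a label for a new feature dictionary, and nltk.classify.accuracy evaluates it against a labelled test set. A minimal sketch, assuming the objects from the previous snippets:

import nltk

# predict the label for one document's feature vector
print(classifier.classify(document_features('AI_safety.txt')))

# accuracy on the held-out featuresets
print(nltk.classify.accuracy(classifier, test_set))

# features that contribute most to the decision
classifier.show_most_informative_features(5)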

Summary

NLTK is a versatile library that supports several NLP tasks. For the core tasks of tokenization, stemming/lemmatization, and part-of-speech tagging, built-in methods as well as methods from scientific papers are included. For managing a corpus of documents, NLTK handles text, Markdown, XML, and other formats, and provides an API to fetch files, categories, sentences, and words. Especially useful is the TextCollection class, which enables the collection of word collocations and the computation of term frequencies. Finally, NLTK also offers clustering and classification algorithms such as KMeans, decision trees, or naive Bayes.