Python NLP Library: spaCy

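spaCy is a sophisticated NLP library that ships with trained models for a wide range of NLP tasks. From tokenization to part-of-speech tagging to entity recognition, spaCy produces well-designed Python data structures and powerful visualizations. On top of that, different language models can be loaded and fine-tuned to fit NLP tasks in specific domains. Finally, spaCy provides a powerful pipeline object that makes it easy to mix built-in and custom tokenizers, parsers, taggers, and other components into a language model that supports all required NLP tasks.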

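This article introduces spaCy. You will learn how to install the library, load models, apply text processing and text semantics tasks, and finally how to customize a spaCy model.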

The technical context of this article is Python v3.11 and spaCy v3.5.3. All examples should work with newer versions too.

This article originally appeared on my blog admantium.com.

spaCy Library Installation

The spaCy library can be installed via pip:

python3 -m pip install spacy

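All NLP tasks require a model to be loaded first. spaCy provides models that are built on different corpora and for different languages. See the full list of models. In general, models can be distinguished by the size of their corpus, which leads to different results during NLP tasks, and by the technology used to build them, which is either an undisclosed internal format or a transformer-based model such as BERT.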

A specific model must be downloaded before it can be loaded. Use the following command:

python -m spacy download en_core_web_lg
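Once the package is downloaded, it is loaded by name. The following is a minimal sketch, assuming a script that should still run when the large model is missing. The fallback to `spacy.blank("en")`, a pipeline that contains only a tokenizer, is an illustrative choice and not something this article's setup requires.

```python
import spacy

MODEL_NAME = "en_core_web_lg"  # the model downloaded with the command above

# is_package() reports whether the model is installed as a Python package.
if spacy.util.is_package(MODEL_NAME):
    nlp = spacy.load(MODEL_NAME)
else:
    # Fallback for illustration only: an empty English pipeline with just a tokenizer.
    nlp = spacy.blank("en")

print(nlp.lang, nlp.pipe_names)
```

With the blank fallback, tokenization works, but tagging, parsing, and entity recognition are not available.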

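NLP Tasks

spaCy supports the following tasks:

Text processing
- Tokenization
- Lemmatization
Text syntax
- Part-of-speech tagging
Text semantics
- Dependency parsing
- Named entity recognition
Document semantics
- Classification

In addition, spaCy supports the following features:

Corpus management
Word vectors
Custom NLP pipelines
Model training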

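Text Processing

When one of the pretrained language models is used, the text processing essentials are applied automatically. Technically, text processing is organized around a configurable pipeline object, an abstraction similar to the scikit-learn Pipeline object. Processing always starts with tokenization and then adds further data structures that enrich the information of the parsed text. All of these tasks can be customized, for example by swapping the tagger component. The following descriptions focus only on the built-in functionality of the pretrained models.

Tokenization

Tokenization is the first step, and it is straightforward to apply: call the loaded model on a text, and the tokens are available.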

import spacy
nlp = spacy.load('en_core_web_lg')

# Source: Wikipedia, Artificial Intelligence, https://en.wikipedia.org/wiki/Artificial_intelligence
paragraph = '''Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.'''

doc = nlp(paragraph)
tokens = [token for token in doc]

print(tokens)
# [Artificial, intelligence, was, founded, as, an, academic, discipline
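Each token is an object with attributes, not a plain string. The sketch below prints a few of them. It assumes a blank English pipeline (tokenizer only) and a made-up sentence so that it runs without a model download.

```python
import spacy

# A blank pipeline contains only the rule-based tokenizer.
nlp = spacy.blank("en")
text = "AI research isn't new: it started in 1956."
doc = nlp(text)

for token in doc:
    # i is the token index, idx the character offset in the original text.
    print(f"{token.i:<3}{token.idx:<4}{token.text:<10}"
          f"{token.is_alpha!s:<6}{token.is_punct!s:<6}{token.like_num!s}")

# Tokenization is non-destructive: token text plus trailing whitespace
# rebuilds the input exactly.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```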

Lemmatization

Lemmas are generated automatically; they are properties of the tokens.

doc = nlp(paragraph)

lemmas = [token.lemma_ for token in doc]

print(lemmas)
# ['artificial', 'intelligence', 'be', 'found', 'as', 'an', 'academic',

The configurable lemmatizer component works either with rules or with a lookup table. To see which mode the built-in model uses, execute the following code:

lemmatizer = nlp.get_pipe("lemmatizer")

print(lemmatizer.mode)
# 'rule'

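There is no stemming in spaCy.

Text Syntax

Part-of-Speech Tagging

In spaCy, part-of-speech tags come in two flavors. The POS attribute is the coarse category a token belongs to, expressed as a universal POS tag. The TAG attribute is a more fine-grained, language-specific category; the English tags in the example output further down (JJ, NN, VBD) match the Penn Treebank tag set.

The following table lists the POS classes.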

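Tag	Description
ADJ	adjective
ADP	adposition
ADV	adverb
AUX	auxiliary
CCONJ	coordinating conjunction
DET	determiner
INTJ	interjection
NOUN	noun
NUM	numeral
PART	particle
PRON	pronoun
PROPN	proper noun
PUNCT	punctuation
SCONJ	subordinating conjunction
SYM	symbol
VERB	verb
X	other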

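For the TAG classes, I could not find a definitive explanation in the documentation. However, this Stackoverflow thread hints that TAG refers to the classes used in academic papers about dependency parsing.

To see the POS and TAG associated with each token, run the following code: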

doc = nlp(paragraph)
for token in doc:
    print(f'{token.text:<20}{token.pos_:>5}{token.tag_:>5}')

#Artificial            ADJ   JJ
#intelligence         NOUN   NN
#was                   AUX  VBD
#founded              VERB  VBN
#as                    ADP   IN
#an                    DET   DT
#academic              ADJ   JJ
#discipline           NOUN   NN
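One way to look up what an individual label means is `spacy.explain`, which consults spaCy's built-in glossary. A short sketch using the labels from the output above; it needs no model:

```python
import spacy

labels = ["ADJ", "AUX", "JJ", "NN", "VBD", "VBN"]

# spacy.explain() returns a short description for labels found in the glossary.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<6}{description}")
```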

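Text Semantics

Dependency Parsing

Dependency parsing examines the contextual relationships between words and groups of words. This step greatly enriches the machine-readable information that can be derived from a text.

spaCy provides both a textual and a graphical representation of the dependencies.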

doc = nlp(paragraph)
for token in doc:
    print(f'{token.text:<20}{token.dep_:<15}{token.head.text:<20}')

# Artificial          amod           intelligence
# intelligence        nsubjpass      founded
# was                 auxpass        founded
# founded             ROOT           founded
# as                  prep           founded
# an                  det            discipline
# academic            amod           discipline
# discipline          pobj           as
# in                  prep           founded
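The head relation shown above forms a tree that can be navigated in code. The following sketch assumes a hand-annotated Doc built from the first four rows of that output (so no model is required) and then walks the tree via `head`, `children`, and `subtree`.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated parse copied from the output above.
# heads holds the absolute index of each token's head; the root points to itself.
words = ["Artificial", "intelligence", "was", "founded"]
heads = [1, 3, 3, 3]
deps = ["amod", "nsubjpass", "auxpass", "ROOT"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# The root is the only token that is its own head.
root = next(token for token in doc if token.head.i == token.i)
children = [child.text for child in root.children]
subject_phrase = [token.text for token in doc[1].subtree]

print(root.text)
print(children)
print(subject_phrase)
```

With a real model, the same attributes are available on every token of the parsed paragraph.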

To render these relationships graphically, run the following commands.

from spacy import displacy

nlp = spacy.load("en_core_web_lg")

displacy.serve(doc, style="dep", options={"fine_grained": True, "compact": True})

This starts a local web server and renders the dependency structure as a graphic in the browser.
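When the markup itself is needed, for example to save it as a file, `displacy.render` returns it as a string instead of serving it. A minimal sketch, assuming a hand-annotated four-token Doc in place of the parsed paragraph so that it runs without a model:

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated stand-in for a parsed document (assumption for illustration).
doc = Doc(
    nlp.vocab,
    words=["Artificial", "intelligence", "was", "founded"],
    heads=[1, 3, 3, 3],
    deps=["amod", "nsubjpass", "auxpass", "ROOT"],
)

# With jupyter=False, render() returns the SVG markup as a string.
svg = displacy.render(doc, style="dep", jupyter=False, options={"compact": True})
print(svg[:60])

# To keep the image, write the string to a file of your choice,
# for example "dependencies.svg".
```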

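Note that the capabilities of this processing step are limited by the corpus the language model was originally trained on. spaCy provides two ways to enhance the parsing. First, a model can be trained from scratch. Second, recent spaCy releases provide transformer models, and these can be fine-tuned to work with more domain-specific corpora.

Named Entity Recognition

Entities in a text are persons, organizations, or objects, and spaCy can detect them in a processed document.

Recognized entities are part of the parsed document and can be accessed via the ents attribute.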

doc = nlp(paragraph)
for token in doc.ents:
    print(f'{token.text:<40}{token.label_:<15}')

# 1956                                    DATE
# the years                               DATE
# AI                                      ORG
# the first decades of the 21st century   DATE

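Alternatively, they can be visualized.

As with dependency parsing, the result of this step depends heavily on the training data. For example, when applied to a book, the model might not recognize the names of its characters. To help in this case, a custom KnowledgeBase object can be created, which is used to determine possible candidates for a named entity during text processing.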
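A lighter-weight option than a KnowledgeBase, and a different technique from the one named above, is spaCy's rule-based `entity_ruler` component. The sketch below assumes a blank English pipeline and two made-up character names as patterns.

```python
import spacy

nlp = spacy.blank("en")

# entity_ruler is a built-in, rule-based component.
# The patterns are hypothetical examples, not taken from this article.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Elizabeth Bennet"},
    {"label": "PERSON", "pattern": "Fitzwilliam Darcy"},
])

doc = nlp("Elizabeth Bennet meets Fitzwilliam Darcy at a ball.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

In a pipeline that also contains a statistical `ner` component, the position of the ruler (before or after `ner`) decides which of the two takes precedence on overlapping spans.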

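Document Semantics

Classification

spaCy's pretrained pipelines do not include a ready-to-use classification or categorization step, but other open-source projects extend spaCy to perform such machine learning tasks.

To show just one example: the BERTopic extension is an out-of-the-box project for grouping documents into topics, and it even provides visual representations.

The project is installed by running pip install "bertopic[spacy]". Applying it to a set of 200 articles gives the following result: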

import numpy as np
import pandas as pd
from bertopic import BERTopic

X = pd.read_pickle('ml29_01.pkl')
docs = X['preprocessed'].values

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info())
# Topic Count Name
# -1 30 -1_artificial_intelligence_machine
# 1 22 49_space_lunar_mission

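Additional Features

Corpus Management

spaCy defines a Corpus object, but it is used for reading JSON or plaintext files when training custom spaCy language models.

The only property covering all processed text that I could find in the documentation is vocab, a lookup table of all words encountered in the processed text.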
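How the vocab lookup behaves can be sketched in a few lines. The example assumes a blank English pipeline and a made-up sentence; strings are stored once and referenced by a hash value, and each word has a context-independent entry called a lexeme.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I love coffee")

# The string store maps text to a hash and the hash back to text.
coffee_hash = nlp.vocab.strings["coffee"]
coffee_text = nlp.vocab.strings[coffee_hash]

# A lexeme is the vocabulary entry shared by all tokens with the same text.
lexeme = nlp.vocab["coffee"]
print(coffee_hash, coffee_text, lexeme.text, lexeme.is_alpha)
```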

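Text Vectors

All built-in models of category md or lg include word vectors. In spaCy, individual tokens, spans (user-defined slices of a document), or complete documents can be represented as vectors.

Here is an example: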

nlp = spacy.load("en_core_web_lg")
vectors = [(token.text, token.vector_norm) for token in doc if token.has_vector]

print(vectors)
# [('Artificial', 8.92717), ('intelligence', 6.9436903), ('was', 10.1967945), ('founded', 8.210244), ('as', 7.7554812), ('an', 8.042635), ('academic', 8.340115), ('discipline', 6.620854),

span = doc[0:10]

print(span)
# Artificial intelligence was founded as an academic discipline in 1956

print(span.vector_norm)
# 3.0066288

print(doc.vector_norm)
# 2.037331438809547
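The span and document norms above are much smaller than the individual token norms. By default, spaCy computes a span or document vector as the average of its token vectors, and vector_norm is the L2 norm of that vector, so vectors pointing in different directions partly cancel out. A pure-Python sketch with made-up three-dimensional vectors:

```python
import math

def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))

def average(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Two toy "token vectors" (made-up values, not taken from a model).
token_vectors = [[3.0, 4.0, 0.0], [-3.0, 4.0, 0.0]]

token_norms = [l2_norm(v) for v in token_vectors]
span_vector = average(token_vectors)
span_norm = l2_norm(span_vector)

print(token_norms, span_vector, span_norm)
```

Both toy tokens have length 5, but their average has length 4 because the first components cancel.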

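The documentation does not disclose which specific vectorization method is used. The non-normalized token vectors have 300 dimensions, which might hint that FastText is being used: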

token = doc[0]

print(token.vector.dtype, token.vector.shape)
# float32 (300,)

print((token.text, token.vector))
#'Artificial',
# array([-1.6952  , -1.5868  ,  2.6415  ,  1.4848  ,  2.3921  , -1.8911  ,
#         1.0618  ,  1.4815  , -2.4829  , -0.6737  ,  4.7181  ,  0.92018 ,
#        -3.1759  , -1.7126  ,  1.8738  ,  3.9971  ,  4.8884  ,  1.2651  ,
#         0.067348, -2.0842  , -0.91348 ,  2.5103  , -2.8926  ,  0.92028 ,
#         0.24271 ,  0.65422 ,  0.98157 , -2.7082  ,  0.055832,  2.2011  ,
#        -1.8091  ,  0.10762 ,  0.58432 ,  0.18175 ,  0.8636  , -2.9986  ,
#         4.1576  ,  0.69078 , -1.641   , -0.9626  ,  2.6582  ,  1.2442  ,
#        -1.7863  ,  2.621   , -5.8022  ,  3.4996  ,  2.2065  , -0.6505  ,
#         0.87368 , -4.4462  , -0.47228 ,  1.7362  , -2.1957  , -1.4855  ,
#        -3.2305  ,  4.9904  , -0.99718 ,  0.52584 ,  1.0741  , -0.53208 ,
#         3.2444  ,  1.8493  ,  0.22784 ,  0.67526 ,  2.5435  , -0.54488 ,
#        -1.3659  , -4.7399  ,  1.8076  , -1.4879  , -1.1604  ,  0.82441 ,

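Finally, spaCy offers the option to supply user-defined word vectors to a pipeline.

Custom NLP Pipelines

A reference pipeline is included in the pretrained language models. It consists of the following components: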

Tokenizer
Tagger
Dependency parser
Entity recognizer
Lemmatizer

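Pipeline steps need to be defined in a project-specific configuration file. The complete pipeline definition for all of these steps looks as follows: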

[nlp]
pipeline = ["tok2vec", "tagger", "parser", "ner", "lemmatizer"]

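This pipeline can be extended. For example, to add a text classification step, simply append textcat to the pipeline declaration and provide an implementation.

Several pipeline components are available, as well as the option to add custom components that interact with spaCy and enrich the document representation that spaCy produces.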
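A custom component is a function that receives a Doc and returns it. The sketch below registers one and adds it to a blank pipeline. The component name `token_counter` and the extension attribute `n_tokens` are hypothetical names chosen for illustration.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register a custom attribute on Doc; force=True overwrites an existing
# extension of the same name.
Doc.set_extension("n_tokens", default=0, force=True)

@Language.component("token_counter")
def token_counter(doc):
    # Enrich the document with a simple, derived piece of information.
    doc._.n_tokens = len(doc)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("token_counter", last=True)

doc = nlp("spaCy pipelines are easy to extend.")
print(nlp.pipe_names, doc._.n_tokens)
```

In a pretrained pipeline, the same `add_pipe` call would place the component after the built-in steps, so it could also read tags, dependencies, or entities.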

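In addition, external custom models or model components can be incorporated into spaCy, for example to swap the word vector representation or to use any transformer-based model built on PyTorch or the external transformers library.

Model Training

Based on the pipeline object and an extensive configuration file, new models can be trained. The Quickstart guide includes a UI widget in which the desired pipeline steps and local file paths can be put together interactively. Training data needs to take the form of Python data objects that contain the attributes to be trained. Here is an example from the official documentation for training named entity recognition.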

# Source: Spacy, Training Pipelines & Models, https://spacy.io/usage/training#training-data

nlp = spacy.blank("en")
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
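To be usable for training, such tuples are turned into Doc objects with entity annotations and collected in a DocBin. The following sketch follows the pattern from the spaCy training documentation; the variable names and the output path in the final comment are illustrative assumptions.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
training_data = [
    ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]

doc_bin = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    spans = []
    for start, end, label in annotations:
        # char_span() returns None if the character offsets do not align
        # with token boundaries, so such annotations are skipped here.
        span = doc.char_span(start, end, label=label)
        if span is not None:
            spans.append(span)
    doc.ents = spans
    doc_bin.add(doc)

print(len(doc_bin), [(ent.text, ent.label_) for ent in doc.ents])

# doc_bin.to_disk("./train.spacy") would write the binary training file;
# the path is only an example.
```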

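For the training data itself, several converters are supported, for example for JSON, Universal Dependencies, or named entity recognition formats such as IOB or BILUO.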
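The relationship between character offsets and the BILUO and IOB schemes can be sketched with `offsets_to_biluo_tags` from spacy.training. The helper `biluo_to_iob_tag` below is a hypothetical function written for this illustration, not a spaCy API.

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("Tokyo Tower is 333m tall.")

# One tag per token: B(egin), I(n), L(ast), U(nit), O(utside).
tags = offsets_to_biluo_tags(doc, [(0, 11, "BUILDING")])

def biluo_to_iob_tag(tag):
    """Map a BILUO tag to IOB: L- becomes I-, U- becomes B-."""
    if tag.startswith("L-"):
        return "I-" + tag[2:]
    if tag.startswith("U-"):
        return "B-" + tag[2:]
    return tag

iob_tags = [biluo_to_iob_tag(tag) for tag in tags]

for token, biluo, iob in zip(doc, tags, iob_tags):
    print(f"{token.text:<8}{biluo:<14}{iob}")
```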

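To facilitate training, the Corpus object can be used to iterate over JSON or plaintext documents.

Summary

spaCy is a state-of-the-art NLP library. By using one of the pretrained models, all basic text processing tasks (tokenization, lemmatization, part-of-speech tagging) and text semantics tasks (dependency parsing, named entity recognition) are applied. All models are created via the pipeline object, and this object can be used to customize any of these steps, for example by providing a custom tokenizer or swapping the word vector component. In addition, you can use other transformer-based models and extend the pipeline with a text classification task.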