PySpark: a brief analysis of the most common words in Dracula, by Bram Stoker

Note: this article is also available in Portuguese.

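Dracula, the iconic novel written by Bram Stoker in 1897, is a landmark of Gothic literature that has stirred the emotions of readers around the world. Today, to introduce some new Spark concepts and features, we will build a short notebook that analyzes the most common words in this classic book.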

To do so, we will write a notebook on Google Colab, a cloud service built by Google to encourage research in machine learning and artificial intelligence.

This notebook is also available on my GitHub.

The novel was obtained from Project Gutenberg, a digital library that collects public-domain books from around the world.

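Before we start

Before we start, we need to install the PySpark library.

PySpark is the official Python API for Apache Spark. We will use it for our data analysis.

So, create a new code cell in Colab and add the following line: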

!pip install pyspark

Step one: starting Apache Spark

Once the installation finishes, we need to start Apache Spark. To do so, create a new code cell and add the following code block:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("The top most common words in Dracula, by Bram Stoker")
         .getOrCreate()
         )

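Step two: downloading and reading

In this step, we will download the novel from Project Gutenberg and then load it with PySpark.

We will use the wget tool for this, passing it the URL of the book and saving the file in the local directory under the name Dracula - Bram Stoker.txt.

Again, create a new code cell in Colab and add the following line:

!wget https://www.gutenberg.org/cache/epub/345/pg345.txt -O "Dracula - Bram Stoker.txt"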

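Step three: downloading the stop words

In this section, we will download a list of stop words used in English. Stop words usually include prepositions, particles, interjections, conjunctions, adverbs, pronouns, introductory words, the digits 0 to 9, other frequently used function words and independent parts of speech, symbols, and punctuation. More recently, such lists have also been extended with character sequences that are common on the Internet, such as www, com, and http.

The list was obtained from CountWordsFree, a website that compiles stop words for many languages from around the world.

Let's get to work! Create a new code cell in Colab and add the following line: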

!wget https://countwordsfree.com/stopwords/english/txt -O "stop_words_english.txt"

Next, let's load the book with Spark. Create a new code cell and add the following code block:

book = spark.read.text("Dracula - Bram Stoker.txt")

Let's also load the stop words. They will be stored in a list, in the stopwords variable.

with open("stop_words_english.txt", "r") as f:
    text = f.read()
    stopwords = text.splitlines()

len(stopwords), stopwords[:15]

Output

(851,
 ['able',
  'about',
  'above',
  'abroad',
  'according',
  'accordingly',
  'across',
  'actually',
  'adj',
  'after',
  'afterwards',
  'again',
  'against',
  'ago',
  'ahead'])

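Step four: extracting the words

Once the book is loaded, we need to extract the words into a DataFrame column.

To do this, we use the split function, which breaks each line at the spaces between words. The result is a list of words for each line.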

from pyspark.sql.functions import split

lines = book.select(split(book.value, " ").alias("line"))
lines.show(5)

Output

+--------------------+
|                line|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[most, other, par...|
|[whatsoever., You...|
+--------------------+
only showing top 5 rows
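
Splitting on a single space has two side effects worth keeping in mind: an empty line yields a list holding one empty string, and consecutive spaces yield empty strings between words. The plain-Python sketch below mimics that behavior on a made-up line (the sample text is an assumption, not taken from the book) and compares it with splitting on runs of whitespace, which is an alternative and not the approach used in this article.

```python
import re

# Made-up sample line with a double space, for illustration only.
line = "whatsoever.  You may copy it,"

# Splitting on a single space, as in the step above: the double space
# produces an empty string, and punctuation stays attached to the words.
by_space = line.split(" ")

# Alternative (not the article's approach): split on runs of whitespace.
by_whitespace = re.split(r"\s+", line.strip())

# An empty line still produces one (empty) element.
empty_line = "".split(" ")

print(by_space)
print(by_whitespace)
print(empty_line)
```

PySpark's split function treats its second argument as a regular expression, so a pattern such as "\\s+" could be passed there as well. The later steps in this article remove the empty strings anyway, so the single-space split works fine here.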

Step five: exploding the list of words

Now, let's turn each list of words into rows of a DataFrame column, using the explode function.

from pyspark.sql.functions import explode, col

words = lines.select(explode(col("line")).alias("word"))
words.show(15)

Output

+---------+
|     word|
+---------+
|      The|
|  Project|
|Gutenberg|
|    eBook|
|       of|
| Dracula,|
|       by|
|     Bram|
|   Stoker|
|         |
|     This|
|    eBook|
|       is|
|      for|
|      the|
+---------+
only showing top 15 rows
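
For readers who have not used explode before, it is conceptually close to flattening a list of lists: each element of each list becomes its own row. The sketch below shows that idea in plain Python with a few made-up rows; it is only an analogy, since Spark performs this in a distributed way on the DataFrame.

```python
from itertools import chain

# Made-up rows shaped like the "line" column: one list of words per line.
lines = [["The", "Project", "Gutenberg"], [""], ["This", "eBook"]]

# Flatten the lists so that every word becomes its own entry,
# which is roughly what explode does for each row.
words = list(chain.from_iterable(lines))

print(words)
```

Note that the empty string from the blank line survives the flattening, which matches the blank row in the output above.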

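Step six: lowercasing the words

This is a simple step. We don't want the same word to be counted as different words because of capital letters, so we use the lower function to convert the words to lowercase.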

from pyspark.sql.functions import lower

words_lower = words.select(lower(col("word")).alias("word_lower"))
words_lower.show()

Output

+----------+
|word_lower|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|  dracula,|
|        by|
|      bram|
|    stoker|
|          |
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
|  anywhere|
|        in|
+----------+
only showing top 20 rows

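Step seven: removing punctuation

Likewise, we don't want the same word to be treated as different words because of trailing punctuation, so we need to remove it.

We will use the regexp_extract function for this, which uses a regular expression to extract the word from each string.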

from pyspark.sql.functions import regexp_extract

words_clean = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z]+", 0).alias("word")
)

words_clean.show()

Output

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|  dracula|
|       by|
|     bram|
|   stoker|
|         |
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
|       in|
+---------+
only showing top 20 rows
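
With group index 0, regexp_extract returns the first match of the whole pattern, and an empty string when nothing matches. The sketch below reproduces that behavior in plain Python with re.search, so it is easy to see what happens to tokens containing apostrophes, hyphens, or digits. The helper name first_letter_run and the sample tokens are hypothetical and exist only for this illustration.

```python
import re


def first_letter_run(token: str) -> str:
    """Return the first run of a-z letters in token, or '' if there is none."""
    match = re.search(r"[a-z]+", token)
    return match.group(0) if match else ""


# Hypothetical lowercase tokens, chosen to show the edge cases.
samples = ["dracula,", "don't", "to-morrow", "1897", ""]
cleaned = [first_letter_run(s) for s in samples]

print(cleaned)
```

Only the first run of letters is kept, so a contraction or a hyphenated word is cut at the first non-letter character. This is a simplification that may slightly distort some counts, and tokens without letters become empty strings, which is why the next step filters them out.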

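Step eight: removing empty values

As you can see, however, some of the values are still empty.

We need to remove them so that these blank values are not analyzed.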

words_nonull = words_clean.filter(col("word") != "")
words_nonull.show()

Output

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|  dracula|
|       by|
|     bram|
|   stoker|
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
|       in|
|      the|
+---------+
only showing top 20 rows

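Step nine: removing the stop words

We are almost there! The last step is to remove the stop words so that they are left out of the analysis.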

words_without_stopwords = words_nonull.filter(
    ~words_nonull.word.isin(stopwords))

words_count_before_removing = words_nonull.count()
words_count_after_removing = words_without_stopwords.count()

words_count_before_removing, words_count_after_removing

Output

(163399, 50222)
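
To make these two numbers easier to read, the sketch below does two things in plain Python: it shows the same kind of membership filter on a tiny made-up sample, and it computes the share of words that the stop-word filter removed in the run above. The mini stop-word set and the token list are assumptions for illustration; only the two counts come from the output.

```python
# Hypothetical mini stop-word set and tokens, for illustration only.
stopwords = {"the", "of", "is", "for"}
tokens = ["the", "project", "gutenberg", "ebook", "of", "dracula", "is", "for", "the"]

# Plain-Python counterpart of filter(~col.isin(stopwords)).
kept = [t for t in tokens if t not in stopwords]

# Counts reported in the output above.
before, after = 163399, 50222
removed = before - after
share_removed = removed / before

print(kept)
print(f"{removed} of {before} words removed ({share_removed:.1%})")
```

Roughly two thirds of the tokens are stop words, which is why leaving them in would crowd out the words that are specific to the novel.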

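Step ten: finally, analyzing the most common words in Dracula!

Finally, our data is fully cleaned, so we can now analyze the most common words in the book.

First, we group the words and then use an aggregate function to count them.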

words_count = (words_without_stopwords.groupby("word")
               .count()
               .orderBy("count", ascending=False)
               )

After that, we show the 20 most common words. This number can be changed through the rank variable.

rank = 20
words_count.show(rank)

Output

+--------+-----+
|    word|count|
+--------+-----+
|    time|  381|
| helsing|  323|
|     van|  322|
|    lucy|  297|
|    good|  256|
|     man|  255|
|    mina|  240|
|    dear|  224|
|   night|  224|
|    hand|  209|
|    room|  207|
|    face|  206|
|jonathan|  206|
|   count|  197|
|    door|  197|
|   sleep|  192|
|    poor|  191|
|    eyes|  188|
|    work|  188|
|      dr|  187|
+--------+-----+
only showing top 20 rows
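
The group, count, and sort pattern used above has a close counterpart in the Python standard library, which can be handy for checking the logic on a small sample before running it in Spark. The sketch below uses collections.Counter on a made-up word list; the words and the value of rank are assumptions for illustration, not data from the book.

```python
from collections import Counter

# Made-up cleaned words, for illustration only.
words = ["lucy", "night", "lucy", "van", "helsing", "van", "lucy"]

# Counter groups equal words and counts them, like groupby("word").count().
counts = Counter(words)

# most_common sorts by count in descending order, like orderBy(..., ascending=False),
# and keeps only the first `rank` entries, like show(rank).
rank = 2
top = counts.most_common(rank)

print(top)
```

Unlike Spark, Counter keeps everything in memory on a single machine, so it suits small samples and quick checks and is not a replacement for the DataFrame version.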

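Conclusion

That's all, folks! In this article, we analyzed the most common words in Dracula, by Bram Stoker. To do so, we cleaned up the words by removing punctuation, converting uppercase letters to lowercase, and removing stop words.

I hope you enjoyed it. Keep your stakes sharp, beware of the shadows that walk at night, and see you next time.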

Bibliography

Rioux, Jonathan. Data Analysis with Python and PySpark.

Stoker, Bram. Dracula.