Creating a Web Scraper with Beautiful Soup - DEV365 Developer Community

Hey, what's up?

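I recently had to build a simple web scraper (preferably a quick one) and went back to the data scraping tools I used to rely on, like Selenium and Scrapy, and well... they are not very simple.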

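If you have used Selenium, you know how awkward and impractical it is to install a web driver, and running a bot inside a browser is not very elegant, not to mention the performance.
Scrapy really is a very powerful web scraping tool, but it is not very practical for a job like this: just look at how many files it generates.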

That's when I finally came across Beautiful Soup.

Beautiful Soup

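Beautiful Soup is a Python library for parsing HTML and XML documents. What it does is convert the content of the document into a parse tree, so that through the lib's methods you can search, modify, and extract information from the HTML code.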
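
As a minimal sketch of that idea, the snippet below parses a small HTML string and then searches, extracts, and modifies it through the tree. The HTML is invented here just for illustration and does not come from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for this example
html = """
<html>
  <body>
    <h1 id="title">Hello</h1>
    <p class="intro">First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

# Build the parse tree with Python's built-in HTML parser
soup = BeautifulSoup(html, "html.parser")

# Search: the first <h1> and every <p>
title = soup.find("h1")
paragraphs = soup.find_all("p")

# Extract: text content and attribute values
title_text = title.get_text()
paragraph_texts = [p.get_text() for p in paragraphs]
intro_class = soup.find("p", class_="intro")["class"]

# Modify: replace the heading text inside the tree
title.string = "Hello, soup!"
new_title_text = soup.find("h1").get_text()

print(title_text)
print(paragraph_texts)
print(new_title_text)
```

Note that `class` is a multi-valued attribute, so Beautiful Soup returns it as a list rather than a plain string.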


To use this lib we need pip, Python's package manager. Run the following command in your nearest terminal:

pip install beautifulsoup4 requests
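
If you want to confirm the install worked, a quick check like the one below may help. It is only a sketch: it imports the package and prints the version string that `bs4` exposes, and the variable name `version` is just an illustrative choice.

```python
import bs4

# bs4 exposes its version string as a module-level attribute
version = bs4.__version__
print("Beautiful Soup version:", version)
```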

That's it! No giant project with 300 files. Let's code!

Extracting data from Quotes to Scrape

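For those who don't know it, Quotes to Scrape is a site of quotes from famous people, built precisely for practicing scraping. It doesn't really give you the kind of work that more complex sites do, since those may try to block automation, but it is a great first exercise for getting started in this world.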

The goal here is to go through all the pages (10 in total) and collect every quote by Albert Einstein.
I won't keep you in suspense, so here it is:

from bs4 import BeautifulSoup
import requests

# URL of the page we want to scrape
url = "https://quotes.toscrape.com/page/"

initial_page = 1
end_page = 10

author = "Albert Einstein"

quotes = []

# Loop through the pages (range() excludes its stop value, so add 1 to include the last page)
for page in range(initial_page, end_page + 1):

    # Get the HTML content
    response = requests.get(url + str(page))

    # Create a BeautifulSoup object
    soup = BeautifulSoup(response.text, "html.parser")

    # Get the quotes
    page_quotes = soup.find_all("div", class_="quote")

    # Verify if the author is in the quote and save it
    for quote in page_quotes:
        if quote.find("small", class_="author").text == author:
            quote_text = quote.find("span", class_="text").text

            quotes.append(quote_text)
            print("Quote found: " + quote_text)


print("Number of quotes: " + str(len(quotes)))
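
One possible way to try the filtering step without hitting the network is to pull it into a small function and feed it a hand-written HTML string. This is only a sketch: the function name `quotes_by_author` and the sample HTML are hypothetical, written to mirror the selectors used in the script above, not copied from the site.

```python
from bs4 import BeautifulSoup


def quotes_by_author(html, author):
    """Return the text of every quote in `html` attributed to `author`."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for quote in soup.find_all("div", class_="quote"):
        author_tag = quote.find("small", class_="author")
        text_tag = quote.find("span", class_="text")
        # Skip blocks missing either element instead of raising AttributeError
        if author_tag is None or text_tag is None:
            continue
        if author_tag.get_text(strip=True) == author:
            found.append(text_tag.get_text(strip=True))
    return found


# Hypothetical sample, written only to mirror the selectors used in the script
sample_html = """
<div class="quote">
  <span class="text">Quote A</span>
  <small class="author">Albert Einstein</small>
</div>
<div class="quote">
  <span class="text">Quote B</span>
  <small class="author">Jane Austen</small>
</div>
<div class="quote">
  <span class="text">Quote C</span>
</div>
"""

result = quotes_by_author(sample_html, "Albert Einstein")
print(result)
```

Inside the page loop, the same function could be called as `quotes.extend(quotes_by_author(response.text, author))`.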