Introducing the Bose Framework: The Swiss Army Knife for Bot Developers

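Bot development is hard.

Bot detectors such as Cloudflare stand ready to defend websites against bots. Configuring Selenium with ChromeOptions to specify the driver path, profile, user agent, and window size is cumbersome, and a nightmare on Windows. Debugging bot crashes through logs is hard. How do you solve these pain points without sacrificing speed or ease of development?

Enter Bose. Bose is the first bot development framework in the developer community designed specifically to give bot developers the best possible developer experience. It is powered by Selenium and provides a range of features that streamline bot development. To the best of our knowledge, Bose is the first bot development framework of its kind.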

Getting Started

Clone the starter template:

git clone https://github.com/omkarcloud/bose-starter my-bose-project

Then change into that directory, install the dependencies, and start the project:

cd my-bose-project
python -m pip install -r requirements.txt
python main.py

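The first run will take some time because it downloads the Chrome driver executable; subsequent runs will be fast.

Core Features

Adds powerful methods that make working with Selenium much easier.
Follows best practices to avoid bot detection by Cloudflare and PerimeterX.
Saves the HTML, a screenshot, and the run details of each task run to enable easy debugging.
Provides utility components for writing scraped data to JSON, CSV, and Excel files.
Automatically downloads and initializes the correct Chrome driver.
Fast and developer friendly.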

Usage

Say you want to start scraping a website. With bare Selenium, you have to write the code that opens and closes the driver yourself:

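from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver_path = 'path/to/chromedriver'

# Selenium 4 takes the driver path through a Service object
# (older releases accepted executable_path=driver_path instead).
driver = webdriver.Chrome(service=Service(driver_path))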

driver.get('https://www.example.com')

driver.quit()

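With the Bose framework, however, you can take a declarative, structured approach. You only need to write the following code; Bose takes care of creating the driver, passing it to the task's run method, and then closing the driver: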

from bose import *

class Task(BaseTask):
    def run(self, driver):
        driver.get('https://www.example.com')
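
To make the division of labor concrete, here is a minimal sketch of how a base class could own the driver lifecycle around run(). It is not Bose's actual implementation: SketchBaseTask, create_driver, begin_task, and FakeDriver are hypothetical names, and FakeDriver stands in for a real Selenium driver so the sketch runs without a browser.

```python
class FakeDriver:
    """Stand-in for a Selenium WebDriver (hypothetical, for illustration)."""

    def __init__(self):
        self.visited = []
        self.closed = False

    def get(self, url):
        self.visited.append(url)

    def quit(self):
        self.closed = True


class SketchBaseTask:
    """Hypothetical base class that owns the driver lifecycle around run()."""

    def create_driver(self):
        # A real framework would build a configured browser driver here.
        return FakeDriver()

    def run(self, driver):
        raise NotImplementedError("subclasses implement run()")

    def begin_task(self):
        driver = self.create_driver()
        try:
            self.run(driver)
        finally:
            # Release the driver even if run() raised.
            driver.quit()
        return driver


class Task(SketchBaseTask):
    def run(self, driver):
        driver.get("https://www.example.com")


driver = Task().begin_task()
print(driver.visited, driver.closed)
```

The task author writes only run(); creating and closing the driver lives in one place in the base class.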

Configuration

In bare Selenium, configuring options such as the profile, user agent, or window size requires a lot of code, as shown below:

from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium import webdriver

driver_path = 'path/to/chromedriver.exe'

options = Options()

profile_path = '1'

options.add_argument(f'--user-data-dir={profile_path}')

user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
options.add_argument(f'--user-agent={user_agent}')

window_width = 1200
window_height = 720
options.add_argument(f'--window-size={window_width},{window_height}')

# Selenium 4 takes the driver path through a Service object
# (older releases accepted executable_path=driver_path instead).
driver = webdriver.Chrome(service=Service(driver_path), options=options)

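The Bose framework, on the other hand, removes this complexity by encapsulating the browser configuration in the task's browser_config attribute, as shown below: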

from bose import BaseTask, BrowserConfig, UserAgent, WindowSize

class Task(BaseTask):
    browser_config = BrowserConfig(user_agent=UserAgent.user_agent_106, window_size=WindowSize.window_size_1280_720, profile=1)
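
As an illustration of what such a configuration object has to do behind the scenes, the sketch below turns a small config into the same Chrome flags used in the bare Selenium example. SketchBrowserConfig and to_chrome_arguments are hypothetical names invented for this sketch; they are not the Bose BrowserConfig class or its internals.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SketchBrowserConfig:
    """Hypothetical config object; not the Bose BrowserConfig class."""
    user_agent: Optional[str] = None
    window_size: Optional[Tuple[int, int]] = None
    profile: Optional[str] = None


def to_chrome_arguments(config: SketchBrowserConfig) -> List[str]:
    """Translate the config into Chrome command-line flags."""
    arguments = []
    if config.profile is not None:
        arguments.append(f"--user-data-dir={config.profile}")
    if config.user_agent is not None:
        arguments.append(f"--user-agent={config.user_agent}")
    if config.window_size is not None:
        width, height = config.window_size
        arguments.append(f"--window-size={width},{height}")
    return arguments


config = SketchBrowserConfig(
    user_agent="ExampleAgent/1.0",  # placeholder value for the demo
    window_size=(1280, 720),
    profile="1",
)
for argument in to_chrome_arguments(config):
    print(argument)
```

Each resulting string could then be passed to Selenium's options.add_argument, so unset fields simply produce no flag.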

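Exception Handling

Exceptions are common when working with Selenium. In bare Selenium, if an exception occurs, the driver closes automatically, leaving you with only logs to debug from.

In Bose, when an exception occurs in a scraping task, the browser stays open instead of closing immediately. This lets you see the live browser state at the moment the exception occurred, which helps greatly with debugging.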
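
The general pattern can be sketched in a few lines: catch the exception, pause while the browser is still open, and only then close the driver. This is an assumption about how such behavior could be built, not Bose's code; run_keeping_browser_open, pause, and FakeDriver are hypothetical names.

```python
import traceback


class FakeDriver:
    """Stand-in for a Selenium WebDriver (hypothetical, for illustration)."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


def run_keeping_browser_open(task, driver, pause):
    """Run task(driver); on failure, call pause(driver) before closing.

    Returns the formatted traceback as a string, or None on success.
    """
    error = None
    try:
        task(driver)
    except Exception:
        error = traceback.format_exc()
        # The driver is still open here, so the live page can be
        # inspected until pause() returns.
        pause(driver)
    finally:
        driver.quit()
    return error


def failing_task(driver):
    raise ValueError("selector not found")


states_seen_during_pause = []
driver = FakeDriver()
error = run_keeping_browser_open(
    failing_task,
    driver,
    # Interactively this could be: lambda d: input("Press Enter to close")
    pause=lambda d: states_seen_during_pause.append(d.closed),
)
print(states_seen_during_pause, driver.closed)
```

The demo records whether the driver was closed during the pause, which shows that the browser outlives the exception.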

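Debugging

Web scraping is often riddled with errors, such as wrong selectors or pages that fail to load. When debugging with raw Selenium, you may have to sift through logs to identify the problem. Fortunately, Bose makes debugging simple by storing information about every run.

After each run, a directory is created in the tasks/ folder containing three files, plus a fourth when the task crashes: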

`task_info.json`

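It contains information about the task run, such as its duration, the IP details of the task, the user agent, the window_size, and the profile used to execute the task.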

`final.png`

This is the screenshot captured just before the driver was closed.

`page.html`

This is the HTML source captured just before the driver was closed. It is very useful when your selectors fail to select elements.

`error.log`

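If your task crashes due to an exception, Bose also stores error.log, which contains the error that crashed the task. This is very helpful for debugging.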
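
A rough sketch of how these four files could be produced is shown below. It relies only on two real Selenium driver members, page_source and save_screenshot(filename), which FakeDriver imitates so the code runs without a browser. run_and_record and the duration_seconds field are hypothetical, and the real task_info.json holds more details than this sketch writes.

```python
import json
import tempfile
import time
import traceback
from pathlib import Path


class FakeDriver:
    """Stand-in exposing the two Selenium members this sketch relies on."""
    page_source = "<html><body>example</body></html>"

    def save_screenshot(self, filename):
        Path(filename).write_bytes(b"")  # placeholder instead of a real PNG
        return True


def run_and_record(task, driver, run_dir):
    """Run task(driver) and store debugging artifacts in run_dir."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    started = time.time()
    try:
        task(driver)
    except Exception:
        # Only written when the task crashes.
        (run_dir / "error.log").write_text(traceback.format_exc(), encoding="utf-8")
    finally:
        # Capture the final browser state whether or not the task failed.
        driver.save_screenshot(str(run_dir / "final.png"))
        (run_dir / "page.html").write_text(driver.page_source, encoding="utf-8")
        info = {"duration_seconds": round(time.time() - started, 3)}
        (run_dir / "task_info.json").write_text(json.dumps(info, indent=4), encoding="utf-8")


def failing_task(driver):
    raise RuntimeError("element not found")


run_dir = Path(tempfile.mkdtemp()) / "1"
run_and_record(failing_task, FakeDriver(), run_dir)
created = sorted(path.name for path in run_dir.iterdir())
print(created)
```

Capturing the screenshot and HTML in the finally block means they reflect the page as it was when the task stopped, successful or not.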

Outputting Data

After scraping, we need to store the data in JSON or CSV format. Usually this involves writing a fair amount of imperative code:

import csv
import json

def write_json(data, filename):
    with open(filename, 'w') as fp:
        json.dump(data, fp, indent=4)

def write_csv(data, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = data[0].keys()  # get the fieldnames from the first dictionary
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()  # write the header row
        writer.writerows(data)  # write each row of data

data = [
    {
        "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d",
        "author": "Albert Einstein"
    },
    {
        "text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d",
        "author": "J.K. Rowling"
    }
]

write_json(data, "data.json")
write_csv(data, "data.csv")

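Bose removes this complexity by encapsulating it in the Output module for reading and writing data.

To use Output, call the write method for the file type you want to save.

All data is saved in the output/ folder.

See the following code for reference: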

from bose import Output

data = [
    {
        "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d",
        "author": "Albert Einstein"
    },
    {
        "text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d",
        "author": "J.K. Rowling"
    }
]

Output.write_json(data, "data.json")
Output.write_csv(data, "data.csv")
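
For readers curious what such a module might wrap, here is a small sketch of a helper that creates its folder on demand and reads back what it wrote. SketchOutput and read_json are hypothetical names for illustration; the real Output class may be implemented differently.

```python
import csv
import json
import tempfile
from pathlib import Path


class SketchOutput:
    """Hypothetical helper; not the Bose Output class."""

    def __init__(self, folder="output"):
        self.folder = Path(folder)

    def _path(self, filename):
        # Create the folder lazily so callers never have to.
        self.folder.mkdir(parents=True, exist_ok=True)
        return self.folder / filename

    def write_json(self, data, filename):
        with open(self._path(filename), "w", encoding="utf-8") as fp:
            json.dump(data, fp, indent=4)

    def read_json(self, filename):
        with open(self._path(filename), encoding="utf-8") as fp:
            return json.load(fp)

    def write_csv(self, data, filename):
        with open(self._path(filename), "w", newline="", encoding="utf-8") as fp:
            writer = csv.DictWriter(fp, fieldnames=list(data[0].keys()))
            writer.writeheader()
            writer.writerows(data)


rows = [{"text": "example quote", "author": "Example Author"}]
# A temporary folder keeps the demo from touching the project directory.
output = SketchOutput(Path(tempfile.mkdtemp()) / "output")
output.write_json(rows, "data.json")
output.write_csv(rows, "data.csv")
print(output.read_json("data.json"))
```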

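Undetected Driver

Ultrafunkamsterdam created a ChromeDriver that has excellent support for bypassing all major bot detection systems, such as Distil, DataDome, and Cloudflare.

Bose recognizes the importance of bypassing bot detection and provides built-in support for Ultrafunkamsterdam's Undetected Driver.

Using the undetected driver in the Bose framework is as simple as passing the use_undetected_driver option to BrowserConfig, like so: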

from bose import BaseTask, BrowserConfig

class Task(BaseTask):
    browser_config = BrowserConfig(use_undetected_driver=True)

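LocalStorage

Just as modern browsers have a local storage module, Bose brings the same concept into its framework.

You can import the LocalStorage object to persist data across browser runs, which is very useful when scraping large amounts of data.

The data is stored in a file named local_storage.json in the root directory of your project. You can use it like this: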

from bose import LocalStorage

LocalStorage.set_item("pages", 5)
print(LocalStorage.get_item("pages"))
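
The idea of a JSON file acting as persistent key-value storage can be sketched as follows. JsonStore is a hypothetical class written for this illustration, not the Bose LocalStorage implementation, and it writes to a temporary directory rather than the project root.

```python
import json
import tempfile
from pathlib import Path


class JsonStore:
    """Hypothetical file-backed key-value store; not Bose's LocalStorage."""

    def __init__(self, path):
        self.path = Path(path)

    def _load(self):
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def get_item(self, key, default=None):
        return self._load().get(key, default)

    def set_item(self, key, value):
        data = self._load()
        data[key] = value
        self.path.write_text(json.dumps(data, indent=4), encoding="utf-8")


store_path = Path(tempfile.mkdtemp()) / "local_storage.json"
JsonStore(store_path).set_item("pages", 5)

# A fresh instance, as in a later run, still sees the saved value.
pages = JsonStore(store_path).get_item("pages")
print(pages)
```

Because every read goes back to the file, a scraper that crashes on one page can resume from the last saved page number on its next run.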

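Bose Driver

The driver you receive in a task's run method is an extended version of Selenium's driver that adds powerful methods to make working with Selenium easier. Some popular methods that the Bose framework adds to the Selenium driver are: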


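Method	Description
get_by_current_page_referrer(link, wait=None)	Simulates a visit that appears as if you arrived at the page by clicking a link. This creates more natural, less detectable browsing behavior.
js_click(element)	Clicks an element using JavaScript, bypassing any interception (ElementClickInterceptedException) by pop-ups or alerts.
get_cookies_and_local_storage_dict()	Returns a dictionary containing "cookies" and "local_storage".
add_cookies_and_local_storage_dict(self, site_data)	Adds both cookies and local storage data to the current website.
organic_get(link, wait=None)	Visits Google and then visits the link, making the visit less detectable.
local_storage	Returns an instance of the LocalStorage module for interacting with the browser's local storage in an easy-to-use way.
save_screenshot(filename=None)	Saves a screenshot of the current web page to a file in the tasks/ directory.
short_random_sleep() and long_random_sleep()	Sleep for a random amount of time, either between 2 and 4 seconds (short) or between 6 and 9 seconds (long).
get_element_or_* [e.g. get_element_or_none, get_element_or_none_by_selector, get_element_by_id, get_element_or_none_by_text_contains]	Find web elements on the page based on different criteria. They return the web element if it exists, or None if it does not.
is_in_page(target, wait=None, raise_exception=False)	Checks whether the browser is on the specified page.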
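
As one example of what such helpers involve, the random sleeps can be sketched in plain Python using the 2 to 4 and 6 to 9 second ranges from the table. This is an illustrative reimplementation, not Bose's code; random_sleep and its sleep and rng parameters are hypothetical additions that make the demo finish instantly.

```python
import random
import time


def random_sleep(low, high, sleep=time.sleep, rng=random):
    """Sleep for a random duration in [low, high] seconds and return it."""
    duration = rng.uniform(low, high)
    sleep(duration)
    return duration


def short_random_sleep(**kwargs):
    return random_sleep(2, 4, **kwargs)


def long_random_sleep(**kwargs):
    return random_sleep(6, 9, **kwargs)


# Record the durations instead of really sleeping, so the demo is instant.
recorded = []
short_duration = short_random_sleep(sleep=recorded.append)
long_duration = long_random_sleep(sleep=recorded.append)
print(recorded)
```

Varying the pause between actions avoids the fixed, machine-like timing that a constant time.sleep would produce.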

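Bose is an excellent framework that simplifies the tedious parts of Selenium and web scraping.

Good luck, and happy bot development with the Bose Framework!

Learn More

To learn more about the Bose bot development framework, read the Bose docs at https://www.omkar.cloud/bose/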
