X-Crawl

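X-Crawl is a flexible Node.js crawler library. It can crawl pages, control pages, batch network requests, batch download file resources, poll and crawl on a schedule, and more. It supports crawling data in asynchronous or synchronous mode. It runs on Node.js, is flexible and simple to use, and is friendly to JS/TS developers.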

If you find it useful, you can give the x-crawl repository a star to support it. Your star will be my motivation to keep updating it.

Features

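Supports crawling data in asynchronous or synchronous mode.
Flexible writing style: supports multiple ways of writing request configurations and of obtaining crawl results.
Flexible crawl interval: no interval, a fixed interval, or a random interval. It is up to you whether to use or avoid highly concurrent crawling.
With simple configuration you can crawl pages, batch network requests, batch download file resources, poll and crawl on a schedule, and more.
Crawls SPAs (Single Page Applications) to generate pre-rendered content (i.e. "SSR", Server-Side Rendering), parses the content with the jsdom library, and also supports parsing it yourself.
Form submissions, keystrokes, event actions, screenshots of generated pages, and more.
Captures and records the success and failure of each crawl, and highlights the reminders.
Written in TypeScript, with type definitions and generics provided.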
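
The three interval modes listed above (none, fixed, random) can be sketched as follows. This is a minimal illustration under stated assumptions, not X-Crawl's actual implementation: the `IntervalTime` type and the helpers `resolveDelay`, `sleep`, and `runSequentially` are hypothetical names introduced here, and only the `{ max, min }` shape is taken from the `intervalTime` option shown in the example.

```typescript
// Hypothetical type for illustration: undefined = no interval,
// number = fixed interval (ms), { min, max } = random interval (ms).
type IntervalTime = number | { min: number; max: number } | undefined

// Resolve an interval setting to a concrete delay in milliseconds.
// `random` is injectable so the function can be tested deterministically.
function resolveDelay(
  interval: IntervalTime,
  random: () => number = Math.random
): number {
  if (interval === undefined) return 0 // no interval
  if (typeof interval === 'number') return interval // fixed interval
  const { min, max } = interval
  return Math.floor(min + random() * (max - min)) // random value in [min, max)
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms))

// Run tasks one at a time, pausing before every task except the first.
async function runSequentially<T>(
  tasks: Array<() => Promise<T>>,
  interval: IntervalTime
): Promise<T[]> {
  const results: T[] = []
  let isFirst = true
  for (const task of tasks) {
    if (!isFirst) await sleep(resolveDelay(interval))
    isFirst = false
    results.push(await task())
  }
  return results
}
```

Running tasks sequentially with a delay is what avoids concurrent requests; with no interval, a caller could instead start all tasks at once with `Promise.all`.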

Example

Scheduled crawling: as an example, automatically fetch the cover images of Airbnb Plus listings once a day:

// 1. Import the module (ES/CJS)
import xCrawl from 'x-crawl'

// 2. Create a crawler instance
const myXCrawl = xCrawl({
  timeout: 10000, // request timeout (ms)
  intervalTime: { max: 3000, min: 2000 } // random crawl interval between 2000 and 3000 ms
})

// 3. Set the crawling task
/* 
  Call the startPolling API to start the polling function;
  with { d: 1 } the callback function is called once a day
*/
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call crawlPage API to crawl Page
  const { jsdom } = await myXCrawl.crawlPage('https://zh.airbnb.com/s/*/plus_homes')

  // Get the cover image elements for Plus listings
  const imgEls = jsdom.window.document
    .querySelector('.a1stauiv')
    ?.querySelectorAll('picture img')

  // set request configuration
  const requestConfig: string[] = []
  imgEls?.forEach((item) => requestConfig.push(item.src))

  // Call the crawlFile API to crawl pictures
  myXCrawl.crawlFile({ requestConfig, fileConfig: { storeDir: './upload' } })
})

Running result:

Note: Do not crawl indiscriminately. Check the site's robots.txt before crawling. This example only demonstrates how to use X-Crawl.
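
As a rough illustration of the robots.txt check mentioned in the note, the sketch below tests a path against the `Disallow` rules of the `User-agent: *` group. It is deliberately simplified and is not part of X-Crawl: the function name `isPathDisallowed` is hypothetical, and `Allow` rules, wildcards, and agent-specific groups are ignored, so a real crawler should rely on a complete robots.txt parser.

```typescript
// Simplified, hypothetical helper: honors only `Disallow` path prefixes
// that belong to a `User-agent: *` group.
function isPathDisallowed(robotsTxt: string, path: string): boolean {
  const disallowed: string[] = []
  let inWildcardGroup = false
  let prevWasUserAgent = false

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim() // strip comments
    const sep = line.indexOf(':')
    if (sep === -1) continue // blank or malformed line

    const field = line.slice(0, sep).trim().toLowerCase()
    const value = line.slice(sep + 1).trim()

    if (field === 'user-agent') {
      // A user-agent line that follows a rule line starts a new group.
      if (!prevWasUserAgent) inWildcardGroup = false
      if (value === '*') inWildcardGroup = true
      prevWasUserAgent = true
    } else {
      if (field === 'disallow' && inWildcardGroup && value !== '') {
        disallowed.push(value)
      }
      prevWasUserAgent = false
    }
  }

  return disallowed.some((prefix) => path.startsWith(prefix))
}

// Usage with an inline sample; in practice the text would be fetched
// from the target site's /robots.txt.
const sampleRobots = [
  'User-agent: *',
  'Disallow: /private/ # internal pages',
  '',
  'User-agent: otherbot',
  'Disallow: /'
].join('\n')

console.log(isPathDisallowed(sampleRobots, '/private/page'))
console.log(isPathDisallowed(sampleRobots, '/public/page'))
```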
