Web scraping Google Play app reviews
#node #webscraping #serpapi

What will be scraped

[Image: what will be scraped]

Note: You can use the official Google Play Developer API, which has a default limit of 200,000 requests per day for retrieving the list of reviews and individual reviews.

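Alternatively, you can use google-play-scraper, a complete third-party scraping solution for Google Play Store apps. Third-party solutions are usually used to get around the quota limit.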

This blog post is meant to give you an idea of how to scrape Google Play Store app reviews with Puppeteer, so that you can build something of your own.

Full code

If you don't need an explanation, have a look at the full code example in the online IDE.

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const reviewsLimit = 100; // hardcoded limit for demonstration purpose

const searchParams = {
  id: "com.discord", // Parameter defines the ID of a product you want to get the results for
  hl: "en", // Parameter defines the language to use for the Google search
  gl: "us", // parameter defines the country to use for the Google search
};

const URL = `https://play.google.com/store/apps/details?id=${searchParams.id}&hl=${searchParams.hl}&gl=${searchParams.gl}`;

async function scrollPage(page, clickElement, scrollContainer) {
  let lastHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
  while (true) {
    await page.click(clickElement);
    await page.waitForTimeout(500);
    await page.keyboard.press("End");
    await page.waitForTimeout(2000);
    let newHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
    const reviews = await page.$$(".RHo1pe");
    if (newHeight === lastHeight || reviews.length > reviewsLimit) {
      break;
    }
    lastHeight = newHeight;
  }
}

async function getReviewsFromPage(page) {
  return await page.evaluate(() => ({
    reviews: Array.from(document.querySelectorAll(".RHo1pe")).map((el) => ({
      title: el.querySelector(".X5PpBb")?.textContent.trim(),
      avatar: el.querySelector(".gSGphe > img")?.getAttribute("srcset")?.slice(0, -3),
      rating: parseInt(el.querySelector(".Jx4nYe > div")?.getAttribute("aria-label")?.slice(6)),
      snippet: el.querySelector(".h3YV2d")?.textContent.trim(),
      likes: parseInt(el.querySelector(".AJTPZc")?.textContent.trim()) || "No likes",
      date: el.querySelector(".bp9Aid")?.textContent.trim(),
      response: {
        title: el.querySelector(".ocpBU .I6j64d")?.textContent.trim(),
        snippet: el.querySelector(".ocpBU .ras4vb")?.textContent.trim(),
        date: el.querySelector(".ocpBU .I9Jtec")?.textContent.trim(),
      },
    })),
  }));
}

async function getAppReviews() {
  const browser = await puppeteer.launch({
    headless: true, // if you want to see what the browser is doing, you need to change this option to "false"
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();

  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);

  await page.waitForSelector(".qZmL0");

  const moreReviewButton = await page.$("c-wiz[jsrenderer='C7s1K'] .VMq4uf button");

  if (moreReviewButton) {
    await page.click("c-wiz[jsrenderer='C7s1K'] .VMq4uf button");
    await page.waitForSelector(".RHo1pe .h3YV2d");
    await scrollPage(page, ".RHo1pe .h3YV2d", ".odk6He");
  }
  const reviews = await getReviewsFromPage(page);

  await browser.close();

  return reviews;
}

getAppReviews().then((result) => console.dir(result, { depth: null }));

Preparation

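First, we need to create a Node.js* project and add the npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth to control Chromium (or Chrome, or Firefox, but here we only work with Chromium) over the DevTools Protocol in headless or non-headless mode.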

To do this, open the command line in the project directory and enter:

$ npm init -y

And then:

$ npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.

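Note: You can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth to prevent websites from detecting that you are using headless Chromium or a web driver. You can check this on the Chrome headless tests website. The screenshot below shows the difference.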

[Screenshot: headless detection test results with and without the stealth plugin]

Process

First, we need to scroll through all the reviews until no more of them load, which is the difficult part described below.

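The next step is to extract data from the HTML elements once scrolling has finished. Getting the right CSS selectors is fairly easy with the SelectorGadget Chrome extension, which lets you grab a CSS selector by clicking the desired element in the browser. However, it does not always work perfectly, especially when the website relies heavily on JavaScript.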

If you want to know more about them, we have a dedicated Web Scraping with CSS Selectors blog post at SerpApi.

The GIF below illustrates how to select different parts of the results with SelectorGadget.

[GIF: selecting parts of the results with SelectorGadget]

Code explanation

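Declare puppeteer from the puppeteer-extra library to control the Chromium browser, and StealthPlugin from the puppeteer-extra-plugin-stealth library to prevent websites from detecting that you are using a web driver: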

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

Next, we tell puppeteer to use StealthPlugin, write the necessary request parameters and the search URL, and set how many reviews we want to receive (the reviewsLimit constant):

puppeteer.use(StealthPlugin());

const reviewsLimit = 100; // hardcoded limit for demonstration purpose

const searchParams = {
  id: "com.discord", // Parameter defines the ID of a product you want to get the results for
  hl: "en", // Parameter defines the language to use for the Google search
  gl: "us", // parameter defines the country to use for the Google search
};

const URL = `https://play.google.com/store/apps/details?id=${searchParams.id}&hl=${searchParams.hl}&gl=${searchParams.gl}`;
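The template string works well for a fixed app ID. If the values come from user input, a small helper that lets the built-in URL class do the encoding may be safer. This is only a sketch: buildAppUrl is a hypothetical name that is not part of the original script.

```javascript
// Hypothetical helper (not in the original script): builds the Google Play
// details URL from the same three parameters, with proper URL encoding.
function buildAppUrl({ id, hl, gl }) {
  const url = new URL("https://play.google.com/store/apps/details");
  url.searchParams.set("id", id); // app ID, e.g. "com.discord"
  url.searchParams.set("hl", hl); // language
  url.searchParams.set("gl", gl); // country
  return url.toString();
}

console.log(buildAppUrl({ id: "com.discord", hl: "en", gl: "us" }));
```

Note that the original script names its constant URL, which shadows the global URL class. To use a helper like this in the same file, that constant would need a different name.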

Next, we write a function that scrolls the page to load all the reviews:

async function scrollPage(page, clickElement, scrollContainer) {
  ...
}

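In this function, we first need to get the scrollContainer height (using the evaluate() method).

Then we use a while loop in which we click on a review element (click() method) to keep the focus on it, wait 0.5 seconds (using the waitForTimeout method), press the "End" key to scroll to the last review element, wait 2 seconds and get the new scrollContainer height.

Next, we check whether newHeight is equal to lastHeight or whether the number of reviews received is greater than reviewsLimit. If either is true, we stop the loop. Otherwise, we assign the newHeight value to the lastHeight variable and repeat until the page has been scrolled to the end: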

let lastHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
while (true) {
  await page.click(clickElement);
  await page.waitForTimeout(500);
  await page.keyboard.press("End");
  await page.waitForTimeout(2000);
  let newHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
  const reviews = await page.$$(".RHo1pe");
  if (newHeight === lastHeight || reviews.length > reviewsLimit) {
    break;
  }
  lastHeight = newHeight;
}
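The stop condition of the loop above can be tried out without a browser. The sketch below is an assumption-laden rewrite, not the original function: scrollUntilDone and its three callbacks (scrollOnce, getHeight, countItems) are hypothetical names. In the real script they would wrap page.click and page.keyboard.press, page.evaluate, and page.$$. The sleep helper is a plain promise-based pause that can stand in for page.waitForTimeout if your Puppeteer version no longer provides it.

```javascript
// Promise-based pause; a stand-in for page.waitForTimeout.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical, browser-agnostic version of the scroll loop.
async function scrollUntilDone({ scrollOnce, getHeight, countItems, limit, pauseMs = 2000 }) {
  let lastHeight = await getHeight();
  while (true) {
    await scrollOnce();
    await sleep(pauseMs); // give lazily loaded reviews time to render
    const newHeight = await getHeight();
    const count = await countItems();
    // Stop when nothing new was loaded or enough reviews were collected.
    if (newHeight === lastHeight || count > limit) {
      return { height: newHeight, count };
    }
    lastHeight = newHeight;
  }
}

// Simulated "page": each scroll adds 20 reviews until 60 are loaded.
function makeFakePage() {
  let items = 20;
  return {
    scrollOnce: async () => {
      items = Math.min(items + 20, 60);
    },
    getHeight: async () => items * 100,
    countItems: async () => items,
  };
}

scrollUntilDone({ ...makeFakePage(), limit: 100, pauseMs: 0 }).then(console.log);
```

With the simulated page the loop ends because the height stops changing. With a lower limit it ends as soon as the review count exceeds that limit, which is why the result can contain slightly more reviews than reviewsLimit.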

Next, we write a function that gets the review data from the page:

async function getReviewsFromPage(page) {
  ...
}

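In this function, we get information from the page context and save it in the returned object. Next, we need to get all the HTML elements matching the ".RHo1pe" selector (querySelectorAll() method).

Then we use the map() method to iterate over the array built with the Array.from() method: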

return await page.evaluate(() => ({
    reviews: Array.from(document.querySelectorAll(".RHo1pe")).map((el) => ({
      ...
    })),
}));

Finally, we get each piece of data using the following selectors and methods:

title: el.querySelector(".X5PpBb")?.textContent.trim(),
avatar: el.querySelector(".gSGphe > img")?.getAttribute("srcset")?.slice(0, -3),
rating: parseInt(el.querySelector(".Jx4nYe > div")?.getAttribute("aria-label")?.slice(6)),
snippet: el.querySelector(".h3YV2d")?.textContent.trim(),
likes: parseInt(el.querySelector(".AJTPZc")?.textContent.trim()) || "No likes",
date: el.querySelector(".bp9Aid")?.textContent.trim(),
response: {
    title: el.querySelector(".ocpBU .I6j64d")?.textContent.trim(),
    snippet: el.querySelector(".ocpBU .ras4vb")?.textContent.trim(),
    date: el.querySelector(".ocpBU .I9Jtec")?.textContent.trim(),
},
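The slice() and parseInt() calls above are easier to follow on plain strings. The helpers below are a sketch with hypothetical names, and the sample inputs in the comments are assumptions about what the page returns, so check them against the live markup before relying on them.

```javascript
// Sample attribute and text values in the comments are assumptions.

// "Rated 5 stars out of five stars" -> 5 (drops the 6-character "Rated " prefix)
const parseRating = (ariaLabel) => parseInt(ariaLabel?.slice(6), 10);

// "https://example.com/photo=s64-rw 2x" -> "https://example.com/photo=s64-rw"
// (drops the trailing " 2x" density descriptor)
const parseAvatar = (srcset) => srcset?.slice(0, -3);

// "12 people found this review helpful" -> 12; missing or empty text -> "No likes"
const parseLikes = (text) => parseInt(text?.trim(), 10) || "No likes";

console.log(parseRating("Rated 5 stars out of five stars"));
console.log(parseAvatar("https://example.com/photo=s64-rw 2x"));
console.log(parseLikes(undefined));
```

Because of optional chaining, a missing element gives NaN for the rating and undefined for the avatar instead of throwing an error.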

Next, we write a function that controls the browser and gets the information:

async function getAppReviews() {
  ...
}

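First, in this function we need to define browser using the puppeteer.launch({options}) method with the current options, such as headless: true and args: ["--no-sandbox", "--disable-setuid-sandbox"].

These options mean that we use headless mode and an array of arguments that allows the browser process to launch in the online IDE. Then we open a new page: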

const browser = await puppeteer.launch({
  headless: true, // if you want to see what the browser is doing, you need to change this option to "false"
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});

const page = await browser.newPage();

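Next, we raise the default navigation timeout from 30 seconds to 60000 ms (1 min) with the .setDefaultNavigationTimeout() method to allow for slow internet connections, open the URL with the .goto() method, and use the .waitForSelector() method to wait until the selector has loaded: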

await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);
await page.waitForSelector(".qZmL0");

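Finally, we check whether the "Show all reviews" button is present on the page (using the $() method). If it is, we click it and wait until the page has been scrolled. Then we save the reviews from the page in the reviews constant, close the browser, and return the received data: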

const moreReviewButton = await page.$("c-wiz[jsrenderer='C7s1K'] .VMq4uf button");

if (moreReviewButton) {
  await page.click("c-wiz[jsrenderer='C7s1K'] .VMq4uf button");
  await page.waitForSelector(".RHo1pe .h3YV2d");
  await scrollPage(page, ".RHo1pe .h3YV2d", ".odk6He");
}
const reviews = await getReviewsFromPage(page);

await browser.close();

return reviews;
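If a selector never appears, waitForSelector throws and the browser.close() call above is never reached, so the Chromium process may keep running. One way to guard against that is a try/finally wrapper. This is a sketch, not part of the original script: withBrowser is a hypothetical helper, and a stand-in object replaces the real browser so the behaviour can be seen without launching Chromium.

```javascript
// Hypothetical wrapper: `launch` is any function that returns a browser-like
// object with a close() method, e.g. () => puppeteer.launch({ headless: true }).
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close(); // runs on success and on error alike
  }
}

// Stand-in browser, used only to demonstrate the behaviour.
const fakeBrowser = {
  closed: false,
  async close() {
    this.closed = true;
  },
};

withBrowser(
  async () => fakeBrowser,
  async () => {
    throw new Error("selector not found");
  }
).catch((err) => console.log(err.message, "| browser closed:", fakeBrowser.closed));
```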

Now we can launch our parser:

$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file

Output

{
   "reviews":[
      {
         "title":"Faera Rathion",
         "avatar":"https://play-lh.googleusercontent.com/a-/ACNPEu_jb8bwx7nBMUAm6ogXkSy2udBVV7GYnygiESuv=s64-rw",
         "rating":1,
         "snippet":"I would've given this 5 stars a few months ago, being a long time user, but these recent updates have made the app extremely frustrating to use. I get randomly put into channels when I open the app, they scroll me back sometimes hundreds of messages, it's impossible to see all the channels in some Discords, doesn't clear notifications without having to try to fully scroll through a channel I was mentioned in to the point of having to refresh it multiple times and many more consistent issues.",
         "likes":2,
         "date":"October 19, 2022",
         "response":{
            "title":"Discord Inc.",
            "snippet":"We're sorry for the inconvenience. We hear you and our teams are actively working on rolling out fixes daily. If you continue to experience issues, please make sure your app is on the latest updated version. Also, your feedback greatly affects what we focus on so please let us know if you continue to have issues at dis.gd/contact.",
            "date":"October 19, 2022"
         }
      },
      {
         "title":"Avoxx Nepps",
         "avatar":"https://play-lh.googleusercontent.com/a-/ACNPEu_WAW8BQ6SiTqR2gFzjxXpjSjFiAEx3E3cMKGQ1w5o=s64-rw",
         "rating":2,
         "snippet":"The new update has made it borderline unusable. It is extremely glitchy and a lot of times doesn't even work properly. Can't even join a voice call without it leaving and rejoining by itself or muting me for unknown reason. The new video system absolutely sucks. All of the minor inconveniences the previous version had is nothing compared to this update which looks like it was thrown together by a team of teenagers in Middle School in a month for a school project.",
         "likes":"No likes",
         "date":"October 20, 2022",
         "response":{
            "title":"Discord Inc.",
            "snippet":"We'd like to know more about the issues you've encountered after the recent update. Could you please submit a support ticket so we can look into the issue?: dis.gd/contact If you have any suggestions about what should be changed or improved, please share them on our Feedback page here: dis.gd/feedback",
            "date":"October 21, 2022"
         }
      },
      ...and other reviews
   ]
}
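As the output shows, likes is either a number or the string "No likes", and the scroll loop can return a few more reviews than reviewsLimit. If you need a uniform shape for further processing, a small clean-up step may help. The sketch below assumes the result object shown above. tidyReviews and the hasResponse field are hypothetical additions, not part of the original script.

```javascript
// Hypothetical post-processing step for the object returned by getAppReviews().
function tidyReviews(result, limit) {
  return result.reviews
    .slice(0, limit) // the scroll loop may load slightly more than the limit
    .map((review) => ({
      ...review,
      likes: typeof review.likes === "number" ? review.likes : 0, // "No likes" -> 0
      hasResponse: Boolean(review.response?.title), // did the developer reply?
    }));
}

// Shortened sample in the same shape as the output above.
const sample = {
  reviews: [
    { title: "Faera Rathion", rating: 1, likes: 2, response: { title: "Discord Inc." } },
    { title: "Avoxx Nepps", rating: 2, likes: "No likes", response: {} },
  ],
};

console.log(tidyReviews(sample, 100));
```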

Using the Google Play Product API from SerpApi

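This section compares the DIY solution with our solution.

The biggest difference is that you don't need to create the parser from scratch and maintain it.

Google may also block requests at some point. We handle that on our backend, so there is no need to work out how to deal with it yourself or which CAPTCHA-solving or proxy provider to use.

First, we need to install google-search-results-nodejs: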

npm i google-search-results-nodejs

Here is the full code example, if you don't need an explanation:

const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY); //your API key from serpapi.com

const reviewsLimit = 100; // hardcoded limit for demonstration purpose

const params = {
  engine: "google_play_product", // search engine
  gl: "us", // parameter defines the country to use for the Google search
  hl: "en", // parameter defines the language to use for the Google search
  store: "apps", // parameter defines the type of Google Play store
  product_id: "com.discord", // Parameter defines the ID of a product you want to get the results for.
  all_reviews: "true", // Parameter is used for retrieving all reviews of a product
};

const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};

const getResults = async () => {
  const allReviews = [];
  while (true) {
    const json = await getJson();
    if (json.reviews) {
      allReviews.push(...json.reviews);
    } else break;
    if (json.serpapi_pagination?.next_page_token) {
      params.next_page_token = json.serpapi_pagination?.next_page_token;
    } else break;
    if (allReviews.length > reviewsLimit) break;
  }
  return allReviews;
};

getResults().then((result) => console.dir(result, { depth: null }));

Code explanation

First, we need to declare SerpApi from the google-search-results-nodejs library and define a new search instance with your API key from SerpApi:

const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(API_KEY);

Next, we write how many reviews we want to receive (the reviewsLimit constant) and the parameters needed to make a request:

const reviewsLimit = 100; // hardcoded limit for demonstration purpose

const params = {
  engine: "google_play_product", // search engine
  gl: "us", // parameter defines the country to use for the Google search
  hl: "en", // parameter defines the language to use for the Google search
  store: "apps", // parameter defines the type of Google Play store
  product_id: "com.discord", // Parameter defines the ID of a product you want to get the results for.
  all_reviews: "true", // Parameter is used for retrieving all reviews of a product
};

Next, we wrap the search method from the SerpApi library in a promise so that we can work with the search results further:

const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};
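The wrapper only relies on the json(params, callback) call shape, so it can be tried out without an API key. In the sketch below, fakeSearch is a stub that returns canned data and is not the real SerpApi client. The wrapper also takes the search object and params as arguments instead of reading them from the outer scope, which is a small design change of mine.

```javascript
// Stub standing in for the SerpApi search object; it only mimics the
// json(params, callback) call shape and returns canned data.
const fakeSearch = {
  json(params, callback) {
    setTimeout(() => {
      callback({ engine: params.engine, reviews: [{ title: "Sample user", rating: 5 }] });
    }, 0);
  },
};

// Same idea as the wrapper above, with the dependencies passed in.
const getJson = (search, params) =>
  new Promise((resolve) => {
    search.json(params, resolve);
  });

getJson(fakeSearch, { engine: "google_play_product" }).then((json) => {
  console.log(json.engine, json.reviews.length);
});
```

Passing the dependencies in makes it easy to swap the stub for the real search instance later.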

Finally, we declare the getResults function, which gets the data from the page and returns it:

const getResults = async () => {
  ...
};

First, in this function we declare the allReviews array, which will hold the results:

const allReviews = [];

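Next, we need to use a while loop. In this loop we get the json with the results, check whether reviews are present on the page, push them (push() method) to the allReviews array (using spread syntax), set next_page_token on the params object, and repeat the loop until there are no more results on the page or the number of reviews received is greater than reviewsLimit: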

while (true) {
  const json = await getJson();
  if (json.reviews) {
    allReviews.push(...json.reviews);
  } else break;
  if (json.serpapi_pagination?.next_page_token) {
    params.next_page_token = json.serpapi_pagination?.next_page_token;
  } else break;
  if (allReviews.length > reviewsLimit) break;
}
return allReviews;
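The same pagination pattern can be written so that it does not mutate the shared params object and never returns more than the requested number of reviews. This is a sketch under assumptions: collectReviews and fetchPage are hypothetical names, and fetchPage(token) is assumed to resolve to one page of JSON in the shape used above. Canned pages replace real API calls.

```javascript
// Hypothetical variant of the pagination loop.
async function collectReviews(fetchPage, limit) {
  const all = [];
  let token; // undefined on the first request
  do {
    const json = await fetchPage(token);
    if (!json.reviews) break; // no results on this page
    all.push(...json.reviews);
    token = json.serpapi_pagination?.next_page_token;
  } while (token && all.length < limit);
  return all.slice(0, limit); // never return more than requested
}

// Three canned pages of two reviews each, keyed by page token.
const pages = {
  start: { reviews: ["r1", "r2"], serpapi_pagination: { next_page_token: "p2" } },
  p2: { reviews: ["r3", "r4"], serpapi_pagination: { next_page_token: "p3" } },
  p3: { reviews: ["r5", "r6"] },
};
const fetchPage = async (token) => pages[token ?? "start"];

collectReviews(fetchPage, 3).then(console.log);
```

With the real client, fetchPage could call the promise wrapper with a copy of params plus the token, for example { ...params, next_page_token: token }.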

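After that, we run the getResults function and print all the received information to the console with the console.dir method, which accepts an options object for changing the default output settings: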

getResults().then((result) => console.dir(result, { depth: null }));
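To see what depth: null changes, it helps to look at util.inspect, which Node.js uses to format console.dir output. The sketch below uses a made-up object in the shape returned by the Puppeteer version above, where the developer response sits three levels deep.

```javascript
const util = require("util");

// Made-up sample in the shape of the Puppeteer result.
const result = {
  reviews: [{ title: "Sample user", response: { title: "Sample developer" } }],
};

// The default depth is 2, so the nested `response` object is collapsed.
console.log(util.inspect(result));

// depth: null removes the limit and prints every level.
console.log(util.inspect(result, { depth: null }));
```

The SerpApi result is a plain array, so its response objects sit one level higher, but passing depth: null keeps the output complete in both cases.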

Output

[
   {
      "title":"Johnathan Kamuda",
      "avatar":"https://play-lh.googleusercontent.com/a-/ACNPEu9QaKcoysS5G21Q5DQxs5nm2pg07GfJa-M_ezvOWfU",
      "rating":5,
      "snippet":"Been using Discord for many, many years. They are always making it better. It's become so much more robust and feature filled since I first started using it. And it's platform to pay for extras is great. You don't NEED to, but it's nice to have that kind of service a available if we wanted some perks. I think some of the options could be laid out better. Personal example - changing individuals volume in a call, not an intuitive option to find at first. Things like that fixed, would be perfect.",
      "likes":29,
      "date":"October 19, 2022"
   },
   {
      "title":"Lark Reid",
      "avatar":"https://play-lh.googleusercontent.com/a-/ACNPEu-RDynxDvoH-8_jnUj48AbZvXYrrafsLP3WT0fyTA",
      "rating":1,
      "snippet":"Ever since the new update me and other people that I know have completely lost the ability to upload more than one image/video at a time. It freezes on 70-100% when uploading multiple at a time. Now the audio on videos that I upload turn into static. I played the videos on my phone to make sure they weren't corrupted, and they are just fine. Sometimes when I open the app it gets stuck connecting and I have to restart it. Please fix your app asap. It's just not my phone that is effected.",
      "likes":84,
      "date":"October 21, 2022",
      "response":{
         "title":"Discord Inc.",
         "snippet":"We're sorry for the inconvenience. We hear you and our teams are actively working on rolling out fixes daily. If you continue to experience issues, please make sure your app is on the latest updated version. Also, your feedback greatly affects what we focus on so please let us know if you continue to have issues at dis.gd/contact.",
         "date":"October 21, 2022"
      }
   },
    ... and other reviews
]

Links

If you want to see some projects made with SerpApi, write me a message.

Join us on Twitter | YouTube

Add a Feature Request or a Bug