Extracting Data from E-Commerce Websites - DEV365 Developer Community

Basic web scraping is one of the essentials for a data analyst. Being able to collect your own data for a project is an underrated skill.

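I recently scraped some data from 4 large art stores (websites) in Nigeria, and I want to share the code (including code from ChatGPT) for learning purposes. Other data analysts may find it useful.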


The first website is Crafts Village. I scraped the Art-Tools category.

Code for scraping the website

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

# Initialize lists to store the data
product_names = []
prices = []

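# Scrape all 6 pages (assumes WooCommerce-style /page/N/ pagination;
# without the page number in the URL, page 1 would be fetched 6 times)
for page in range(1, 7):
    url = f"https://craftsvillage.com.ng/product-category/art-tools/page/{page}/"
    response = requests.get(url)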
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the relevant HTML elements for product information
    products = soup.find_all("li", class_="product")

    # Extract data from each product element
    for product in products:
        # Product name (skip the product if the link element is missing)
        name_element = product.find("a", class_="woocommerce-LoopProduct-link")
        if name_element is None:
            continue
        name = name_element.text.replace("\n", "").strip()
        name = re.sub(r"[₦,|–]", "", name)  # Remove unwanted characters
        product_names.append(name)


        # Price
        price_element = product.find("bdi")
        price = price_element.text if price_element else None
        prices.append(price)

# Create a Pandas DataFrame from the scraped data
data = {
    "Product Name": product_names,
    "Price": prices
}
df = pd.DataFrame(data)

# Remove "\n\n\n\n\n" from "Product Name" column
df["Product Name"] = df["Product Name"].str.replace("\n", "")

# Display the Data Frame
print(df)

To find the class of the name element, I inspected it in the browser: I placed the cursor on the product name, right-clicked, and chose Inspect.

I did the same for the price.

The code above extracts the product names and prices from all 6 pages of the Art-Tools category.
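To make the inspection step more concrete, here is a minimal sketch that runs the same kind of lookup on a small hand-written HTML fragment. The fragment is an assumption that imitates typical WooCommerce markup; it is not copied from Crafts Village, and the product name and price in it are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, similar to what the browser's Inspect panel might show.
sample_html = """
<ul class="products">
  <li class="product">
    <a class="woocommerce-LoopProduct-link" href="#">
      <h2>Sample Brush Set</h2>
    </a>
    <span class="price"><bdi>₦1,500.00</bdi></span>
  </li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")
product = soup.find("li", class_="product")

# The class name seen in Inspect is passed to find() through class_.
name = product.find("a", class_="woocommerce-LoopProduct-link").get_text(strip=True)
# The price sits in a <bdi> tag, so the tag name alone is enough here.
price = product.find("bdi").get_text(strip=True)

print(name, price)
```

Whatever tag and class the Inspect panel highlights for your own target site is what goes into the `find` call.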

This is how I scraped information from Crafties Hobbies

import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://craftieshobbycraft.com/product-category/painting-drawing/page/{}/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Create lists to store data
categories = []
product_names = []
product_prices = []

# Iterate over each page
for page in range(1, 8):
    url = base_url.format(page)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    category_elements = soup.find_all('p', class_='category uppercase is-smaller no-text-overflow product-cat op-7')
    product_names_elements = soup.find_all('a', class_='woocommerce-LoopProduct-link woocommerce-loop-product__link')
    product_prices_elements = soup.find_all('bdi')

    for category_element, product_name_element, product_price_element in zip(category_elements, product_names_elements, product_prices_elements):
        category = category_element.get_text(strip=True)
        product_name = product_name_element.get_text(strip=True)
        product_price = product_price_element.get_text(strip=True)

        categories.append(category)
        product_names.append(product_name)
        product_prices.append(product_price)

# Create a pandas DataFrame
data = {
    'Category': categories,
    'Product Name': product_names,
    'Product Price': product_prices
}
df = pd.DataFrame(data)

# Print the DataFrame
print(df)
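One thing to watch with the zip approach: the three find_all calls each search the whole page independently. If a product shows two prices (for example an old price and a sale price) or has no category, later rows can end up pairing a name with the wrong price. A possible alternative is to loop over each product card and look inside it. The sketch below uses made-up HTML and assumed class names (`product`, `name`, `category`), not the real Crafties markup.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the first product is on sale, so it has two <bdi> prices.
sample_html = """
<div class="product">
  <p class="category">Paint</p>
  <a class="name">Acrylic Set</a>
  <del><bdi>5,000</bdi></del> <ins><bdi>4,000</bdi></ins>
</div>
<div class="product">
  <p class="category">Brushes</p>
  <a class="name">Flat Brush</a>
  <bdi>1,200</bdi>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Page-wide zip: the second name is paired with the first product's sale price.
names = [a.get_text(strip=True) for a in soup.find_all("a", class_="name")]
all_prices = [b.get_text(strip=True) for b in soup.find_all("bdi")]
zipped = list(zip(names, all_prices))

# Per-card lookup: every field comes from the same product element.
rows = []
for card in soup.find_all("div", class_="product"):
    name = card.find("a", class_="name").get_text(strip=True)
    price_tags = card.find_all("bdi")
    # Assumption: the last <bdi> inside a card is the current price.
    price = price_tags[-1].get_text(strip=True) if price_tags else None
    rows.append((name, price))

print(zipped)
print(rows)
```

In this toy example the zipped list gives Flat Brush the price 4,000, while the per-card loop gives it 1,200.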

This is how I scraped data from the Kaenves store

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Create empty lists to store the data
product_names = []
prices = []

# Iterate through each page
for page in range(1, 4):
    # Send a GET request to the page
    url = f"https://www.kaenves.store/collections/floating-wood-frame?page={page}"
    response = requests.get(url)

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all span elements with the specified class
    price_elements = soup.find_all('span', class_='price-item price-item--regular')
    name_elements = soup.find_all('h3', class_='card__heading h5')

    # Extract the prices and product names
    for price_element, name_element in zip(price_elements, name_elements):
        price = price_element.get_text(strip=True)
        name = name_element.get_text(strip=True)
        product_names.append(name)
        prices.append(price)

# Create a pandas DataFrame
data = {'Product Name': product_names, 'Price': prices}
df = pd.DataFrame(data)

# Save the DataFrame as a CSV file
df.to_csv('paperandboard.csv', index=False)

This is how I scraped data from Art Easy

import requests
from bs4 import BeautifulSoup
import pandas as pd

prices = []
product_names = []

# Iterate over all 2 pages
for page_num in range(1, 3):
    url = f"https://arteasy.com.ng/product-category/canvas-surfaces/page/{page_num}/"

    # Send a GET request to the URL
    response = requests.get(url)

    # Parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all the span elements with class "price"
    product_prices = [span.get_text(strip=True) for span in soup.find_all("span", class_="price")]

    # Find all the h3 elements with class "product-title"
    product_names += [product_name.get_text(strip=True) for product_name in soup.find_all("h3", class_="product-title")]

    # Add the prices to the list
    prices += product_prices

# Check if the lengths of product_names and prices are equal
if len(product_names) == len(prices):
    # Create a pandas DataFrame
    data = {"Product Name": product_names, "Price": prices}
    df = pd.DataFrame(data)

    # Print the DataFrame
    print(df)
else:
    print("Error: The lengths of product_names and prices are not equal.")

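If you want to reuse this code, make sure to change the URL to your preferred e-commerce website, and change the class names to the product name and product price classes used on that site.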
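One way to make that swap easier is to pass the URL and the selectors in as parameters. The following is only a sketch: `parse_listing` and `scrape_pages` are hypothetical helper names, the CSS selectors in the demo are assumptions, and the `{}` in the URL template is assumed to be where the page number goes.

```python
import requests
from bs4 import BeautifulSoup


def parse_listing(html, product_selector, name_selector, price_selector):
    """Return (name, price) tuples for every product card found in html."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select(product_selector):
        name_tag = card.select_one(name_selector)
        price_tag = card.select_one(price_selector)
        if name_tag is None:
            continue  # skip cards that have no name element
        price = price_tag.get_text(strip=True) if price_tag else None
        rows.append((name_tag.get_text(strip=True), price))
    return rows


def scrape_pages(url_template, last_page, **selectors):
    """Fetch pages 1..last_page and parse each one with parse_listing."""
    rows = []
    for page in range(1, last_page + 1):
        response = requests.get(url_template.format(page), timeout=30)
        response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        rows.extend(parse_listing(response.text, **selectors))
    return rows


# Offline demo on made-up HTML, so no request is sent here.
demo_html = (
    '<li class="product"><h3 class="product-title">Canvas</h3>'
    '<span class="price">₦2,000</span></li>'
)
demo_rows = parse_listing(demo_html, "li.product", "h3.product-title", "span.price")
print(demo_rows)
```

For a real site you would call `scrape_pages` with that site's URL template and the selectors you found through Inspect.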

This information can be used for the following:

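Price comparison: You can use the scraped data to compare product prices across different websites. This can help you find the best deal on the product you want.
Product research: You can use the scraped data to research products. This can help you learn more about product features, specifications, and reviews.
Market analysis: You can use the scraped data to analyze the market for a particular product. This can help you identify trends and opportunities.
Product recommendations: You can use the scraped data to recommend products to users. This can help you increase sales and improve customer satisfaction.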
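As a rough illustration of the price comparison idea, the scraped price strings first need to become numbers. The sketch below uses made-up rows and a hypothetical `parse_price` helper. It assumes each price string holds a single amount (a range such as "₦1,000 – ₦2,000" would need extra handling) and that the same product has the same name in both stores, which real data often does not guarantee.

```python
import re


def parse_price(text):
    """Turn a scraped price string such as '₦1,500.00' into a float."""
    cleaned = re.sub(r"[^\d.]", "", text)  # keep only digits and the decimal point
    return float(cleaned) if cleaned else None


# Made-up rows for illustration: (store, product name, scraped price string).
scraped = [
    ("Store A", "Canvas 30x40", "₦4,500.00"),
    ("Store B", "Canvas 30x40", "₦4,200.00"),
    ("Store A", "Flat Brush", "₦1,200.00"),
    ("Store B", "Flat Brush", "₦1,350.00"),
]

cheapest = {}
for store, product, price_text in scraped:
    price = parse_price(price_text)
    if price is None:
        continue
    # Keep the store with the lowest price seen so far for each product.
    if product not in cheapest or price < cheapest[product][1]:
        cheapest[product] = (store, price)

print(cheapest)
```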