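Basic web scraping is one of the essential skills for a data analyst. Being able to collect your own data for a project is an underrated ability.
I recently scraped some data from 4 big art stores (websites) in Nigeria, and I want to share the code (including code from ChatGPT) for learning purposes, since other data analysts might find it useful.
The first website is Crafts Village. I scraped the Art-Tools category.
Code for scraping the website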
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

# Initialize lists to store the data
product_names = []
prices = []

# Scrape all 6 pages
for page in range(1, 7):
    # Put the page number in the URL; without it every request returns page 1.
    # The /page/N/ pattern is the usual WooCommerce pagination (assumed for this site).
    url = f"https://craftsvillage.com.ng/product-category/art-tools/page/{page}/"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the relevant HTML elements for product information
    products = soup.find_all("li", class_="product")

    # Extract data from each product element
    for product in products:
        # Product name
        name_element = product.find("a", class_="woocommerce-LoopProduct-link")
        if name_element is None:
            continue  # Skip list items that are not real product cards
        name = name_element.text.replace("\n", "").strip()
        name = re.sub(r"[₦\,|–]", "", name)  # Remove unwanted characters
        product_names.append(name)

        # Price
        price_element = product.find("bdi")
        price = price_element.text if price_element else None
        prices.append(price)

# Create a Pandas DataFrame from the scraped data
data = {
    "Product Name": product_names,
    "Price": prices
}
df = pd.DataFrame(data)

# Remove any leftover newline characters from the "Product Name" column
df["Product Name"] = df["Product Name"].str.replace("\n", "")

# Display the DataFrame
print(df)
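To get the class of the name element, I inspected it in the browser: I placed my cursor on the product name, right-clicked on my trackpad and clicked Inspect.
I did the same for the price.
The code above extracts the product names and prices from all 6 pages of the Art-Tools category.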
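A quick way to sanity-check the class names you copy from the Inspect panel is to try them on a small piece of HTML before running against the whole site. The snippet below is a minimal sketch. The HTML fragment is made up for illustration (it only mimics the kind of markup a WooCommerce shop produces) and is not copied from Crafts Village.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; paste the HTML you see in the
# browser's Inspect panel here to test your own class names.
sample_html = """
<ul class="products">
  <li class="product">
    <a class="woocommerce-LoopProduct-link" href="#">
      <h2>Palette Knife Set</h2>
    </a>
    <span class="price"><bdi>₦2,500.00</bdi></span>
  </li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# The tag name and class are the ones shown by Inspect.
name_element = soup.find("a", class_="woocommerce-LoopProduct-link")
price_element = soup.find("bdi")

# strip=True trims the whitespace and newlines around the text.
name_text = name_element.get_text(strip=True)
price_text = price_element.get_text(strip=True)
print(name_text, price_text)
```

If either `find` call returns `None`, the class name probably does not match what is on the page, so it is worth re-checking it in the Inspect panel.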
This is how I scraped information from Crafties Hobbies
import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://craftieshobbycraft.com/product-category/painting-drawing/page/{}/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Create lists to store data
categories = []
product_names = []
product_prices = []

# Iterate over each page (pages 1 to 7)
for page in range(1, 8):
    url = base_url.format(page)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Collect the category, name and price elements on this page
    category_elements = soup.find_all('p', class_='category uppercase is-smaller no-text-overflow product-cat op-7')
    product_names_elements = soup.find_all('a', class_='woocommerce-LoopProduct-link woocommerce-loop-product__link')
    product_prices_elements = soup.find_all('bdi')

    # zip() pairs the three lists by position and stops at the shortest one
    for category_element, product_name_element, product_price_element in zip(category_elements, product_names_elements, product_prices_elements):
        category = category_element.get_text(strip=True)
        product_name = product_name_element.get_text(strip=True)
        product_price = product_price_element.get_text(strip=True)
        categories.append(category)
        product_names.append(product_name)
        product_prices.append(product_price)

# Create a pandas DataFrame
data = {
    'Category': categories,
    'Product Name': product_names,
    'Product Price': product_prices
}
df = pd.DataFrame(data)

# Print the DataFrame
print(df)
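One thing to watch with the `zip` approach: the three `find_all` lists are matched purely by position, so if a product has no price, or a sale item shows two `bdi` amounts, every later row can shift. A possible alternative, sketched below, is to loop over each product card and look up the name and price inside it. The HTML and the `product` and `name` class names here are invented for the example, so swap in the ones from the real page.

```python
from bs4 import BeautifulSoup

# Made-up listing: the second product has no price and the third is on sale,
# so it carries two <bdi> amounts.
sample_html = """
<div class="product"><a class="name">Acrylic Paint</a><bdi>₦1,200</bdi></div>
<div class="product"><a class="name">Sketch Pad</a></div>
<div class="product"><a class="name">Brush Set</a><del><bdi>₦3,000</bdi></del><ins><bdi>₦2,400</bdi></ins></div>
"""


def parse_cards(html):
    """Return one (name, price) pair per product card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="product"):
        name_el = card.find("a", class_="name")
        # Take the last <bdi> in the card; in this sample that is the sale price.
        price_els = card.find_all("bdi")
        name = name_el.get_text(strip=True) if name_el else None
        price = price_els[-1].get_text(strip=True) if price_els else None
        rows.append((name, price))
    return rows


rows = parse_cards(sample_html)
for row in rows:
    print(row)
```

Because every lookup happens inside a single card, a missing price shows up as `None` on that row instead of pulling the next product's price into its place.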
This is how I scraped data from Kaenves store
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Create empty lists to store the data
product_names = []
prices = []

# Iterate through each page (pages 1 to 3)
for page in range(1, 4):
    # Send a GET request to the page
    url = f"https://www.kaenves.store/collections/floating-wood-frame?page={page}"
    response = requests.get(url)

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the price spans and the product name headings by class
    price_elements = soup.find_all('span', class_='price-item price-item--regular')
    name_elements = soup.find_all('h3', class_='card__heading h5')

    # Extract the prices and product names
    for price_element, name_element in zip(price_elements, name_elements):
        price = price_element.get_text(strip=True)
        name = name_element.get_text(strip=True)
        product_names.append(name)
        prices.append(price)

# Create a pandas DataFrame
data = {'Product Name': product_names, 'Price': prices}
df = pd.DataFrame(data)

# Save the DataFrame as a CSV file
df.to_csv('paperandboard.csv', index=False)
This is how I scraped data from Art Easy
import requests
from bs4 import BeautifulSoup
import pandas as pd

prices = []
product_names = []

# Iterate over both pages
for page_num in range(1, 3):
    url = f"https://arteasy.com.ng/product-category/canvas-surfaces/page/{page_num}/"

    # Send a GET request to the URL
    response = requests.get(url)

    # Parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all the span elements with class "price"
    product_prices = [span.get_text(strip=True) for span in soup.find_all("span", class_="price")]

    # Find all the h3 elements with class "product-title"
    product_names += [product_name.get_text(strip=True) for product_name in soup.find_all("h3", class_="product-title")]

    # Add the prices to the list
    prices += product_prices

# Check if the lengths of product_names and prices are equal
if len(product_names) == len(prices):
    # Create a pandas DataFrame
    data = {"Product Name": product_names, "Price": prices}
    df = pd.DataFrame(data)
    # Print the DataFrame
    print(df)
else:
    print("Error: The lengths of product_names and prices are not equal.")
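If you want to reuse this code, make sure you change the URL to your preferred e-commerce website and change the classes to the product name and product price classes used on that site.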
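To make that reuse easier, the pieces that change from site to site (the URL pattern, the number of pages and the two CSS classes) can be pulled into function arguments. This is only a sketch: `scrape_listing`, its parameter names and the `fetch` argument are my own invention, and the demo runs against a fake page and a placeholder URL instead of a real shop.

```python
import pandas as pd
from bs4 import BeautifulSoup


def scrape_listing(url_template, pages, name_selector, price_selector, fetch):
    """Collect product names and prices from paginated listing pages.

    url_template needs a {page} placeholder; fetch is any function that
    takes a URL and returns the page HTML as a string.
    """
    rows = []
    for page in range(1, pages + 1):
        html = fetch(url_template.format(page=page))
        soup = BeautifulSoup(html, "html.parser")
        names = [el.get_text(strip=True) for el in soup.select(name_selector)]
        prices = [el.get_text(strip=True) for el in soup.select(price_selector)]
        if len(names) != len(prices):
            # Skip pages where the selectors do not line up one-to-one.
            print(f"Page {page}: {len(names)} names but {len(prices)} prices, skipped")
            continue
        rows.extend(zip(names, prices))
    return pd.DataFrame(rows, columns=["Product Name", "Price"])


def fake_fetch(url):
    """Stand-in for a real download so the example runs offline."""
    return '<h3 class="product-title">Canvas 30x40</h3><span class="price">₦4,000</span>'


df = scrape_listing(
    "https://example.com/shop/page/{page}/",  # placeholder URL, not a real store
    pages=2,
    name_selector="h3.product-title",
    price_selector="span.price",
    fetch=fake_fetch,
)
print(df)
```

For a real site, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, with the selectors taken from the Inspect panel.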
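This information can be used for the following:
- Price comparison: You can use the scraped data to compare product prices across different websites. This can help you find the best deal on the products you need.
- Product research: You can use the scraped data to research products. This can help you learn more about product features, specifications and reviews.
- Market analysis: You can use the scraped data to analyze the market for a particular product. This can help you identify trends and opportunities.
- Product recommendations: You can use the scraped data to recommend products to users. This can help you increase sales and improve customer satisfaction.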
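For the price comparison use case, the scraped prices first need to become numbers, since they come back as text such as `₦2,500.00`. The sketch below shows one way to do that and then pick the cheapest store per product. The store labels, products and prices are placeholder values I made up, not scraped results, and `to_number` is a hypothetical helper that assumes each price string holds a single amount.

```python
import re

import pandas as pd


def to_number(price_text):
    """Turn a price string like '₦2,500.00' into a float (None if no digits)."""
    if price_text is None:
        return None
    # Keep only digits and the decimal point; assumes a single amount per string.
    cleaned = re.sub(r"[^\d.]", "", price_text)
    return float(cleaned) if cleaned else None


# Placeholder rows standing in for two scraped DataFrames.
store_a = pd.DataFrame({
    "Product Name": ["Canvas 30x40", "Brush Set"],
    "Price": ["₦4,000.00", "₦2,500.00"],
})
store_b = pd.DataFrame({
    "Product Name": ["Canvas 30x40", "Brush Set"],
    "Price": ["₦3,800.00", "₦2,750.00"],
})
store_a["Store"] = "Store A"
store_b["Store"] = "Store B"

# Stack both stores into one table and add a numeric price column.
combined = pd.concat([store_a, store_b], ignore_index=True)
combined["Price (NGN)"] = combined["Price"].apply(to_number)

# For each product, keep the row with the lowest numeric price.
cheapest = combined.loc[combined.groupby("Product Name")["Price (NGN)"].idxmin()]
print(cheapest[["Product Name", "Store", "Price (NGN)"]])
```

In practice product names rarely match exactly between stores, so some manual cleanup of the names is usually needed before a comparison like this is meaningful.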