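Fashionphile is a popular online platform for second-hand luxury fashion items. It's known for careful product curation, which makes it an ideal source of particularly high-quality data on second-hand luxury fashion.
In this tutorial, we'll take a quick look at how to scrape Fashionphile.com using Python and the hidden web data scraping technique. This is a very easy scrape, so let's dive in!
Why Scrape Fashionphile?
The luxury fashion market is growing rapidly, and so is the related second-hand market. Fashionphile is one of the biggest storefronts in this area (alongside Vestiaire Collective, StockX and others). It's an important data source for fashion brands, retailers and market researchers. Scraping and tracking product performance can be a major competitive advantage as well as a useful business analytics tool.
For more on web scraping uses, see our web scraping use case hub.
Scrape Preview
In this tutorial, we'll focus on scraping product data, using the hidden web data scraping technique to capture the entire available dataset. Here's a JSON example of the final dataset we'll be able to scrape by the end of this guide:
Example Fashionphile product dataset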
{
"id": 1048096,
"sku": "BW",
"title": "BOTTEGA VENETA Nappa Twisted Padded Intrecciato Curve Slide Sandals 36 Black",
"slug": "/p/bottega-veneta-nappa-twisted-padded-intrecciato-curve-slide-sandals-36-black-1048096",
"price": 950,
"renewDays": 0,
"salePrice": null,
"retailPrice": 1650,
"discountedPrice": 900,
"discountEnabled": 1,
"discountedTier": 1,
"isSuperSale": 0,
"madeAvailableAt": "2023-03-10 23:59:22",
"madeAvailableAtUTC": "2023-03-11 07:59:22",
"soldAt": null,
"viewCount": 0,
"length": 0,
"width": 0,
"height": 0,
"drop": 0,
"weight": 1,
"season": null,
"year": null,
"location": "New York, New York",
"condition": "Excellent",
"conditions": [
"scuffs",
"imprints",
"marks on sole(s)"
],
"productColors": null,
"productColorsAndQuantitiesMap": null,
"isFashionphileMerchandise": false,
"isSwagItem": false,
"isGiftCard": false,
"isQualifiedForLayaway": true,
"isTooNewForLayaway": false,
"isEligibleForBuyBack": false,
"isJewelry": false,
"description": "This is an authentic pair of BOTTEGA VENETA Nappa Intrecciato Padded BV Curve Sandals size 36 in Black. These stylish strappy sandals are crafted of padded and twisted Intrecciato leather in black. These heels feature interwoven strap detailing and a 4-inch heel.",
"exteriorDescription": null,
"handleDescription": null,
"interiorDescription": null,
"hardwareDescription": null,
"conditionDescription": null,
"titleWithoutBrand": "Nappa Twisted Padded Intrecciato Curve Slide Sandals 36 Black",
"saleDurationInDays": null,
"brand": [
{
"id": 89,
"name": "Bottega Veneta",
"slug": "bottega-veneta",
"type": "brand",
"description": "Shop authentic used Bottega Veneta shoes & handbags at a discounted price. FASHIONPHILE has the largest selection of used Bottega Veneta on sale online.",
"title": "Shop Bottega Veneta | Cassette, Jodie, & Pouch Handbags | FASHIONPHILE",
"parent_id": null,
"classification": "0",
"bio": "Bottega Veneta (translated as “Venetian shop”) is a luxury fashion house that was established in 1966 by Michele Taddei and Renzo Zengiaro. Best known for its leather goods, Bottega Veneta developed their own weaving method, called “intrecciato,” that crosses the leather in a braid-like pattern. This interwoven design would become the brand’s trademark. It was the beautifully handcrafted designs and the quality of their materials, which were further accentuated by an unassuming and logo-less design, that gained Bottega Veneta notoriety in those early years. \r\n\r\nCo-founder Renzo Zengiaro left Bottega Veneta in the late 1970s, with Michele Taddei following suit a few years later. Taddei’s ex-wife, Laura Moltedo, and her husband Vittorio moved from the States to Italy to take ownership of the company. \r\n\r\nThe decade of the 1980s saw the rise of Bottega Veneta’s popularity among celebrities around the world. Andy Warhol was one of Bottega Veneta’s most fervent fans, and the famous artist even made a short film to advertise the brand. But despite these efforts, the company took a financial downturn. In response, Bottega Veneta changed its design in the 1990s to one that more directly reflected the trends of the time.\r\n\r\nThe Gucci Group bought Bottega Veneta in 2001, with German fashion designer Tomas Maier as the company’s new Creative Director. He presented his first collection that year as the brand’s 2002 Spring/Summer Collection. Formerly affiliated with prestigious fashion houses Sonia Rykiel and Hermès, Maier brought his vast experience to Bottega Veneta and worked to restore the brand’s original and distinctive identity. To bring this about, Maier made the decision to strip any visible logos from products and include more of the brand’s original handcrafted work, including the intrecciato weave that formerly characterized the brand. 
These changes worked and the Bottega Veneta company, and image, was revived.\r\n\r\nBottega Veneta began introducing new additions to its existing lines, including fine jewelry and fragrance as well as handbags, small leather goods, shoes, gifts, and even home furniture. In 2005, the company released a women’s ready-to-wear line — the brand’s first — and followed it up with a men’s line in 2006. That same year, the company opened the Scuola della Pelletteria, a training school with the purpose of supporting the dwindling number of leatherworkers dedicated to the art of handcrafted design. It is from this school that the brand will select future leather artisans for Bottega Veneta. \r\n\r\nThough Bottega Veneta offers an assortment of clothing, fragrances, and home furnishings, their leather goods remain the company’s specialty. Bottega Veneta handbags, with their quintessential interwoven straps of leather, are considered by many as the height of sophistication.\r\n",
"is_feature": 0,
"is_enabled_for_quotes": 1,
"quote_image_angles": "",
"is_outlet_brand": 0,
"is_eligible_for_buyback": 1,
"created_at": "2016-03-29 11:25:43",
"updated_at": "2022-05-20 16:03:24",
"deleted_at": null,
"is_enabled_for_authentication_prediction": 0,
"pivot": {
"product_id": 1048096,
"category_id": 89
}
}
],
"measurements": [
{
"id": 5135876,
"product_id": 1048096,
"type": "size",
"unit": "EU",
"value": 36,
"adjustment_value": null
},
{
"id": 5135877,
"product_id": 1048096,
"type": "heel",
"unit": "in",
"value": 4,
"adjustment_value": null
}
],
"shipsWith": "2 dust bags, box",
"designerId": null,
"color": "Black",
"brandName": "Bottega Veneta",
"categories": [
{
"id": 168,
"name": "Shoes"
},
{
"id": 706,
"name": "Alfresco Accents"
},
{
"id": 677,
"name": "Our Gift to You"
},
{
"id": 419,
"name": "RSVP-Worthy"
},
{
"id": 680,
"name": "Spring Refresh Offer"
},
{
"id": 679,
"name": "Vacation Mode"
},
{
"id": 624,
"name": "Woven Wants"
},
{
"id": 724,
"name": "Year-End Event"
},
{
"id": 192,
"name": "Black"
},
{
"id": 205,
"name": "Leather"
},
{
"id": 323,
"name": "Solid Color"
},
{
"id": 350,
"name": "36"
},
{
"id": 380,
"name": "Pumps"
},
{
"id": 381,
"name": "Sandals"
},
{
"id": 164,
"name": "Accessories"
},
{
"id": 451,
"name": "Spring Style"
}
],
"isExcludedFromPromo": false,
"subCategories": [
"Shoes",
"Alfresco Accents",
"Our Gift to You",
"RSVP-Worthy",
"Spring Refresh Offer",
"Vacation Mode",
"Woven Wants",
"Year-End Event",
"Black",
"Leather",
"Solid Color",
"36",
"Pumps",
"Sandals",
"Spring Style"
],
"giftable": false,
"lastCall": false,
"featuredImage": {
"large": "https://prod-images.fashionphile.com/large/06c36eb9816bf3e6be63834eb7d33200/eaa3a63349a686dadb8198c8cdabc386.jpg",
"main": "https://prod-images.fashionphile.com/main/06c36eb9816bf3e6be63834eb7d33200/eaa3a63349a686dadb8198c8cdabc386.jpg",
"thumb": "https://prod-images.fashionphile.com/thumb/06c36eb9816bf3e6be63834eb7d33200/eaa3a63349a686dadb8198c8cdabc386.jpg"
},
"images": [
{
"thumb": "https://prod-images.fashionphile.com/thumb/06c36eb9816bf3e6be63834eb7d33200/eaa3a63349a686dadb8198c8cdabc386.jpg",
"main": "https://prod-images.fashionphile.com/main/06c36eb9816bf3e6be63834eb7d33200/eaa3a63349a686dadb8198c8cdabc386.jpg",
"large": "https://prod-images.fashionphile.com/large/06c36eb9816bf3e6be63834eb7d33200/eaa3a63349a686dadb8198c8cdabc386.jpg",
"altText": "Bottega Veneta Nappa Twisted Padded Intrecciato Curve Slide Sandals 36 Black image 1 of 10"
},
{
"thumb": "https://prod-images.fashionphile.com/thumb/06c36eb9816bf3e6be63834eb7d33200/609f080b0b90e1d9a8e6d2b4b164ac91.jpg",
"main": "https://prod-images.fashionphile.com/main/06c36eb9816bf3e6be63834eb7d33200/609f080b0b90e1d9a8e6d2b4b164ac91.jpg",
"large": "https://prod-images.fashionphile.com/large/06c36eb9816bf3e6be63834eb7d33200/609f080b0b90e1d9a8e6d2b4b164ac91.jpg",
"altText": "Bottega Veneta Nappa Twisted Padded Intrecciato Curve Slide Sandals 36 Black image 2 of 10"
},
{
"thumb": "https://prod-images.fashionphile.com/thumb/06c36eb9816bf3e6be63834eb7d33200/7babd761c2efc32c7949579820f7e732.jpg",
"main": "https://prod-images.fashionphile.com/main/06c36eb9816bf3e6be63834eb7d33200/7babd761c2efc32c7949579820f7e732.jpg",
"large": "https://prod-images.fashionphile.com/large/06c36eb9816bf3e6be63834eb7d33200/7babd761c2efc32c7949579820f7e732.jpg",
"altText": "Bottega Veneta Nappa Twisted Padded Intrecciato Curve Slide Sandals 36 Black image 3 of 10"
},
{
"thumb": "https://prod-images.fashionphile.com/thumb/06c36eb9816bf3e6be63834eb7d33200/8e3bf43e3fcc1202db72c3693eace5d0.jpg",
"main": "https://prod-images.fashionphile.com/main/06c36eb9816bf3e6be63834eb7d33200/8e3bf43e3fcc1202db72c3693eace5d0.jpg",
"large": "https://prod-images.fashionphile.com/large/06c36eb9816bf3e6be63834eb7d33200/8e3bf43e3fcc1202db72c3693eace5d0.jpg",
"altText": "Bottega Veneta Nappa Twisted Padded Intrecciato Curve Slide Sandals 36 Black image 4 of 10"
},
{
"thumb": "https://prod-images.fashionphile.com/thumb/06c36eb9816bf3e6be63834eb7d33200/e144283f721ab625d5d10980d2782f8d.jpg",
"main": "https://prod-images.fashionphile.com/main/06c36eb9816bf3e6be63834eb7d33200/e144283f721ab625d5d10980d2782f8d.jpg",
"large": "https://prod-images.fashionphile.com/large/06c36eb9816bf3e6be63834eb7d33200/e144283f721ab625d5d10980d2782f8d.jpg",
"altText": "Bottega Veneta Nappa Twisted Padded Intrecciato Curve Slide Sandals 36 Black image 5 of 10"
},
{
"thumb": "https://prod-images.fashionphile.com/thumb/06c36eb9816bf3e6be63834eb7d33200/902794b1806144a205924db1f4f74bd3.jpg",
"main": "https://prod-images.fashionphile.com/main/06c36eb9816bf3e6be63834eb7d33200/902794b1806144a205924db1f4f74bd3.jpg",
"large": "https://prod-images.fashionphile.com/large/06c36eb9816bf3e6be63834eb7d33200/902794b1806144a205924db1f4f74bd3.jpg",
"altText": "Bottega Veneta Nappa Twisted Padded Intrecciato Curve Slide Sandals 36 Black image 6 of 10"
},
{
"thumb": "https://prod-images.fashionphile.com/thumb/06c36eb9816bf3e6be63834eb7d33200/768cda285b970f0f1e1e997698bb8bfa.jpg",
"main": "https://prod-images.fashionphile.com/main/06c36eb9816bf3e6be63834eb7d33200/768cda285b970f0f1e1e997698bb8bfa.jpg",
"large": "https://prod-images.fashionphile.com/large/06c36eb9816bf3e6be63834eb7d33200/768cda285b970f0f1e1e997698bb8bfa.jpg",
"altText": "Bottega Veneta Nappa Twisted Padded Intrecciato Curve Slide Sandals 36 Black image 7 of 10"
},
{
"thumb": "https://prod-images.fashionphile.com/thumb/06c36eb9816bf3e6be63834eb7d33200/dd1dca41b0823810c484c91535b7ca4c.jpg",
"main": "https://prod-images.fashionphile.com/main/06c36eb9816bf3e6be63834eb7d33200/dd1dca41b0823810c484c91535b7ca4c.jpg",
"large": "https://prod-images.fashionphile.com/large/06c36eb9816bf3e6be63834eb7d33200/dd1dca41b0823810c484c91535b7ca4c.jpg",
"altText": "Bottega Veneta Nappa Twisted Padded Intrecciato Curve Slide Sandals 36 Black image 8 of 10"
},
{
"thumb": "https://prod-images.fashionphile.com/thumb/06c36eb9816bf3e6be63834eb7d33200/b9d9625bfdde85cdd0f679a62d507971.jpg",
"main": "https://prod-images.fashionphile.com/main/06c36eb9816bf3e6be63834eb7d33200/b9d9625bfdde85cdd0f679a62d507971.jpg",
"large": "https://prod-images.fashionphile.com/large/06c36eb9816bf3e6be63834eb7d33200/b9d9625bfdde85cdd0f679a62d507971.jpg",
"altText": "Bottega Veneta Nappa Twisted Padded Intrecciato Curve Slide Sandals 36 Black image 9 of 10"
},
{
"thumb": "https://prod-images.fashionphile.com/thumb/06c36eb9816bf3e6be63834eb7d33200/718d97bb4e4f6c3d68b74856430378de.jpg",
"main": "https://prod-images.fashionphile.com/main/06c36eb9816bf3e6be63834eb7d33200/718d97bb4e4f6c3d68b74856430378de.jpg",
"large": "https://prod-images.fashionphile.com/large/06c36eb9816bf3e6be63834eb7d33200/718d97bb4e4f6c3d68b74856430378de.jpg",
"altText": "Bottega Veneta Nappa Twisted Padded Intrecciato Curve Slide Sandals 36 Black image 10 of 10"
}
],
"followingCount": "40",
"breadcrumbs": [
{
"label": "Bottega Veneta: All",
"href": "/shop/brands/bottega-veneta"
},
{
"label": "accessories",
"href": "/shop/categories/accessories?brands=bottega-veneta"
},
{
"label": "Shoes",
"href": "/shop/accessories/shoes?brands=bottega-veneta"
},
{
"label": "BOTTEGA VENETA Nappa Twisted Padded Intrecciato Curve Slide Sandals 36 Black"
}
],
"primaryCategory": "Shoes",
"conditionsMap": {
"interior": [
"scuffs",
"imprints"
],
"other": [
"marks on sole(s)"
]
},
"pullRequestedAt": null,
"isWatch": false,
"isPurchasable": true,
"parentCategory": "Accessories",
"daysOnSale": 31,
"recommendedProducts": [],
"brandUrl": "/shop/brands/bottega-veneta",
"isSizeRef": false,
"conditionsText": "scuffs, imprints, marks on sole(s)",
"discount": "5% off",
"dos": 5,
"shipsWithList": [
"2 dust bags",
" box"
],
"oldSlug": "bottega-veneta-nappa-twisted-padded-intrecciato-curve-slide-sandals-36-black-1048096",
"layawayDownpaymentAmount": "$225",
"url": "https://apigateway.fashionphile.com/product/1048096",
"productType": "BRAND_PRODUCT",
"authenticCta": "We guarantee this is an authentic Bottega Veneta item or 100% of your money back. ",
"disclaimer": "Bottega Veneta\n is a registered trademark of\n Bottega Veneta. FASHIONPHILE is not affiliated with\n Bottega Veneta."
}
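A dataset this large is usually trimmed down before it's stored. As a rough illustration, the sketch below picks a handful of fields from a product dictionary shaped like the sample above and derives the saving against the retail price. The `summarize_product` helper and the choice of fields are assumptions made for this example, not something Fashionphile or the scraper defines.

```python
def summarize_product(product: dict) -> dict:
    """Reduce a full product dataset to a few commonly tracked fields."""
    # prefer the discounted price when one is present
    price = product.get("discountedPrice") or product.get("price")
    retail = product.get("retailPrice")
    # share of the retail price saved, in percent; None when a price is missing
    saving = round((1 - price / retail) * 100, 1) if price and retail else None
    return {
        "id": product["id"],
        "title": product["title"],
        "brand": product.get("brandName"),
        "condition": product.get("condition"),
        "price": price,
        "retail_price": retail,
        "saving_percent": saving,
        "image_count": len(product.get("images") or []),
    }


# trimmed-down copy of the sample dataset (images left out for brevity)
sample = {
    "id": 1048096,
    "title": "BOTTEGA VENETA Nappa Twisted Padded Intrecciato Curve Slide Sandals 36 Black",
    "price": 950,
    "retailPrice": 1650,
    "discountedPrice": 900,
    "brandName": "Bottega Veneta",
    "condition": "Excellent",
    "images": [],
}
summary = summarize_product(sample)
print(summary)
```

Keeping the reduction in one function makes it easy to change which fields are tracked without touching the scraping code.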
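Setup
To scrape Fashionphile we'll only need a few Python packages commonly used in web scraping. Since we'll be using the hidden web data scraping approach, all we need is an HTTP client and a CSS selector engine:
- httpx - a powerful HTTP client, which we'll use to retrieve the HTML pages.
- parsel - an HTML parser, which we'll use with CSS selectors to extract the hidden JSON datasets.
These packages can be installed using Python's pip console command (the http2 extra is needed because the examples enable HTTP/2 in httpx):
$ pip install "httpx[http2]" parsel
For Scrapfly users, each code example also comes with a Scrapfly SDK version. The SDK can be installed using pip as well: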
$ pip install "scrapfly-sdk[all]"
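Scraping Product Data
To start, let's take a look at how to scrape a single product page. For example, let's take a product from the discounted section of the website:
fashionphile.com/p/bottega-veneta-nappa-twisted-padded-intrecciato-curve-slide-sandals-36-black-1048096
We could use traditional HTML parsing tools such as XPath to parse the product details from the page HTML, but modern web scraping techniques make this task much easier!
Instead, if we look at the page source and search (Ctrl+F) for a unique product identifier (such as the description, title or product code), we can see that the entire product dataset is available as JSON in the page source.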
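This indicates that the website uses a modern JavaScript framework (such as React or Next.js) that stores its dataset in the HTML body. Here, the dataset can be found under the <script id="__NEXT_DATA__"> HTML element.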
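To make the location of the hidden dataset concrete, here is a minimal sketch that pulls the JSON out of a tiny, made-up HTML page. It uses Python's built-in `html.parser` rather than a CSS selector engine so it runs without any dependencies; the `NextDataParser` class and the sample HTML are assumptions for illustration only.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


# made-up page that mimics how Next.js embeds its page state
sample_html = """
<html><body>
<div id="__next">rendered page</div>
<script id="__NEXT_DATA__" type="application/json">{"props": {"pageProps": {"title": "example"}}}</script>
</body></html>
"""

parser = NextDataParser()
parser.feed(sample_html)
hidden_data = json.loads("".join(parser.chunks))
print(hidden_data["props"]["pageProps"])
```

A CSS selector does the same job in a single line, which is why the rest of this guide relies on parsel.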
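This is called hidden web data scraping, and it's a simple and effective way to scrape data from websites that use JavaScript frameworks such as Next.js. To scrape it, all we have to do is:
- Retrieve the product's HTML page.
- Find the hidden JSON dataset using CSS selectors (with parsel).
- Load the JSON as a Python dictionary using json.loads.
- Select the product fields.
In Python, this takes only a few lines of code: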
Python (httpx and parsel):
import asyncio
import json

import httpx
from parsel import Selector

# create an HTTP client with web-browser-like headers and HTTP/2 support
client = httpx.AsyncClient(
    follow_redirects=True,
    http2=True,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
)


def find_hidden_data(html: str) -> dict:
    """extract hidden web cache from page html"""
    # use a CSS selector to find the script tag that holds the data
    data = Selector(html).css("script#__NEXT_DATA__::text").get()
    return json.loads(data)


async def scrape_product(url: str) -> dict:
    """scrape a single Fashionphile product page for product data"""
    # retrieve page HTML
    response = await client.get(url)
    # find hidden web data
    data = find_hidden_data(response.text)
    # extract only product data from the page dataset
    product = data["props"]["pageProps"]["initialState"]["productPageReducer"]["productData"]
    return product


# example scrape run:
print(asyncio.run(scrape_product("https://www.fashionphile.com/p/bottega-veneta-nappa-twisted-padded-intrecciato-curve-slide-sandals-36-black-1048096")))
Scrapfly SDK:
import asyncio
import json
from urllib.parse import parse_qs, urlencode, urlparse

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")


def find_hidden_data(result: ScrapeApiResponse) -> dict:
    """extract hidden NEXT_DATA from page html"""
    data = result.selector.css("script#__NEXT_DATA__::text").get()
    data = json.loads(data)
    return data


async def scrape_product(url: str) -> dict:
    """scrape a single Fashionphile product page for product data"""
    result = await scrapfly.async_scrape(
        ScrapeConfig(
            url=url,
            cache=True,  # reuse cached responses for repeated requests
            asp=True,  # enable anti scraping protection bypass
        )
    )
    data = find_hidden_data(result)
    product = data["props"]["pageProps"]["initialState"]["productPageReducer"]["productData"]
    return product


def update_url_parameter(url, **params):
    """update url query parameters of a url with new values"""
    current_params = parse_qs(urlparse(url).query)
    updated_query_params = urlencode({**current_params, **params}, doseq=True)
    return f"{url.split('?')[0]}?{updated_query_params}"


# example scrape
example = scrape_product(
    "https://www.fashionphile.com/p/bottega-veneta-nappa-twisted-padded-intrecciato-curve-slide-sandals-36-black-1048096"
)
print(asyncio.run(example))
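The chained key lookup in `scrape_product` raises a bare `KeyError` if the page layout changes or a block page comes back instead of the product. One possible safeguard, sketched here with a hypothetical `dig` helper that is not part of the original scraper, is to walk the path key by key and report exactly where it broke:

```python
def dig(data: dict, *path: str):
    """Follow a chain of keys and raise a descriptive error at the first missing one."""
    current = data
    for depth, key in enumerate(path):
        if not isinstance(current, dict) or key not in current:
            found = " > ".join(path[:depth]) or "<root>"
            raise KeyError(f"key {key!r} not found under {found}")
        current = current[key]
    return current


PRODUCT_PATH = ("props", "pageProps", "initialState", "productPageReducer", "productData")

# stand-in for the parsed __NEXT_DATA__ of a product page
page_data = {"props": {"pageProps": {"initialState": {"productPageReducer": {"productData": {"id": 1048096}}}}}}
product = dig(page_data, *PRODUCT_PATH)
print(product)

# stand-in for a page that did not contain product data
try:
    dig({"props": {"pageProps": {}}}, *PRODUCT_PATH)
except KeyError as error:
    message = error.args[0]
    print(message)
```

An error that names the missing key makes it much quicker to tell a blocked request from a changed page structure.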
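Scraping Search and Categories
Now that we know how to scrape data for a single product, let's take a look at how to scale up our scraper. To find more products, we can use the search page or explore each individual category. Every directory (search or category page) is paginated, which means we need to scrape multiple pages to collect the product data.
For example, let's take a look at the "Sale" category page:
fashionphile.com/shop/discounted/all
We can see that it consists of dozens of pages and, just like the product page, each one contains hidden web data in the same place. Only this time, the hidden web data holds the whole page's listing rather than a single product's data.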
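So, to scrape Fashionphile's paginated sections we'll use a very simple pagination scraping technique:
- Scrape the first page of the directory/search.
- Find the hidden web data (using parsel and CSS selectors).
- Extract the product data from the hidden web data.
- Extract the total page count from the hidden web data.
- Repeat the same for the remaining pages concurrently.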
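For the last step, each remaining page needs its own URL. Below is a small sketch of how those URLs could be built with the standard library, using the same `page` query parameter as the scraper in this guide. The `build_page_urls` name and the page count of 40 are made up for this example; the listing URL is taken from the breadcrumbs in the sample dataset.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_page_urls(first_page_url: str, total_pages: int, max_pages: int = 0) -> list:
    """Return URLs for pages 2..N, keeping any existing query parameters."""
    # optionally cap how many pages are scraped
    if max_pages and max_pages < total_pages:
        total_pages = max_pages
    current_params = parse_qs(urlparse(first_page_url).query)
    base = first_page_url.split("?")[0]
    urls = []
    for page in range(2, total_pages + 1):
        params = {**current_params, "page": page}
        urls.append(f"{base}?{urlencode(params, doseq=True)}")
    return urls


page_urls = build_page_urls(
    "https://www.fashionphile.com/shop/accessories/shoes?brands=bottega-veneta",
    total_pages=40,  # made-up value; the real count comes from the hidden web data
    max_pages=3,
)
for url in page_urls:
    print(url)
```

Page 1 is skipped on purpose because it has already been scraped to find the total page count.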
In practical Python, this looks like:
Python (httpx and parsel):
import asyncio
import json
from typing import Dict, List
from urllib.parse import parse_qs, urlencode, urlparse

import httpx
from parsel import Selector

# create an HTTP client with web-browser-like headers and HTTP/2 support
client = httpx.AsyncClient(
    follow_redirects=True,
    http2=True,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
    limits=httpx.Limits(max_connections=3),  # we can limit concurrency to prevent blocking
)


def find_hidden_data(html: str) -> dict:
    """extract hidden web cache from page html"""
    # use a CSS selector to find the script tag that holds the data
    data = Selector(html).css("script#__NEXT_DATA__::text").get()
    return json.loads(data)


def update_url_parameter(url, **params):
    """update url query parameters of a url with new values"""
    current_params = parse_qs(urlparse(url).query)
    updated_query_params = urlencode({**current_params, **params}, doseq=True)
    return f"{url.split('?')[0]}?{updated_query_params}"


async def scrape_paging(url: str, max_pages: int = 10) -> List[Dict]:
    """scrape all pages of a paginated listing, up to max_pages"""
    print(f"scraping product discovery paging {url}")
    # scrape first page
    response_first_page = await client.get(url)
    data_first_page = find_hidden_data(response_first_page.text)
    data_first_page = data_first_page["props"]["pageProps"]["initialState"]["listingPageReducer"]["listingData"]
    results = data_first_page["results"]

    # find total page count
    total_pages = data_first_page["pages"]
    if max_pages and max_pages < total_pages:
        total_pages = max_pages

    # scrape remaining pages
    print(f"scraping remaining total pages: {total_pages-1} concurrently")
    to_scrape = [
        asyncio.create_task(client.get(update_url_parameter(url, page=page)))
        for page in range(2, total_pages + 1)
    ]
    for response in await asyncio.gather(*to_scrape):
        data = find_hidden_data(response.text)
        data = data["props"]["pageProps"]["initialState"]["listingPageReducer"]["listingData"]
        results.extend(data["results"])
    return results


# example scrape run - scrape first 3 pages of discounted products:
print(asyncio.run(scrape_paging("https://www.fashionphile.com/shop/discounted/all", max_pages=3)))
Scrapfly SDK:
import asyncio
import json
from typing import Dict, List
from urllib.parse import parse_qs, urlencode, urlparse

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")


def find_hidden_data(result: ScrapeApiResponse) -> dict:
    """extract hidden NEXT_DATA from page html"""
    data = result.selector.css("script#__NEXT_DATA__::text").get()
    data = json.loads(data)
    return data


def update_url_parameter(url, **params):
    """update url query parameters of a url with new values"""
    current_params = parse_qs(urlparse(url).query)
    updated_query_params = urlencode({**current_params, **params}, doseq=True)
    return f"{url.split('?')[0]}?{updated_query_params}"


async def scrape_paging(url: str, max_pages: int = 10) -> List[Dict]:
    """scrape all pages of a paginated listing, up to max_pages"""
    print(f"scraping product discovery paging {url}")
    # scrape first page
    result_first_page = await scrapfly.async_scrape(ScrapeConfig(url=url, asp=True))
    data_first_page = find_hidden_data(result_first_page)
    data_first_page = data_first_page["props"]["pageProps"]["initialState"]["listingPageReducer"]["listingData"]
    results = data_first_page["results"]

    # find total page count
    total_pages = data_first_page["pages"]
    if max_pages and max_pages < total_pages:
        total_pages = max_pages

    # scrape remaining pages
    print(f"scraping remaining total pages: {total_pages-1} concurrently")
    to_scrape = [ScrapeConfig(update_url_parameter(url, page=page), asp=True) for page in range(2, total_pages + 1)]
    async for result in scrapfly.concurrent_scrape(to_scrape):
        data = find_hidden_data(result)
        data = data["props"]["pageProps"]["initialState"]["listingPageReducer"]["listingData"]
        results.extend(data["results"])
    return results


# example scrape run - scrape first 3 pages of discounted products:
example = scrape_paging("https://www.fashionphile.com/shop/discounted/all", max_pages=3)
print(asyncio.run(example))
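The httpx client above caps concurrency through `max_connections`. The same idea can also be expressed at the task level with `asyncio.Semaphore`, as sketched below. Note that `fetch_page` is a hypothetical stand-in that sleeps instead of making a request, so the example runs offline; in a real scraper it would wrap the HTTP call.

```python
import asyncio

MAX_CONCURRENCY = 3
active = 0  # requests currently in flight
peak_active = 0  # highest number of simultaneous requests observed


async def fetch_page(page: int) -> dict:
    """Hypothetical stand-in for an HTTP request; sleeps instead of using the network."""
    global active, peak_active
    active += 1
    peak_active = max(peak_active, active)
    await asyncio.sleep(0.01)
    active -= 1
    return {"page": page, "results": [f"product-{page}-{i}" for i in range(2)]}


async def scrape_pages(pages: range) -> list:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def limited(page: int) -> dict:
        # at most MAX_CONCURRENCY coroutines get past this point at once
        async with semaphore:
            return await fetch_page(page)

    results = []
    # gather returns responses in the same order as the input pages
    for response in await asyncio.gather(*(limited(page) for page in pages)):
        results.extend(response["results"])
    return results


all_results = asyncio.run(scrape_pages(range(2, 12)))
print(len(all_results), "results, peak concurrency:", peak_active)
```

A semaphore is handy when the limit should apply to a whole unit of work (request plus parsing, for example) rather than only to open connections.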
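Bypass Fashionphile Blocking with Scrapfly
Finally, to scale up our scraper and scrape all of the results, we'll need a way around the methods Fashionphile uses to identify and block scrapers. For that, we can use the Scrapfly web scraping API, which retrieves the page contents for us.
Scrapfly can power up scrapers with the following features: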
- Millions of Residential Proxies
- Anti Scraping Protection bypass
- Javascript rendering and headless cloud browsers
- Web dashboard for monitoring and managing scrapers
All of these tools are easily accessible through the Python SDK:
from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
    url="https://www.fashionphile.com/p/bottega-veneta-nappa-twisted-padded-intrecciato-curve-slide-sandals-36-black-1048096",
    # enable scraper blocking service bypass
    asp=True,
    # optional - render javascript using headless browsers:
    render_js=True,
))
print(result.content)
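FAQ
To wrap up this guide on web scraping Fashionphile, let's take a look at some frequently asked questions.
Is it legal to scrape Fashionphile?
Yes. All of the data we scrape is publicly available, which is perfectly legal to scrape. So, as long as we don't damage the website, scraping Fashionphile.com product data is perfectly legal.
Can Fashionphile be crawled?
Yes. Crawling is a form of web scraping where the scraper discovers product listings on its own. Fashionphile offers many opportunities for web crawling, such as using sitemaps to discover product pages or following related product sections.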
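As a rough illustration of sitemap-based discovery, the sketch below parses a sitemap document with the standard library and keeps only the product URLs. The XML is a made-up sample in the standard sitemap format, and the `/p/` filter is an assumption based on the product URL used in this guide; Fashionphile's real sitemap layout may differ.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemap protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# made-up sitemap content; a real one would be downloaded from the website
sample_sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.fashionphile.com/shop/discounted/all</loc></url>
  <url><loc>https://www.fashionphile.com/p/example-product-1</loc></url>
  <url><loc>https://www.fashionphile.com/p/example-product-2</loc></url>
</urlset>"""


def find_product_urls(sitemap_xml: str) -> list:
    """Return all <loc> values that look like product pages."""
    root = ET.fromstring(sitemap_xml.encode())
    locations = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]
    return [url for url in locations if "/p/" in url]


product_urls = find_product_urls(sample_sitemap)
print(product_urls)
```

Each discovered URL could then be passed to the `scrape_product` function from earlier in this guide.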
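Summary
In this quick guide, we used Python and hidden web data scraping to scrape Fashionphile product data. To retrieve the product data, we used parsel with CSS selectors to extract the hidden web data from the <script id="__NEXT_DATA__"> element. Then, all we had to do was select the product data from the page dataset.
To find more products, we explored scraping of search and category pages. We followed a simple pagination scraping technique that scrapes all pages using the same hidden web data scraping approach.
Finally, we used the Scrapfly web scraping API to scale up our scraper and scrape all of the results. Try it for free!
Full Scraper Code
Here's the full Fashionphile product scraper using Python and the Scrapfly Python SDK:
This code should only be used as a reference. To scrape data from Fashionphile at scale, you'll need to adjust it to your preferences and environment.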
import asyncio
import json
from pathlib import Path
from typing import Dict, List
from urllib.parse import parse_qs, urlencode, urlparse

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")


def find_hidden_data(result: ScrapeApiResponse) -> dict:
    """extract hidden NEXT_DATA from page html"""
    data = result.selector.css("script#__NEXT_DATA__::text").get()
    data = json.loads(data)
    return data


async def scrape_product(url: str) -> dict:
    """scrape a single Fashionphile product page for product data"""
    result = await scrapfly.async_scrape(
        ScrapeConfig(
            url=url,
            cache=True,  # reuse cached responses for repeated requests
            asp=True,  # enable anti scraping protection bypass
        )
    )
    data = find_hidden_data(result)
    product = data["props"]["pageProps"]["initialState"]["productPageReducer"]["productData"]
    return product


def update_url_parameter(url, **params):
    """update url query parameters of a url with new values"""
    current_params = parse_qs(urlparse(url).query)
    updated_query_params = urlencode({**current_params, **params}, doseq=True)
    return f"{url.split('?')[0]}?{updated_query_params}"


async def scrape_paging(url: str, max_pages: int = 10) -> List[Dict]:
    """scrape all pages of a paginated listing, up to max_pages"""
    print(f"scraping product discovery paging {url}")
    # scrape first page
    result_first_page = await scrapfly.async_scrape(ScrapeConfig(url=url, asp=True))
    data_first_page = find_hidden_data(result_first_page)
    data_first_page = data_first_page["props"]["pageProps"]["initialState"]["listingPageReducer"]["listingData"]
    results = data_first_page["results"]

    # find total page count
    total_pages = data_first_page["pages"]
    if max_pages and max_pages < total_pages:
        total_pages = max_pages

    # scrape remaining pages
    print(f"scraping remaining total pages: {total_pages-1} concurrently")
    to_scrape = [ScrapeConfig(update_url_parameter(url, page=page), asp=True) for page in range(2, total_pages + 1)]
    async for result in scrapfly.concurrent_scrape(to_scrape):
        data = find_hidden_data(result)
        data = data["props"]["pageProps"]["initialState"]["listingPageReducer"]["listingData"]
        results.extend(data["results"])
    return results


async def example_run():
    """
    this example run scrapes an example product and the first 3 pages of the discounted listing,
    then saves them to ./results/product.json and ./results/categories.json respectively
    """
    out_dir = Path(__file__).parent / "results"
    out_dir.mkdir(exist_ok=True)

    product = await scrape_product("https://www.fashionphile.com/p/bottega-veneta-nappa-twisted-padded-intrecciato-curve-slide-sandals-36-black-1048096")
    out_dir.joinpath("product.json").write_text(json.dumps(product, indent=2, ensure_ascii=False))

    search = await scrape_paging("https://www.fashionphile.com/shop/discounted/all", max_pages=3)
    out_dir.joinpath("categories.json").write_text(json.dumps(search, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(example_run())
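Once results are saved to disk, repeated runs can be compared to track how listings change over time, which is the kind of monitoring mentioned at the start of this guide. The sketch below compares two snapshots keyed by product `id`. The `diff_snapshots` helper and the sample records are assumptions for illustration; only the `id` and `price` field names come from the sample dataset.

```python
def diff_snapshots(old: list, new: list) -> dict:
    """Compare two lists of product dicts by id and report what changed."""
    old_by_id = {item["id"]: item for item in old}
    new_by_id = {item["id"]: item for item in new}
    return {
        # ids that appear only in the newer snapshot
        "added": sorted(new_by_id.keys() - old_by_id.keys()),
        # ids that disappeared, for example because the item was sold
        "removed": sorted(old_by_id.keys() - new_by_id.keys()),
        # id -> (old price, new price) for items present in both snapshots
        "price_changed": {
            pid: (old_by_id[pid]["price"], new_by_id[pid]["price"])
            for pid in sorted(old_by_id.keys() & new_by_id.keys())
            if old_by_id[pid]["price"] != new_by_id[pid]["price"]
        },
    }


# made-up snapshots standing in for two scrape runs
yesterday = [{"id": 1048096, "price": 950}, {"id": 1, "price": 500}]
today = [{"id": 1048096, "price": 900}, {"id": 2, "price": 300}]

changes = diff_snapshots(yesterday, today)
print(changes)
```

In practice the two lists would be loaded from the JSON files written by separate runs of the scraper.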