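Nordstrom is a leading fashion retailer based in the US with an equally popular e-commerce store that operates worldwide. It's a popular web scraping target because of the rich data it offers and its position in the fashion industry.
In this guide, we'll take a look at web scraping Nordstrom using Python. We'll cover:
- Scraping Nordstrom product data.
- Product discovery and search.
For this, we'll use two popular Python web scraping tools, httpx and parsel. To parse the data, we'll use the hidden web data approach.
Nordstrom is relatively easy to scrape, so let's dive in!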
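Why Scrape Nordstrom?
Nordstrom is a popular fashion retailer with a huge product catalog, and the rich data it exposes makes it a great target for web scraping. Because of its popularity and dataset size, it's a good way to understand the fashion e-commerce market. This data can be used for business analytics, market analysis and competitive intelligence.
For more on web scraping uses, see our web scraping use case hub.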
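Scrape Preview
In this article, we'll focus on Nordstrom product data and product reviews. Here's an example of the datasets we'll be collecting:
Example scraped product dataset: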
{
"id": 5846438,
"title": "SKIMS Stretch Cotton T-Shirt",
"type": "T-shirt/Tee",
"typeParent": "Tops",
"ageGroups": [
"ADULT"
],
"reviewAverageRating": 4.5,
"numberOfReviews": 652,
"brand": {
"brandName": "SKIMS",
"brandUrl": "/brands/skims--21197?origin=productBrandLink",
"hasBrandPage": false,
"imsBrandId": 74974321
},
"description": "A tried-and-true classic, this fitted T-shirt made from stretch-cotton jersey is from Kim Kardashian's highly sought-out SKIMS.",
"features": [
"21 1/2\" length (size Medium)",
"Crewneck",
"Short sleeves",
"90% cotton, 10% elastane",
"Machine wash, tumble dry",
"Imported",
"Item #6194916"
],
"gender": "Female",
"isAvailable": true,
"media": {
"5847438": {
"id": 5847438,
"colorId": "053",
"name": "LIGHT HEATHER GREY",
"url": "https://n.nordstrommedia.com/id/sr3/e354aaf8-5865-431b-b8d8-3cbccc6a2d83.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5847448": {
"id": 5847448,
"colorId": "053",
"name": "LIGHT HEATHER GREY",
"url": "https://n.nordstrommedia.com/id/sr3/df191e8d-4f2c-48f4-9144-e6b9dbede775.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5847458": {
"id": 5847458,
"colorId": "053",
"name": "LIGHT HEATHER GREY",
"url": "https://n.nordstrommedia.com/id/sr3/bca96a41-af1b-4736-89e3-e2facb3ec8ed.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5847468": {
"id": 5847468,
"colorId": "053",
"name": "LIGHT HEATHER GREY",
"url": "https://n.nordstrommedia.com/id/sr3/1b0051f1-f60e-4b4b-8f79-3fabd077e91d.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5847478": {
"id": 5847478,
"colorId": "053",
"name": "LIGHT HEATHER GREY",
"url": "https://n.nordstrommedia.com/id/sr3/86510e70-589b-440a-b66a-98982ce59740.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5847488": {
"id": 5847488,
"colorId": "053",
"name": "LIGHT HEATHER GREY",
"url": "https://n.nordstrommedia.com/id/sr3/d6ae4e0c-3b22-4dff-b528-d428005d8cd8.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5848438": {
"id": 5848438,
"colorId": "234",
"name": "SEDONA",
"url": "https://n.nordstrommedia.com/id/sr3/d64c4a4d-ca98-46af-8ff4-efd7460e3321.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5848448": {
"id": 5848448,
"colorId": "234",
"name": "SEDONA",
"url": "https://n.nordstrommedia.com/id/sr3/f1d6105b-9e75-49aa-bfdb-39ed6a0cd82a.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5848458": {
"id": 5848458,
"colorId": "234",
"name": "SEDONA",
"url": "https://n.nordstrommedia.com/id/sr3/04936587-02d9-41c7-b36f-b7f90144df6e.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5849438": {
"id": 5849438,
"colorId": "242",
"name": "UMBER",
"url": "https://n.nordstrommedia.com/id/sr3/85f4e2d8-00de-41f9-b777-2169bb799970.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5849448": {
"id": 5849448,
"colorId": "242",
"name": "UMBER",
"url": "https://n.nordstrommedia.com/id/sr3/4e2bffa2-fb87-416c-8438-a922d593423f.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5849458": {
"id": 5849458,
"colorId": "242",
"name": "UMBER",
"url": "https://n.nordstrommedia.com/id/sr3/ca5f4ff8-7587-48cc-8914-818ee6320b9c.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5850438": {
"id": 5850438,
"colorId": "251",
"name": "CAMEL",
"url": "https://n.nordstrommedia.com/id/sr3/0762da9a-4326-46fd-9b84-6db33035c0ea.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5850448": {
"id": 5850448,
"colorId": "251",
"name": "CAMEL",
"url": "https://n.nordstrommedia.com/id/sr3/9f20433a-3d03-4893-87f9-2fd90f05c2b5.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5850458": {
"id": 5850458,
"colorId": "251",
"name": "CAMEL",
"url": "https://n.nordstrommedia.com/id/sr3/32d39f3b-88e8-4ee2-bb15-7723bed651c8.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5850468": {
"id": 5850468,
"colorId": "251",
"name": "CAMEL",
"url": "https://n.nordstrommedia.com/id/sr3/da666e38-7c2d-408e-9874-f30f094ccd9e.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5850478": {
"id": 5850478,
"colorId": "251",
"name": "CAMEL",
"url": "https://n.nordstrommedia.com/id/sr3/97828599-558b-48a5-8e03-35aeec7f6dbe.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5850488": {
"id": 5850488,
"colorId": "251",
"name": "CAMEL",
"url": "https://n.nordstrommedia.com/id/sr3/ef50a5bb-8f20-428d-8d64-0c7f9dd80776.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5851438": {
"id": 5851438,
"colorId": "301",
"name": "DEEP SEA",
"url": "https://n.nordstrommedia.com/id/sr3/8a2ed339-427b-4f93-9a49-762a43145d42.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5851448": {
"id": 5851448,
"colorId": "301",
"name": "DEEP SEA",
"url": "https://n.nordstrommedia.com/id/sr3/406118cc-c17a-42a5-842c-c12a54c19b39.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5852438": {
"id": 5852438,
"colorId": "339",
"name": "MINERAL",
"url": "https://n.nordstrommedia.com/id/sr3/a6c49b4c-1849-4c9e-895e-2804c4a0d01b.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5852448": {
"id": 5852448,
"colorId": "339",
"name": "MINERAL",
"url": "https://n.nordstrommedia.com/id/sr3/3c3820b0-0fe3-4869-bfcb-040917a78276.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5852458": {
"id": 5852458,
"colorId": "339",
"name": "MINERAL",
"url": "https://n.nordstrommedia.com/id/sr3/0544d615-d912-4fed-8e35-95bd9fdf753f.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5852468": {
"id": 5852468,
"colorId": "339",
"name": "MINERAL",
"url": "https://n.nordstrommedia.com/id/sr3/df96797a-9a3d-4070-83c5-cd7d94dd1260.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5853438": {
"id": 5853438,
"colorId": "400",
"name": "COBALT",
"url": "https://n.nordstrommedia.com/id/sr3/95c440cd-18ea-47e0-a48f-6f97f1e1c0fc.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5854438": {
"id": 5854438,
"colorId": "446",
"name": "KYANITE",
"url": "https://n.nordstrommedia.com/id/sr3/b0359253-5e23-4619-9123-34dfb35063e6.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5854448": {
"id": 5854448,
"colorId": "446",
"name": "KYANITE",
"url": "https://n.nordstrommedia.com/id/sr3/81a2918e-5643-4d63-8850-d9d8654b62af.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5854458": {
"id": 5854458,
"colorId": "446",
"name": "KYANITE",
"url": "https://n.nordstrommedia.com/id/sr3/8e2b9d0f-8b8f-4835-9a57-bb197f95631d.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5855438": {
"id": 5855438,
"colorId": "8",
"name": "525",
"url": "https://n.nordstrommedia.com/id/sr3/31cdee52-d41a-46a7-8691-3ae1e0c53fb7.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5855448": {
"id": 5855448,
"colorId": "8",
"name": "525",
"url": "https://n.nordstrommedia.com/id/sr3/a1843dad-b30c-4031-8d36-42c47934572f.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5855458": {
"id": 5855458,
"colorId": "8",
"name": "525",
"url": "https://n.nordstrommedia.com/id/sr3/e2543102-670e-40e8-acb6-916ea91f1515.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5855468": {
"id": 5855468,
"colorId": "8",
"name": "525",
"url": "https://n.nordstrommedia.com/id/sr3/3daefa94-9c8a-41f6-967e-f85b80ba3ebf.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5855478": {
"id": 5855478,
"colorId": "8",
"name": "525",
"url": "https://n.nordstrommedia.com/id/sr3/95c440cd-18ea-47e0-a48f-6f97f1e1c0fc.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5855488": {
"id": 5855488,
"colorId": "8",
"name": "525",
"url": "https://n.nordstrommedia.com/id/sr3/ad9d1fcc-a0a0-4856-8345-de54e3b6b54f.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5856438": {
"id": 5856438,
"colorId": "603",
"name": "SANGRIA",
"url": "https://n.nordstrommedia.com/id/sr3/aaa6a78e-f7d8-46f3-b51e-533642b5ea02.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5856448": {
"id": 5856448,
"colorId": "603",
"name": "SANGRIA",
"url": "https://n.nordstrommedia.com/id/sr3/3d680521-dc9e-4f07-a634-e02043e78910.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5856458": {
"id": 5856458,
"colorId": "603",
"name": "SANGRIA",
"url": "https://n.nordstrommedia.com/id/sr3/fbc8e722-af04-403f-a8fa-e938d56da1f3.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5856468": {
"id": 5856468,
"colorId": "603",
"name": "SANGRIA",
"url": "https://n.nordstrommedia.com/id/sr3/4f2e699b-125f-484e-8873-09f72a2fa40a.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5856478": {
"id": 5856478,
"colorId": "603",
"name": "SANGRIA",
"url": "https://n.nordstrommedia.com/id/sr3/1559d8ec-d03e-416c-9d27-c6d31151012f.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5856488": {
"id": 5856488,
"colorId": "603",
"name": "SANGRIA",
"url": "https://n.nordstrommedia.com/id/sr3/44ffdcad-614c-4329-ba6b-65244873e200.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5857438": {
"id": 5857438,
"colorId": "690",
"name": "ROSE CLAY",
"url": "https://n.nordstrommedia.com/id/sr3/35a9863f-feda-463c-aedf-a988329754c8.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5857448": {
"id": 5857448,
"colorId": "690",
"name": "ROSE CLAY",
"url": "https://n.nordstrommedia.com/id/sr3/2241bfc4-be0f-4645-a350-7d19aafce7ae.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5857458": {
"id": 5857458,
"colorId": "690",
"name": "ROSE CLAY",
"url": "https://n.nordstrommedia.com/id/sr3/3c76bafa-dda1-4069-9d55-4deddd58a70f.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5857468": {
"id": 5857468,
"colorId": "690",
"name": "ROSE CLAY",
"url": "https://n.nordstrommedia.com/id/sr3/a3e5cf6b-0e43-455e-aba4-0b7093e0ac60.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5857478": {
"id": 5857478,
"colorId": "690",
"name": "ROSE CLAY",
"url": "https://n.nordstrommedia.com/id/sr3/357882e4-c176-4c98-9601-39ee0299452a.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5857488": {
"id": 5857488,
"colorId": "690",
"name": "ROSE CLAY",
"url": "https://n.nordstrommedia.com/id/sr3/b9a9588e-a241-43b8-b907-0fc5d16d959c.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5858438": {
"id": 5858438,
"colorId": "900",
"name": "BONE",
"url": "https://n.nordstrommedia.com/id/sr3/eb5b0ed4-41b9-439b-a56d-a9f549892451.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5859438": {
"id": 5859438,
"colorId": "003",
"name": "SOOT",
"url": "https://n.nordstrommedia.com/id/sr3/2c5c5fd6-3df6-4e30-a5af-893041f219dc.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5860438": {
"id": 5860438,
"colorId": "203",
"name": "GARNET",
"url": "https://n.nordstrommedia.com/id/sr3/9b140781-5301-4137-b94e-fe10b7a674b4.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
}
},
"variants": {
"5871416": {
"id": 5871416,
"sizeId": "xx-small",
"colorId": "339",
"totalQuantityAvailable": 1,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "339",
"value": "Mineral",
"sizes": "_s:xx-small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5852438,
5852448,
5852458,
5852468
],
"swatch": "https://n.nordstrommedia.com/id/sr3/8a3eb8e4-e660-42d9-af9e-41d9e85ecb99.jpeg?crop=fit&w=31&h=31"
}
},
"5871419": {
"id": 5871419,
"sizeId": "medium",
"colorId": "339",
"totalQuantityAvailable": 9,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "339",
"value": "Mineral",
"sizes": "_s:xx-small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5852438,
5852448,
5852458,
5852468
],
"swatch": "https://n.nordstrommedia.com/id/sr3/8a3eb8e4-e660-42d9-af9e-41d9e85ecb99.jpeg?crop=fit&w=31&h=31"
}
},
"5871420": {
"id": 5871420,
"sizeId": "large",
"colorId": "339",
"totalQuantityAvailable": 10,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "339",
"value": "Mineral",
"sizes": "_s:xx-small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5852438,
5852448,
5852458,
5852468
],
"swatch": "https://n.nordstrommedia.com/id/sr3/8a3eb8e4-e660-42d9-af9e-41d9e85ecb99.jpeg?crop=fit&w=31&h=31"
}
},
"5871421": {
"id": 5871421,
"sizeId": "x-large",
"colorId": "339",
"totalQuantityAvailable": 19,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "339",
"value": "Mineral",
"sizes": "_s:xx-small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5852438,
5852448,
5852458,
5852468
],
"swatch": "https://n.nordstrommedia.com/id/sr3/8a3eb8e4-e660-42d9-af9e-41d9e85ecb99.jpeg?crop=fit&w=31&h=31"
}
},
"5871422": {
"id": 5871422,
"sizeId": "plus-2 x",
"colorId": "339",
"totalQuantityAvailable": 19,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "339",
"value": "Mineral",
"sizes": "_s:xx-small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5852438,
5852448,
5852458,
5852468
],
"swatch": "https://n.nordstrommedia.com/id/sr3/8a3eb8e4-e660-42d9-af9e-41d9e85ecb99.jpeg?crop=fit&w=31&h=31"
}
},
"5871423": {
"id": 5871423,
"sizeId": "plus-3 x",
"colorId": "339",
"totalQuantityAvailable": 14,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "339",
"value": "Mineral",
"sizes": "_s:xx-small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5852438,
5852448,
5852458,
5852468
],
"swatch": "https://n.nordstrommedia.com/id/sr3/8a3eb8e4-e660-42d9-af9e-41d9e85ecb99.jpeg?crop=fit&w=31&h=31"
}
},
"5871424": {
"id": 5871424,
"sizeId": "plus-4 x",
"colorId": "339",
"totalQuantityAvailable": 23,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "339",
"value": "Mineral",
"sizes": "_s:xx-small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5852438,
5852448,
5852458,
5852468
],
"swatch": "https://n.nordstrommedia.com/id/sr3/8a3eb8e4-e660-42d9-af9e-41d9e85ecb99.jpeg?crop=fit&w=31&h=31"
}
},
"33855448": {
"id": 33855448,
"sizeId": "small",
"colorId": "900",
"totalQuantityAvailable": 319,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855449": {
"id": 33855449,
"sizeId": "medium",
"colorId": "900",
"totalQuantityAvailable": 437,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855450": {
"id": 33855450,
"sizeId": "large",
"colorId": "900",
"totalQuantityAvailable": 626,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855451": {
"id": 33855451,
"sizeId": "x-large",
"colorId": "900",
"totalQuantityAvailable": 273,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855452": {
"id": 33855452,
"sizeId": "plus-2 x",
"colorId": "900",
"totalQuantityAvailable": 105,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855454": {
"id": 33855454,
"sizeId": "xx-small",
"colorId": "900",
"totalQuantityAvailable": 38,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855455": {
"id": 33855455,
"sizeId": "plus-3 x",
"colorId": "900",
"totalQuantityAvailable": 56,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855456": {
"id": 33855456,
"sizeId": "plus-4 x",
"colorId": "900",
"totalQuantityAvailable": 67,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855464": {
"id": 33855464,
"sizeId": "x-small",
"colorId": "003",
"totalQuantityAvailable": 1,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "003",
"value": "Soot",
"sizes": "_s:xx-small|_s:x-small|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5859438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/51d8a867-3627-4f76-88c2-5f3a6397ad2a.jpeg?crop=fit&w=31&h=31"
}
},
"33855477": {
"id": 33855477,
"sizeId": "large",
"colorId": "003",
"totalQuantityAvailable": 720,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "003",
"value": "Soot",
"sizes": "_s:xx-small|_s:x-small|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5859438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/51d8a867-3627-4f76-88c2-5f3a6397ad2a.jpeg?crop=fit&w=31&h=31"
}
},
"33855478": {
"id": 33855478,
"sizeId": "x-large",
"colorId": "003",
"totalQuantityAvailable": 317,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "003",
"value": "Soot",
"sizes": "_s:xx-small|_s:x-small|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5859438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/51d8a867-3627-4f76-88c2-5f3a6397ad2a.jpeg?crop=fit&w=31&h=31"
}
},
"33855479": {
"id": 33855479,
"sizeId": "plus-2 x",
"colorId": "003",
"totalQuantityAvailable": 166,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "003",
"value": "Soot",
"sizes": "_s:xx-small|_s:x-small|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5859438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/51d8a867-3627-4f76-88c2-5f3a6397ad2a.jpeg?crop=fit&w=31&h=31"
}
},
"33855480": {
"id": 33855480,
"sizeId": "xx-small",
"colorId": "003",
"totalQuantityAvailable": 22,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "003",
"value": "Soot",
"sizes": "_s:xx-small|_s:x-small|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5859438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/51d8a867-3627-4f76-88c2-5f3a6397ad2a.jpeg?crop=fit&w=31&h=31"
}
},
"33855482": {
"id": 33855482,
"sizeId": "plus-3 x",
"colorId": "003",
"totalQuantityAvailable": 11,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "003",
"value": "Soot",
"sizes": "_s:xx-small|_s:x-small|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5859438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/51d8a867-3627-4f76-88c2-5f3a6397ad2a.jpeg?crop=fit&w=31&h=31"
}
},
"33855483": {
"id": 33855483,
"sizeId": "plus-4 x",
"colorId": "003",
"totalQuantityAvailable": 18,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "003",
"value": "Soot",
"sizes": "_s:xx-small|_s:x-small|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5859438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/51d8a867-3627-4f76-88c2-5f3a6397ad2a.jpeg?crop=fit&w=31&h=31"
}
},
"36450158": {
"id": 36450158,
"sizeId": "medium",
"colorId": "053",
"totalQuantityAvailable": 241,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450160": {
"id": 36450160,
"sizeId": "large",
"colorId": "053",
"totalQuantityAvailable": 137,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450161": {
"id": 36450161,
"sizeId": "x-large",
"colorId": "053",
"totalQuantityAvailable": 69,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450162": {
"id": 36450162,
"sizeId": "plus-2 x",
"colorId": "053",
"totalQuantityAvailable": 40,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450163": {
"id": 36450163,
"sizeId": "xx-small",
"colorId": "053",
"totalQuantityAvailable": 16,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450164": {
"id": 36450164,
"sizeId": "plus-3 x",
"colorId": "053",
"totalQuantityAvailable": 23,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450165": {
"id": 36450165,
"sizeId": "plus-4 x",
"colorId": "053",
"totalQuantityAvailable": 27,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450185": {
"id": 36450185,
"sizeId": "x-small",
"colorId": "053",
"totalQuantityAvailable": 46,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450186": {
"id": 36450186,
"sizeId": "small",
"colorId": "053",
"totalQuantityAvailable": 197,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"38558224": {
"id": 38558224,
"sizeId": "plus-2 x",
"colorId": "446",
"totalQuantityAvailable": 22,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "446",
"value": "Kyanite",
"sizes": "_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5854438,
5854448,
5854458
],
"swatch": "https://n.nordstrommedia.com/id/sr3/2f637c12-349c-4506-9021-70e078f2ffe4.jpeg?crop=fit&w=31&h=31"
}
},
"38558226": {
"id": 38558226,
"sizeId": "plus-3 x",
"colorId": "446",
"totalQuantityAvailable": 5,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "446",
"value": "Kyanite",
"sizes": "_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5854438,
5854448,
5854458
],
"swatch": "https://n.nordstrommedia.com/id/sr3/2f637c12-349c-4506-9021-70e078f2ffe4.jpeg?crop=fit&w=31&h=31"
}
},
"38558227": {
"id": 38558227,
"sizeId": "plus-4 x",
"colorId": "446",
"totalQuantityAvailable": 7,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "446",
"value": "Kyanite",
"sizes": "_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5854438,
5854448,
5854458
],
"swatch": "https://n.nordstrommedia.com/id/sr3/2f637c12-349c-4506-9021-70e078f2ffe4.jpeg?crop=fit&w=31&h=31"
}
}
}
}
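Setup
For this scraper, we'll use the hidden web data scraping approach. We'll retrieve HTML pages, extract the hidden JSON datasets and then parse them using JSON parsing tools:
- httpx: a powerful HTTP client which we'll use to retrieve the HTML pages.
- parsel: an HTML parser which we'll use to extract the hidden JSON datasets.
- nested-lookup: a JSON/dict parser which will help us find specific keys in large JSON datasets.
- jmespath: a JSON query engine which we'll use to reduce the JSON datasets to the important bits like product price, images etc. For more, see our introduction to parsing JSON with JMESPath.
All of these packages can be installed using Python's pip console command: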
$ pip install "httpx[http2,brotli]" parsel jmespath nested-lookup
For Scrapfly users, each code example also has a Scrapfly SDK version. The SDK can also be installed using pip:
$ pip install "scrapfly-sdk[all]"
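Scraping Nordstrom Product Data
Let's start by scraping the product data of a single product. For this, let's take a look at an example product page like: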
nordstrom.com/s/nike-phoenix-fleece-crewneck-sweatshirt/
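We could parse the HTML using CSS selectors or XPath, but since Nordstrom uses the React JavaScript framework to power its website, we can extract the dataset directly from the page source instead.
If we open the page source and search (Ctrl+F) for a unique piece of product text, like the description or title, we can see that there's a hidden JSON dataset. In web scraping, this is called hidden web data scraping, so let's take a look at how to scrape it in Python.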
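To make the idea concrete, here is a minimal offline sketch. The script body and its product fields are made up for illustration (they are assumptions, not real Nordstrom output); only the `__INITIAL_CONFIG__` variable name comes from the scraper in this guide. It shows how a JavaScript assignment found in a script tag can be turned into a Python dictionary with the standard library alone:

```python
import json

# Hypothetical script body, shaped like a hidden dataset assignment.
# The field values are invented for this example.
script_text = 'window.__INITIAL_CONFIG__ = {"stylesById": {"1": {"productTitle": "a = b tee"}}};'


def script_to_dict(script: str) -> dict:
    """Drop everything up to the first "=" and parse the rest as JSON."""
    # maxsplit=1 leaves any "=" characters inside the JSON itself untouched
    payload = script.split("=", 1)[-1].strip().strip(";")
    return json.loads(payload)


hidden = script_to_dict(script_text)
print(hidden["stylesById"]["1"]["productTitle"])
```

Splitting only once matters here: product titles and URLs inside the JSON often contain `=` characters, and an unbounded split would cut the payload apart.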
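Our scraping process will look like this:
- Retrieve the product HTML page using httpx.
- Use parsel and XPath to find the hidden JSON dataset in a <script> tag.
- Load the JSON dataset using json.loads() and use nested-lookup to find the product fields.
In Python, this scraper looks like this: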
# Python version (httpx + parsel):
import asyncio
import json

import httpx
from parsel import Selector
from nested_lookup import nested_lookup

# set up an httpx client with HTTP/2 enabled and browser-like headers to reduce the chance of being blocked:
client = httpx.AsyncClient(
    http2=True,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
)


def find_hidden_data(html) -> dict:
    """extract hidden web cache from page html"""
    # use XPath to find the script tag that contains the data
    data = Selector(text=html).xpath("//script[contains(.,'__INITIAL_CONFIG__')]/text()").get()
    # the script is a JavaScript assignment: keep only the JSON after the first "="
    data = data.split("=", 1)[-1].strip().strip(";")
    data = json.loads(data)
    return data


async def scrape_product(url: str):
    """scrape Nordstrom.com product page for product data"""
    response = await client.get(url)
    # find the whole hidden dataset:
    data = find_hidden_data(response.text)
    # extract only product data from the dataset
    # find first key "stylesById" and take first value (which is the current product)
    product = nested_lookup("stylesById", data)
    product = list(product[0].values())[0]
    return product


# example scrape run:
print(asyncio.run(scrape_product("https://www.nordstrom.com/s/nike-phoenix-fleece-crewneck-sweatshirt/6665302")))
# Scrapfly SDK version:
import asyncio
import json

from nested_lookup import nested_lookup
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")


def find_hidden_data(result: ScrapeApiResponse) -> dict:
    """extract hidden web cache from page html"""
    # use XPath to find the script tag that contains the data
    data = result.selector.xpath("//script[contains(.,'__INITIAL_CONFIG__')]/text()").get()
    # the script is a JavaScript assignment: keep only the JSON after the first "="
    data = data.split("=", 1)[-1].strip().strip(";")
    data = json.loads(data)
    return data


async def scrape_product(url: str):
    """scrape Nordstrom.com product page for product data"""
    # async_scrape is the awaitable variant of the SDK's scrape call
    result = await client.async_scrape(ScrapeConfig(
        url=url,
        asp=True,  # enable anti-scraping-protection bypass
        cache=True,  # enable cache while we develop
        debug=True,  # enable debug mode while we develop
    ))
    # find the whole hidden dataset (pass the result object, not its text):
    data = find_hidden_data(result)
    # extract only product data from the dataset
    # find first key "stylesById" and take first value (which is the current product)
    product = nested_lookup("stylesById", data)
    product = list(product[0].values())[0]
    return product


# example scrape run:
print(asyncio.run(scrape_product("https://www.nordstrom.com/s/nike-phoenix-fleece-crewneck-sweatshirt/6665302")))
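In just a few lines of Python code, we got the whole product dataset from Nordstrom! However, this dataset is huge and would be hard to ingest into a data pipeline if we wanted to run analytics on it or store it. So next, let's use JMESPath to reduce the dataset to the most important values, like pricing, images and variant data.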
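The `nested_lookup("stylesById", data)` call used by the scraper searches the whole dataset for a key, however deeply it is nested. As a rough illustration of that idea, here is a simplified standard-library stand-in. It is not the library's actual implementation; `find_key`, the `viewData` key and the sample values are hypothetical names invented for this sketch:

```python
from typing import Any, Iterator


def find_key(key: str, document: Any) -> Iterator[Any]:
    """Yield every value stored under `key`, at any depth of dicts and lists."""
    if isinstance(document, dict):
        for name, value in document.items():
            if name == key:
                yield value
            # keep descending: the same key may appear deeper as well
            yield from find_key(key, value)
    elif isinstance(document, list):
        for item in document:
            yield from find_key(key, item)


# Invented, heavily trimmed stand-in for the hidden dataset
dataset = {
    "viewData": {
        "stylesById": {"6665302": {"productTitle": "Example Sweatshirt"}},
    },
    "other": [{"unrelated": True}],
}

matches = list(find_key("stylesById", dataset))
# same step as in the scraper: take the first product under the first match
product = list(matches[0].values())[0]
print(product["productTitle"])
```

Searching by key rather than by a fixed path keeps the scraper working when the site moves the dataset around inside its page state, at the cost of possibly matching an unrelated key with the same name.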
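Parsing with JMESPath
JMESPath is a JSON query language, and since Python dictionaries are equivalent to JSON objects, we can use JMESPath to parse the Nordstrom data.
We'll be using JMESPath's data reshaping feature, which lets us specify a key mapping to reduce the dataset. For example: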
import jmespath

data = {
    "id": "123456",
    "productTitle": "Product Title",
    "type": "sweater",
    "unimportant": "foobar",
    "photos": {
        "desktop": "http://example.com/photo.jpg",
        "mobile": "http://example.com/photo-small.jpg",
    },
}
# jmespath search takes a query string and a data object.
# here we use `{}` remapping feature to rename keys of the original dataset
reduced = jmespath.search(
    """{
    id: id,
    title: productTitle,
    type: type,
    photo: photos.desktop
    }""",
    data,
)
print(reduced)
{"id": "123456", "title": "Product Title", "type": "sweater", "photo": "http://example.com/photo.jpg"}
This powerful tool lets us easily reshape scraped datasets. So, let's use it to reshape the Nordstrom product dataset we just scraped:
import jmespath


def parse_product(data: dict) -> dict:
    # parse product basic data like id, name, features etc.
    product = jmespath.search(
        """{
        id: id,
        title: productTitle,
        type: productTypeName,
        typeParent: productTypeParentName,
        ageGroups: ageGroups,
        reviewAverageRating: reviewAverageRating,
        numberOfReviews: numberOfReviews,
        brand: brand,
        description: sellingStatement,
        features: features,
        gender: gender,
        isAvailable: isAvailable
        }""",
        data,
    )
    # product variants have their own colors, prices and photos:
    prices_by_sku = data["price"]["bySkuId"]
    colors_by_id = data["filters"]["color"]["byId"]
    product["media"] = {}
    for media_id, media in data["styleMedia"]["byId"].items():
        product["media"][media_id] = jmespath.search(
            """{
            id: id,
            colorId: colorId,
            name: colorName,
            url: imageMediaUri.largeDesktop
            }""",
            media,
        )
    # Each product has SKUs (Stock Keeping Units) which are the actual variants:
    product["variants"] = {}
    for sku, sku_data in data["skus"]["byId"].items():
        # get basic variant data
        parsed = jmespath.search(
            """{
            id: id,
            sizeId: sizeId,
            colorId: colorId,
            totalQuantityAvailable: totalQuantityAvailable
            }""",
            sku_data,
        )
        # get variant price from the per-SKU price table
        parsed["price"] = prices_by_sku[sku]["regular"]["price"]
        # get variant color data
        parsed["color"] = jmespath.search(
            """{
            id: id,
            value: value,
            sizes: isAvailableWith,
            mediaIds: styleMediaIds,
            swatch: swatchMedia.desktop
            }""",
            colors_by_id[parsed["colorId"]],
        )
        product["variants"][sku] = parsed
    return product
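This might look complicated, but all we're doing is mapping the original dataset keys to new keys with JMESPath. Our scraper now produces a clean and tidy product dataset that is easy to ingest into a data pipeline!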
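To make the pipeline point concrete, here is a minimal sketch that flattens the parsed variants into one row per SKU and writes them as CSV. The `variant_rows` helper and the `sample` values are hypothetical. The sketch assumes the field names produced by `parse_product` above, and that `price` and `totalQuantityAvailable` are plain numbers:

```python
import csv
import io
from typing import Dict, Iterator


def variant_rows(product: Dict) -> Iterator[Dict]:
    """Yield one flat row per SKU from a parse_product()-style result."""
    for sku, variant in product["variants"].items():
        yield {
            "product_id": product["id"],
            "title": product["title"],
            "sku": sku,
            "size": variant["sizeId"],
            "color": variant["color"]["value"],
            "price": variant["price"],
            "in_stock": variant["totalQuantityAvailable"] > 0,
        }


# hypothetical sample shaped like the parser's output (all values are made up)
sample = {
    "id": 1,
    "title": "Example Tee",
    "variants": {
        "100": {"sizeId": "small", "totalQuantityAvailable": 3, "price": 20.0,
                "color": {"id": "001", "value": "Black"}},
        "101": {"sizeId": "medium", "totalQuantityAvailable": 0, "price": 20.0,
                "color": {"id": "001", "value": "Black"}},
    },
}

rows = list(variant_rows(sample))
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

One row per SKU is only one possible layout; a pipeline that cares about colors or media rather than stock would flatten along a different key.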
<! - kg-card-end:markdown-> <! - kg-card-begin:markdown->
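Finding Products
Now that we can scrape a single Nordstrom product, we need to find product URLs to scrape. We could look up the products we want and enter their URLs by hand, but to scale up our scraper we'll discover them by scraping product categories or search.
For this, we'll use the same hidden data scraping approach, since every category or search results page contains a hidden dataset with product preview data (such as price, title and images) and product page URLs.
For example, let's take a look at one of the Nordstrom search pages:
nordstrom.com/sr?origin=keywordsearch&keyword=indigo
We can see that each search (or category) result is split across several pages, so we need to scrape the pagination as well.
To scrape it, we'll use an approach very similar to the one we used for product pages: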
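- Scrape the first search/category page HTML.
- Use parsel and XPath to find the hidden web data.
- Use nested-lookup to extract product preview data and pagination info from the hidden dataset.
- Calculate the total number of pages and scrape them.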
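The extraction steps can be tried offline first. The following is a minimal standard-library sketch that uses `html.parser` in place of parsel. The mock `page` and its `window.__INITIAL_CONFIG__ = {...};` assignment form are assumptions made for illustration, and the `ScriptCollector` class is a hypothetical helper:

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts[-1] += data


def find_hidden_data(html: str) -> dict:
    """Find the script holding __INITIAL_CONFIG__ and decode its JSON payload."""
    collector = ScriptCollector()
    collector.feed(html)
    for script in collector.scripts:
        if "__INITIAL_CONFIG__" in script:
            # drop the "window.__INITIAL_CONFIG__ =" prefix and the trailing ";"
            payload = script.split("=", 1)[-1].strip().strip(";")
            return json.loads(payload)
    raise ValueError("no __INITIAL_CONFIG__ script found")


# mock page for illustration; real pages carry a much larger dataset
page = """<html><body>
<script>window.__INITIAL_CONFIG__ = {"productResults": {"query": {"pageCount": 3}, "productsById": {"1": {"name": "Example"}}}};</script>
</body></html>"""

data = find_hidden_data(page)
print(data["productResults"]["query"]["pageCount"])  # 3
```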
Let's see how this works in Python. The first version below uses httpx and parsel, and the second uses the ScrapFly SDK:
import asyncio
import json
from typing import Dict, List
from urllib.parse import parse_qs, urlencode, urlparse

import httpx
from nested_lookup import nested_lookup
from parsel import Selector

# setup httpx client with http2 enabled and browser-like headers to avoid being blocked:
client = httpx.AsyncClient(
    http2=True,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
)


def find_hidden_data(html) -> dict:
    """extract hidden web cache from page html"""
    # use XPath to find script tag with data
    data = Selector(html).xpath("//script[contains(.,'__INITIAL_CONFIG__')]/text()").get()
    data = data.split("=", 1)[-1].strip().strip(";")
    data = json.loads(data)
    return data


def update_url_parameter(url, **params):
    """update url query parameter of an url with new values"""
    current_params = parse_qs(urlparse(url).query)
    updated_query_params = urlencode({**current_params, **params}, doseq=True)
    # split() keeps the whole url intact when it has no query string yet
    return url.split("?", 1)[0] + "?" + updated_query_params


async def scrape_search(url: str, max_pages: int = 10) -> List[Dict]:
    """Scrape Nordstrom search or category url for product preview data"""
    print(f"scraping first search page: {url}")
    first_page = await client.get(url)
    # parse first page for product search data and total amount of pages:
    data = find_hidden_data(first_page.text)
    _first_page_results = nested_lookup("productResults", data)[0]
    products = list(_first_page_results["productsById"].values())
    paging_info = _first_page_results["query"]
    total_pages = paging_info["pageCount"]
    if max_pages and max_pages < total_pages:
        total_pages = max_pages

    # then scrape other pages concurrently:
    print(f"  scraping remaining {total_pages - 1} search pages")
    _other_pages = [client.get(update_url_parameter(url, page=page)) for page in range(2, total_pages + 1)]
    for response in asyncio.as_completed(_other_pages):
        response = await response
        # anything other than HTTP 200 is treated as a blocked request
        if response.status_code != 200:
            print(f"!!! scrape page {response.url} got blocked; skipping")
            continue
        data = find_hidden_data(response.text)
        data = nested_lookup("productResults", data)[0]
        products.extend(list(data["productsById"].values()))
    return products


# example scrape run for search of "indigo" keyword with max 2 pages:
print(asyncio.run(scrape_search("https://www.nordstrom.com/sr?origin=keywordsearch&keyword=indigo", max_pages=2)))

import asyncio
import json
from typing import Dict, List
from urllib.parse import parse_qs, urlencode, urlparse

from nested_lookup import nested_lookup
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")


def find_hidden_data(result: ScrapeApiResponse) -> dict:
    """extract hidden web cache from page html"""
    # use XPath to find script tag with data
    data = result.selector.xpath("//script[contains(.,'__INITIAL_CONFIG__')]/text()").get()
    data = data.split("=", 1)[-1].strip().strip(";")
    data = json.loads(data)
    return data


def update_url_parameter(url, **params):
    """update url query parameter of an url with new values"""
    current_params = parse_qs(urlparse(url).query)
    updated_query_params = urlencode({**current_params, **params}, doseq=True)
    # split() keeps the whole url intact when it has no query string yet
    return url.split("?", 1)[0] + "?" + updated_query_params


async def scrape_search(url: str, max_pages: int = 10) -> List[Dict]:
    """Scrape Nordstrom search or category url for product preview data"""
    print(f"scraping first search page: {url}")
    first_page = await scrapfly.async_scrape(
        ScrapeConfig(
            url=url,
            country="US",
            asp=True,
            debug=True,
            cache=True,
        )
    )
    # parse first page for product search data and total amount of pages:
    data = find_hidden_data(first_page)
    _first_page_results = nested_lookup("productResults", data)[0]
    products = list(_first_page_results["productsById"].values())
    paging_info = _first_page_results["query"]
    total_pages = paging_info["pageCount"]
    if max_pages and max_pages < total_pages:
        total_pages = max_pages

    # then scrape other pages concurrently:
    print(f"  scraping remaining {total_pages - 1} search pages")
    _other_pages = [
        ScrapeConfig(
            url=update_url_parameter(url, page=page),
            country="US",
            asp=True,
        )
        for page in range(2, total_pages + 1)
    ]
    async for result in scrapfly.concurrent_scrape(_other_pages):
        data = find_hidden_data(result)
        data = nested_lookup("productResults", data)[0]
        products.extend(list(data["productsById"].values()))
    return products


# example scrape run for search of "indigo" keyword with max 2 pages:
print(asyncio.run(scrape_search("https://www.nordstrom.com/sr?origin=keywordsearch&keyword=indigo", max_pages=2)))
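The pagination arithmetic in both versions can be checked without making any requests. Here is a minimal sketch using only `urllib.parse`. The `page_urls` helper is hypothetical, and it assumes (as the scrapers above do) that pages are selected with a `page` query parameter:

```python
from typing import List
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def page_urls(url: str, page_count: int, max_pages: int = 10) -> List[str]:
    """Build the URLs of pages 2..N, where N is page_count capped by max_pages."""
    total = min(page_count, max_pages) if max_pages else page_count
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    urls = []
    for page in range(2, total + 1):
        # keep the existing parameters and set (or overwrite) the page number
        new_query = urlencode({**query, "page": page}, doseq=True)
        urls.append(urlunsplit(parts._replace(query=new_query)))
    return urls


urls = page_urls(
    "https://www.nordstrom.com/sr?origin=keywordsearch&keyword=indigo",
    page_count=5,
    max_pages=3,
)
for u in urls:
    print(u)
```

The first page is skipped because it has already been fetched to read the page count.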
<! - kg-card-end:markdown-> <! - kg-card-begin:markdown->
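Avoiding Blocking with ScrapFly
Nordstrom is somewhat notorious for blocking web scrapers, so to scale beyond the small amount of scraping in this guide we need to use proxies or other tools to avoid scraper blocking.
ScrapFly API is an ideal tool for scaling web scrapers and avoiding blocking. It works as a drop-in replacement for the tools we used in this guide and comes with features that power up scrapers: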
- Millions of Residential Proxies
- Anti Scraping Protection bypass
- Javascript rendering and headless cloud browsers
- Web dashboard for monitoring and managing scrapers
All of these tools are easily accessible through the Python SDK:
from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
    url="https://www.nordstrom.com/sr?origin=keywordsearch&keyword=indigo",
    # enable scraper blocking service bypass
    asp=True,
    # optional - render javascript using headless browsers:
    render_js=True,
))
print(result.content)
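Whichever client is used, a blocked response does not have to be dropped on the first failure. The following is a minimal sketch of retrying with exponential backoff. It is not part of the ScrapFly SDK or httpx; `fetch_with_retries` and `fake_fetch` are hypothetical names, and the fetcher is assumed to return a `(status_code, body)` tuple:

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple


async def fetch_with_retries(
    fetch: Callable[[str], Awaitable[Tuple[int, str]]],
    url: str,
    attempts: int = 3,
    base_delay: float = 0.1,
) -> str:
    """Call `fetch(url)` until it returns HTTP 200 or the attempts run out."""
    for attempt in range(attempts):
        status, body = await fetch(url)
        if status == 200:
            return body
        if attempt < attempts - 1:
            # exponential backoff with a little random jitter
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise RuntimeError(f"{url} still blocked after {attempts} attempts")


# hypothetical fetcher that is "blocked" twice before succeeding
calls = {"n": 0}


async def fake_fetch(url: str) -> Tuple[int, str]:
    calls["n"] += 1
    return (200, "<html>ok</html>") if calls["n"] >= 3 else (403, "")


body = asyncio.run(fetch_with_retries(fake_fetch, "https://example.com", base_delay=0.01))
print(body, calls["n"])
```

Retries alone will not get past a persistent block; they only smooth over intermittent failures.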
For more on web scraping Nordstrom with ScrapFly, see the Full Scraper Code section.
<! - kg-card-end:markdown-> <! - kg-card-begin:markdown->
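FAQ
To wrap up this article, let's take a look at some frequently asked questions about scraping Nordstrom:
Is it legal to scrape Nordstrom?
Yes. Scraping publicly available data on Nordstrom is perfectly legal. However, attention should be paid to scraping speed and to scraping user reviews, as they may contain copyrighted data, such as images, that may require permission to store depending on the country.
Can Nordstrom be crawled?
Yes. Like many e-commerce websites, Nordstrom lends itself to web crawling because it has many product references across the site. Note that crawling is much more resource intensive than the direct web scraping we covered in this tutorial, so it is not recommended. Related: What's the difference between Web Scraping and Crawling?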
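Summary
In this web scraping guide, we looked at how to scrape Nordstrom, a popular fashion e-commerce store.
For this, we used Python with httpx, parsel, nested-lookup and jmespath together with the hidden web data scraping approach. We collected HTML pages and extracted the hidden React framework data to find product data fields in just a few lines of Python code.
To avoid blocking, we took a look at ScrapFly, a web scraping API that can be used to scale web scrapers and avoid being blocked. Try it for free!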
<! - kg-card-end:markdown-> <! - kg-card-begin:markdown->
<! - kg-card-end:markdown-> <! - kg-card-begin:markdown->
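Full Scraper Code
Here is the full Nordstrom scraper using Python and the scrapfly Python SDK:
This code should only be used as a reference. To scrape data from Nordstrom at scale, you will need to adjust it to your preferences and environment.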
import asyncio
import os
import json
from pathlib import Path
from typing import Dict, List
from urllib.parse import parse_qs, urlencode, urlparse

from nested_lookup import nested_lookup
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
import jmespath

scrapfly = ScrapflyClient(key=os.environ["SCRAPFLY_KEY"], max_concurrency=10)


def find_hidden_data(result: ScrapeApiResponse) -> dict:
    """extract hidden web cache from page html"""
    data = result.selector.xpath("//script[contains(.,'__INITIAL_CONFIG__')]/text()").get()
    data = data.split("=", 1)[-1].strip().strip(";")
    data = json.loads(data)
    return data


def parse_product(data: dict) -> dict:
    # parse product basic data like id, name, features etc.
    product = jmespath.search(
        """{
        id: id,
        title: productTitle,
        type: productTypeName,
        typeParent: productTypeParentName,
        ageGroups: ageGroups,
        reviewAverageRating: reviewAverageRating,
        numberOfReviews: numberOfReviews,
        brand: brand,
        description: sellingStatement,
        features: features,
        gender: gender,
        isAvailable: isAvailable
        }""",
        data,
    )
    # product variants have their own colors, prices and photos:
    prices_by_sku = data["price"]["bySkuId"]
    colors_by_id = data["filters"]["color"]["byId"]
    product["media"] = {}
    for media_id, media in data["styleMedia"]["byId"].items():
        product["media"][media_id] = jmespath.search(
            """{
            id: id,
            colorId: colorId,
            name: colorName,
            url: imageMediaUri.largeDesktop
            }""",
            media,
        )
    # Each product has SKUs (Stock Keeping Units) which are the actual variants:
    product["variants"] = {}
    for sku, sku_data in data["skus"]["byId"].items():
        # get basic variant data
        parsed = jmespath.search(
            """{
            id: id,
            sizeId: sizeId,
            colorId: colorId,
            totalQuantityAvailable: totalQuantityAvailable
            }""",
            sku_data,
        )
        # get variant price from the per-SKU price map
        parsed["price"] = prices_by_sku[sku]["regular"]["price"]
        # get variant color data
        parsed["color"] = jmespath.search(
            """{
            id: id,
            value: value,
            sizes: isAvailableWith,
            mediaIds: styleMediaIds,
            swatch: swatchMedia.desktop
            }""",
            colors_by_id[parsed["colorId"]],
        )
        product["variants"][sku] = parsed
    return product


async def scrape_product(url: str) -> dict:
    """scrape a single Nordstrom product page for product data"""
    result = await scrapfly.async_scrape(
        ScrapeConfig(
            url=url,
            asp=True,
            cache=True,
        )
    )
    data = find_hidden_data(result)
    # extract all products datasets from page cache
    product = nested_lookup("stylesById", data)
    product = list(product[0].values())[0]
    return parse_product(product)


def update_url_parameter(url, **params):
    """update url query parameter of an url with new values"""
    current_params = parse_qs(urlparse(url).query)
    updated_query_params = urlencode({**current_params, **params}, doseq=True)
    # split() keeps the whole url intact when it has no query string yet
    return url.split("?", 1)[0] + "?" + updated_query_params


async def scrape_search(url: str, max_pages: int = 10) -> List[Dict]:
    """Scrape Nordstrom search or category url for product preview data"""
    print(f"scraping first search page: {url}")
    first_page = await scrapfly.async_scrape(
        ScrapeConfig(
            url=url,
            asp=True,
            cache=True,
        )
    )
    # parse first page for product search data and total amount of pages:
    data = find_hidden_data(first_page)
    _first_page_results = nested_lookup("productResults", data)[0]
    products = list(_first_page_results["productsById"].values())
    paging_info = _first_page_results["query"]
    total_pages = paging_info["pageCount"]
    if max_pages and max_pages < total_pages:
        total_pages = max_pages

    # then scrape other pages concurrently:
    print(f"  scraping remaining {total_pages - 1} search pages")
    _other_pages = [
        ScrapeConfig(
            url=update_url_parameter(url, page=page),
            country="US",
            asp=True,
        )
        for page in range(2, total_pages + 1)
    ]
    async for result in scrapfly.concurrent_scrape(_other_pages):
        data = find_hidden_data(result)
        data = nested_lookup("productResults", data)[0]
        products.extend(list(data["productsById"].values()))
    return products


async def example_run():
    """
    this example run will scrape example product and 2 pages of search results and
    save them to ./results/product.json and ./results/search.json respectively
    """
    out_dir = Path(__file__).parent / "results"
    out_dir.mkdir(exist_ok=True)

    product = await scrape_product("https://www.nordstrom.com/s/nike-phoenix-fleece-crewneck-sweatshirt/6665302")
    out_dir.joinpath("product.json").write_text(json.dumps(product, indent=2, ensure_ascii=False))

    search = await scrape_search("https://www.nordstrom.com/sr?origin=keywordsearch&keyword=foo", max_pages=2)
    out_dir.joinpath("search.json").write_text(json.dumps(search, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(example_run())
<! - kg-card-end:markdown->