Notes on using Charles to scrape product data from a Walmart online store in the 'JD Daojia' (京东到家) mini program (macOS)

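1. Install and open Charles, then configure it.

1.1 Click Proxy -> macOS Proxy to turn off the default macOS proxy setting (the menu item should end up unchecked).

1.2 Open Proxy -> Proxy Settings; a dialog pops up. Set the HTTP proxy port to 6666 (the default is usually 8888; you can pick your own).

1.3 Open Proxy -> SSL Proxying Settings and add the domains you want to capture. Adding * matches every domain.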

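2. Configure the phone (iOS is used as the example).

2.1 Tap the info icon next to the connected Wi-Fi network; tap the last item, HTTP Proxy -> Configure Proxy; choose 'Manual' and enter the computer's IP address and the port just set in Charles: 6666.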
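
The computer's IP address needed in this step can be read from the macOS network settings. As an alternative, here is a minimal Python sketch that prints it. The helper name local_ip and the probe address 10.255.255.255 are illustrative assumptions, not part of the original setup, and the result is only a best-effort guess on machines with several network interfaces.

```python
import socket


def local_ip():
    """Return a best-effort LAN IPv4 address for this machine."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packet; it only makes the OS
        # choose the outgoing interface. The target address is arbitrary.
        s.connect(("10.255.255.255", 1))
        return s.getsockname()[0]
    except OSError:
        # No usable route (for example, no network): fall back to loopback.
        return "127.0.0.1"
    finally:
        s.close()


# Enter this address and port 6666 in the phone's manual proxy settings.
print(local_ip(), 6666)
```

The phone and the computer have to be on the same Wi-Fi network for this address to be reachable.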

 

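3. Configure HTTPS capture.

3.1 The requests to capture are HTTPS, so a certificate has to be installed as well. Choose Help -> SSL Proxying -> Install Charles Root Certificate.

3.2 Double-click the Charles certificate that was added on the computer and choose 'Always Trust'.

3.3 Install the certificate on the phone. Choose Help -> SSL Proxying -> Install Charles Root Certificate on a Mobile Device or Remote Browser. As the prompt says, open chls.pro/ssl on the phone.

3.4 Follow the pop-up prompts to install the certificate on the phone.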

  

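3.5 Under 'General -> About -> Certificate Trust Settings', enable full trust for the certificate. (Charles decrypts HTTPS by re-signing each site's certificate with its own root certificate, so that root certificate has to be installed and trusted on both the phone and the computer.)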

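4. Capture the traffic and scrape the data.

4.1 Click the round record button in Charles to start capturing the phone's traffic.

The example in this post picks one Walmart store and opens that store in the mini program to capture its data.

4.2 Analysing the captured requests reveals the pattern of the URL that fetches the product categories: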

url1 = 'https://daojia.jd.com/client?lat=22.56705&lng=113.95371&city_id=1607&deviceToken=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&deviceId=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&channel=wx_xcx&platform=5.0.0&platCode=H5&appVersion=5.0.0&xcxVersion=3.6.2&appName=paidaojia&deviceModel=appmodel&functionId=station%2FgetStationDetail&isForbiddenDialog=false&isNeedDealError=false&isNeedDealLogin=false&body=%7B%22storeId%22%3A%2211653731%22%2C%22skuId%22%3A%22%22%2C%22orgCode%22%3A%2281372%22%2C%22activityId%22%3A%22%22%2C%22promotionType%22%3A%22%22%2C%22lgt%22%3A113.95371%2C%22lat%22%3A22.56705%7D&afsImg=&business='

The body parameter, once URL-decoded, is:

body = {"storeId":"11653731","skuId":"","orgCode":"81372","activityId":"","promotionType":"","lgt":113.95371,"lat":22.56705}
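
The decoding step can be reproduced with the Python standard library. This is a small sketch that takes the body= value copied from url1, percent-decodes it and parses the resulting JSON; the variable names are illustrative.

```python
import json
from urllib.parse import unquote

# The value of the body= parameter, copied from url1.
encoded_body = (
    "%7B%22storeId%22%3A%2211653731%22%2C%22skuId%22%3A%22%22%2C"
    "%22orgCode%22%3A%2281372%22%2C%22activityId%22%3A%22%22%2C"
    "%22promotionType%22%3A%22%22%2C%22lgt%22%3A113.95371%2C"
    "%22lat%22%3A22.56705%7D"
)

decoded = unquote(encoded_body)  # percent-decoding yields the JSON text
body = json.loads(decoded)       # parse the JSON text into a dict

print(decoded)
print(body["storeId"], body["orgCode"])
```

Going the other way, json.dumps followed by urllib.parse.quote turns a dict back into a value that can be placed in the URL.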

The values in body do not affect the category lookup, so sending a GET request to url1 is enough to get the data.

import requests

# url1: the category request captured in 4.2 (functionId=station/getStationDetail)
url1 = 'https://daojia.jd.com/client?lat=22.56705&lng=113.95371&city_id=1607&deviceToken=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&deviceId=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&channel=wx_xcx&platform=5.0.0&platCode=H5&appVersion=5.0.0&xcxVersion=3.6.2&appName=paidaojia&deviceModel=appmodel&functionId=station%2FgetStationDetail&isForbiddenDialog=false&isNeedDealError=false&isNeedDealLogin=false&body=%7B%22storeId%22%3A%2211653731%22%2C%22skuId%22%3A%22%22%2C%22orgCode%22%3A%2281372%22%2C%22activityId%22%3A%22%22%2C%22promotionType%22%3A%22%22%2C%22lgt%22%3A113.95371%2C%22lat%22%3A22.56705%7D&afsImg=&business='
ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
headers = {'User-Agent': ua}
res = requests.get(url1, headers=headers, timeout=10)  # timeout so a stalled request cannot hang
print(res.text)  # the returned data

[Screenshot: sample of the returned data]

4.3 Further analysis reveals the pattern of the URL that fetches the products under each category:

url2 = 'https://daojia.jd.com/client?lat=22.51424&lng=113.93068&city_id=1607&deviceToken=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&deviceId=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&channel=wx_xcx&platform=5.0.0&platCode=H5&appVersion=5.0.0&xcxVersion=3.6.2&appName=paidaojia&deviceModel=appmodel&functionId=storeIndexSearch%2FsearchByCategory&isForbiddenDialog=false&isNeedDealError=false&isNeedDealLogin=false&body=%7B%22storeId%22%3A%2211653731%22%2C%22orgCode%22%3A%2281372%22%2C%22skuId%22%3A%22%22%2C%22catIds%22%3A%5B%7B%22catId%22%3A%224644376%22%2C%22type%22%3A2%7D%5D%7D&afsImg=&business=undefined'

The body parameter, once URL-decoded, is:

body = {"storeId":"11653731","orgCode":"81372","skuId":"","catIds":[{"catId":"4644376","type":2}]}

The catId values can be extracted from the data returned by url1; passing in a different catId returns the products of the corresponding category.
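
The layout of the url1 response is not reproduced in this post, so the following is only a sketch of one way to pull the catId values out without relying on that layout: it walks the parsed JSON recursively and collects every value stored under the key catId. The function name collect_values and the sample data (including the keys result, cateList, title and childCategoryList) are invented for illustration and are not the real response format.

```python
def collect_values(node, key="catId"):
    """Recursively collect every value stored under `key` in nested JSON data."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: nested categories may sit below any key.
            found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Invented sample for illustration only; NOT the real response layout.
sample = {
    "result": {
        "cateList": [
            {
                "catId": "4644375",
                "title": "A",
                "childCategoryList": [{"catId": "4644376", "title": "B"}],
            }
        ]
    }
}

print(collect_values(sample))
```

With the real data, the argument would be the parsed response, for example json.loads(res.text) from the request in 4.2.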

import json
import requests
from urllib.parse import quote

def get_product(cateid2):  # cateid2: the id of a second-level category
    body = {
        "storeId": "11653731",
        "orgCode": "81372",
        "skuId": "",
        "catIds": [{"catId": cateid2, "type": 2}]}
    # Compact separators reproduce the captured body exactly (no spaces)
    body = json.dumps(body, separators=(',', ':'))
    body = quote(body)
    base_url = 'https://daojia.jd.com/client?lat=22.51424&lng=113.93068&city_id=1607&deviceToken=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&deviceId=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&channel=wx_xcx&platform=5.0.0&platCode=H5&appVersion=5.0.0&xcxVersion=3.6.2&appName=paidaojia&deviceModel=appmodel&functionId=storeIndexSearch%2FsearchByCategory&isForbiddenDialog=false&isNeedDealError=false&isNeedDealLogin=false&body={}&afsImg=&business=undefined'.format(body)
    print(base_url)  # the URL built for this catId
    ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    headers = {'User-Agent': ua}
    res = requests.get(base_url, headers=headers, timeout=10)
    print(res.text)
    return res.text  # hand the raw response back to the caller as well
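
As an alternative to formatting one long URL string, the query can be assembled from a dict with urllib.parse.urlencode, which percent-encodes every value. This is a sketch under the assumption that the server accepts the same parameters as in the captured url2; the function name build_category_url and its default arguments are illustrative.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://daojia.jd.com/client"
DEVICE = "b2e951ed-e72e-4a9a-b9ca-cd69348c3337"  # device id seen in the captured URL


def build_category_url(cat_id, store_id="11653731", org_code="81372"):
    """Build the searchByCategory URL for one second-level category id."""
    body = {
        "storeId": store_id,
        "orgCode": org_code,
        "skuId": "",
        "catIds": [{"catId": cat_id, "type": 2}],
    }
    params = {
        "lat": "22.51424", "lng": "113.93068", "city_id": "1607",
        "deviceToken": DEVICE, "deviceId": DEVICE,
        "channel": "wx_xcx", "platform": "5.0.0", "platCode": "H5",
        "appVersion": "5.0.0", "xcxVersion": "3.6.2",
        "appName": "paidaojia", "deviceModel": "appmodel",
        "functionId": "storeIndexSearch/searchByCategory",
        "isForbiddenDialog": "false", "isNeedDealError": "false",
        "isNeedDealLogin": "false",
        # Compact JSON, so the encoded body contains no spaces.
        "body": json.dumps(body, separators=(",", ":")),
        "afsImg": "", "business": "undefined",
    }
    return BASE + "?" + urlencode(params)


url = build_category_url("4644376")
print(url)

# Round-trip check: the body parameter decodes back to the same structure.
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(json.loads(query["body"][0]))
```

Keeping the parameters in a dict makes it easier to change the store or the coordinates later without editing an encoded string by hand.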

[Screenshot: sample of the returned data]

4.4 Organise the data and write it out as a table (CSV):

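import csv

# catename1 / catename2 (first- and second-level category names) and
# searchResultVOList (the product list taken from the 4.3 response) are
# assumed to be defined by the surrounding code.
filename = '{}.csv'.format(catename1)
# newline='' keeps the csv module from emitting blank rows on some platforms
with open(filename, 'a', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Product name', 'Price (CNY)', 'Monthly sales', 'Image', 'Second-level category', 'First-level category'])

    for product in searchResultVOList:
        print(product)
        name = product['skuName']
        img = product['imgUrl']
        price = product['realTimePrice']
        sale = product['monthSales']
        writer.writerow([name, price, sale, img, catename2, catename1])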
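
Because the file is opened in append mode, running the snippet above once per second-level category would write the header row again each time. A minimal sketch of one way to avoid that is to write the header only when the file is new or empty. The function name append_products is illustrative; the field names skuName, realTimePrice, monthSales and imgUrl are the ones used above.

```python
import csv
import os

HEADER = ["Product name", "Price (CNY)", "Monthly sales", "Image",
          "Second-level category", "First-level category"]


def append_products(path, products, catename2, catename1):
    """Append product rows to a CSV file, writing the header only once."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(HEADER)
        for p in products:
            # .get() leaves a cell empty instead of raising on a missing field
            writer.writerow([p.get("skuName"), p.get("realTimePrice"),
                             p.get("monthSales"), p.get("imgUrl"),
                             catename2, catename1])
```

It would be called once per second-level category, for example append_products('{}.csv'.format(catename1), searchResultVOList, catename2, catename1).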

[Screenshot: sample of the output data]

4.5 The complete code is at: https://github.com/HongDanni/jd_daojia

Original post: https://www.cnblogs.com/hongdanni/p/11662869.html