Scraping item images and details from a Meilishuo (美丽说) shop with Python (crawler)

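To fetch pages we use requests. Most sites today load their content dynamically, so we opened the Firefox developer tools (F12) and looked for the URL that is requested dynamically: http://www.meilishuo.com/aj/shop_list/goods?frame=1&page=0&shop_id=1001072849. Only the frame parameter changes, so all we need to do is request this URL with different frame values. The request headers contain an nt parameter whose value changes; our guess is that it varies over time and expires after a while. The question is how to obtain the first nt value. It turns out that the page returned by http://www.meilishuo.com/shop/1001072849 contains the nt value, so that problem is solved.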
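As a small offline illustration of the nt step described above, the sketch below pulls the token out of an HTML string and copies it into a header dictionary. The sample HTML is made up for the example, and the helper names `extract_nt` and `with_nt` are hypothetical; the real page may embed the value differently.

```python
import re

# Hypothetical excerpt standing in for the HTML of the shop page.
SAMPLE_HTML = '<script>var cfg = {"uid":"0","nt":"abc123XYZ","v":"1"};</script>'


def extract_nt(html):
    """Return the nt token embedded in the page, or None if it is absent."""
    match = re.search(r'"nt":"(.+?)"', html)
    return match.group(1) if match else None


def with_nt(headers, nt):
    """Return a copy of the headers with the nt field added."""
    updated = dict(headers)
    updated['nt'] = nt
    return updated


nt = extract_nt(SAMPLE_HTML)
headers = with_nt({'X-Requested-With': 'XMLHttpRequest'}, nt)
print(headers)
```

Returning None when the token is missing lets the caller stop early instead of failing later with an AttributeError on a missing regex match.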

# -*- coding: utf-8 -*-
import json
import re

import requests


if __name__ == "__main__":
    # Reuse one session so cookies and connections are shared across requests.
    session = requests.Session()
    search_header = {'Host': 'www.meilishuo.com',
                     'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0',
                     'Accept': 'application/json, text/javascript, */*; q=0.01',
                     'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
                     'Accept-Encoding': 'gzip, deflate',
                     'X-Requested-With': 'XMLHttpRequest',  # mark the request as AJAX
                     'Referer': 'http://www.meilishuo.com/shop/1001072849',
                     'Connection': 'keep-alive'}
    response = session.get('http://www.meilishuo.com/shop/1001072849?frm=rate_to_shop')

    # The shop page embeds the current nt value; copy it into the request headers.
    info = re.search(r'"nt":"(.+?)",', response.text)
    search_header['nt'] = info.group(1)
    # The static page also embeds a JSON config describing the shop's item list.
    info1 = re.search(r'<script>Meilishuo\.config\.poster0 = (.+?);fml\.vars\.notFluid = true;</script>', response.text)
    b = json.loads(info1.group(1))
    totalNum = b['totalNum']  # total item count, used to derive the number of pages
    page = int(totalNum) // 20  # the script assumes 20 items per page
    for i in range(page + 1):
        a = session.get('http://www.meilishuo.com/aj/shop_list/goods?frame=' + str(i) + '&page=0&shop_id=1001072849', headers=search_header)
        print(a.headers)
        j_a = json.loads(a.text)
        print(len(j_a['tInfo']))
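The script derives the number of requests from totalNum by dividing by 20 and looping one extra time, which issues one empty request whenever totalNum is an exact multiple of 20. A minimal sketch of the same arithmetic with ceiling division is shown here; the function name `request_count` is hypothetical, and the page size of 20 is taken from the script rather than from any documented API.

```python
def request_count(total_items, per_page=20):
    """Number of requests needed to cover total_items, rounding up."""
    if total_items <= 0:
        return 0
    return (total_items + per_page - 1) // per_page


for total in (19, 40, 41):
    floor_plus_one = total // 20 + 1  # what the script's loop does
    print(total, floor_plus_one, request_count(total))
```

For 40 items the floor-plus-one approach makes three requests while ceiling division makes two; for 19 and 41 items both agree.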

To be continued. The next step is to take each item's image URL and save the image as a local file.

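# Inside the page loop, after j_a has been parsed:
for key in j_a['tInfo']:
    r = requests.get(key['goods_img'])  # download the item image
    # Name the file after the item title; titles containing characters such as "/" need cleaning first.
    with open(key['goods_title'] + ".jpg", "wb") as title:
        title.write(r.content)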
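Item titles are used directly as file names in the loop above, so a title containing a path separator or another reserved character would make open() fail. One possible way to clean a title first is sketched below; the helper `safe_filename`, the replacement character, and the length limit are assumptions made for illustration, not part of the original script.

```python
import re


def safe_filename(title, ext=".jpg", max_len=100):
    """Replace characters that are invalid in file names and append ext."""
    # Swap reserved and control characters for underscores.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', '_', title).strip(' .')
    if not cleaned:
        cleaned = 'untitled'
    return cleaned[:max_len] + ext


print(safe_filename('dress 2015/summer: new'))
```

In the download loop this would be used as open(safe_filename(key['goods_title']), "wb"). Non-ASCII titles pass through unchanged.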

 

Original article: https://www.cnblogs.com/ggbond1988/p/4890497.html