Scraping product information from JD.com with Scrapy

Software environment:

gevent (1.2.2)
greenlet (0.4.12)
lxml (4.1.1)
pymongo (3.6.0)
pyOpenSSL (17.5.0)
requests (2.18.4)
Scrapy (1.5.0)
SQLAlchemy (1.2.0)
Twisted (17.9.0)
wheel (0.30.0)

1. Create the Scrapy project
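Assuming the project is called MyScrapy (the package name used in ITEM_PIPELINES in the settings below), it is created with:

scrapy startproject MyScrapy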

2. Create the JD spider. Change into the project directory and run:

scrapy genspider jd www.jd.com

This creates a .py file named after your spider in the spiders directory: jd.py. This is the file where you write the spider's request and response logic.

3. Configuring jd.py

The URL pattern of the JD search page we want to scrape:
https://search.jd.com/Search?
Since the keyword may be Chinese, it has to be URL-encoded.
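A quick illustration of that encoding (assuming Python 3's urllib.parse):

from urllib.parse import urlencode

# Chinese keywords are percent-encoded as UTF-8, e.g.:
urlencode({"keyword": "手机", "enc": "utf-8"})
# -> 'keyword=%E6%89%8B%E6%9C%BA&enc=utf-8'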
1. First write a start_requests method that sends the initial request and registers parse_index as the callback; the response is passed to that callback, and its type is <class 'scrapy.http.response.html.HtmlResponse'>.
    def start_requests(self):
        # Build a search URL of the form https://search.jd.com/Search?keyword=...&enc=utf-8
        # and wrap it in scrapy.Request; the response is handed to the callback parse_index
        url = 'https://search.jd.com/Search?'
        # urlencode takes care of encoding a Chinese keyword
        url += urlencode({"keyword": self.keyword, "enc": "utf-8"})
        yield scrapy.Request(url,
                             callback=self.parse_index,
                             )
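self.keyword must be set on the spider before start_requests runs. A minimal sketch, assuming the keyword is passed as a spider argument on the command line (scrapy crawl jd -a keyword=手机) and that the spider class generated by genspider is named JdSpider:

    def __init__(self, keyword=None, *args, **kwargs):
        # Hypothetical: keyword arrives via `-a keyword=...` on the command line
        super(JdSpider, self).__init__(*args, **kwargs)
        self.keyword = keyword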
2. parse_index extracts every product detail-page URL from the response, loops over those URLs sending a request for each, and registers parse_detail as the callback that processes the result.
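A minimal sketch of parse_index; the XPath used to pick out the product links is an assumption about the search-page layout, not something confirmed against JD's current markup:

def parse_index(self, response):
    # Assumed selector: each product's title link inside the result list
    for href in response.xpath('//div[@id="J_goodsList"]//div[@class="p-name"]/a/@href').extract():
        # hrefs are protocol-relative, e.g. //item.jd.com/3726834.html
        yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)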
def parse_detail(self, response):
    """
    Callback for parse_index: receives the detail-page response and parses it
    :param response:
    :return:
    """
    jd_url = response.url
    # The SKU is the numeric part of the detail-page URL, e.g. .../3726834.html
    sku = jd_url.split('/')[-1].replace(".html", "")
    # The price is loaded via JSONP; the request URL can be found in the dev tools under the script requests
    price_url = "https://p.3.cn/prices/mgets?skuIds=J_" + sku
    response_price = requests.get(price_url)
    # extraParam={"originid":"1"}  skuIds=J_3726834
    # The delivery info is also fetched via JSONP, but I haven't worked out how its parameters are
    # generated, so a fixed parameter set is used here; if anyone knows, please share.
    express_url = "https://c0.3.cn/stock?skuId=3726834&area=1_72_4137_0&cat=9987,653,655&extraParam={%22originid%22:%221%22}"
    response_express = requests.get(express_url)
    response_express = json.loads(response_express.text)['stock']['serviceInfo'].split('>')[1].split('<')[0]
    title = response.xpath('//*[@class="sku-name"]/text()').extract_first().strip()
    price = json.loads(response_price.text)[0]['p']
    delivery_method = response_express
    # Store the data we need in an Item, ready for the storage step later on
    item = JdItem()
    item['title'] = title
    item['price'] = price
    item['delivery_method'] = delivery_method

    # Return the item; when the engine sees an Item being returned, it hands it to the pipelines
    return item
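The snippets above rely on a handful of imports at the top of jd.py. A sketch, assuming Python 3 and that the project package is MyScrapy (the name used in ITEM_PIPELINES below); the items module name is Scrapy's default:

import json
import requests
import scrapy
from urllib.parse import urlencode
from MyScrapy.items import JdItem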

4. Configuring items.py

import scrapy


class JdItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # fields scraped from the JD detail page
    title = scrapy.Field()
    price = scrapy.Field()
    delivery_method = scrapy.Field()

5. Configuring pipelines.py

from pymongo import MongoClient


class MongoPipeline(object):
    """
    Pipeline that saves items to MongoDB
    """

    def __init__(self, db, collection, host, port, user, pwd):
        """
        Store the connection settings
        :param db: database name
        :param collection: collection (table) name
        :param host: server IP address
        :param port: server port
        :param user: username for login
        :param pwd: password for login
        """
        self.db = db
        self.collection = collection
        self.host = host
        self.port = port
        self.user = user
        self.pwd = pwd

    @classmethod
    def from_crawler(cls, crawler):
        """
        Classmethod used to read the configuration from settings
        :param crawler:
        :return:
        """
        db = crawler.settings.get('DB')
        collection = crawler.settings.get('COLLECTION')
        host = crawler.settings.get('HOST')
        port = crawler.settings.get('PORT')
        user = crawler.settings.get('USER')
        pwd = crawler.settings.get('PWD')

        return cls(db, collection, host, port, user, pwd)

    def open_spider(self, spider):
        """
        Runs once when the spider starts
        :param spider:
        :return:
        """
        # Connect to the database
        self.client = MongoClient("mongodb://%s:%s@%s:%s" % (
            self.user,
            self.pwd,
            self.host,
            self.port
        ))

    def process_item(self, item, spider):
        """
        Store the data in the database
        :param item:
        :param spider:
        :return:
        """
        # Get the item data and convert it to a dict
        d = dict(item)
        # Don't save records that contain empty values
        if all(d.values()):
            # Save to MongoDB (save() still works in pymongo 3.6 but is deprecated; insert_one() is the modern replacement)
            self.client[self.db][self.collection].save(d)
        return item

        # To discard the item so that later pipelines do not process it:
        # raise DropItem()
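The pipeline opens a MongoClient in open_spider but never closes it; an optional close_spider hook can release the connection when the spider finishes:

    def close_spider(self, spider):
        """
        Runs once when the spider closes
        """
        # Release the MongoDB connection opened in open_spider
        self.client.close()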

6. Configuring settings.py

# database server
DB = "jd"
COLLECTION = "goods"
HOST = "127.0.0.1"
PORT = 27017
USER = "root"
PWD = "123"
ITEM_PIPELINES = {
   'MyScrapy.pipelines.MongoPipeline': 300,
}
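With the settings in place, the spider can be run from the project directory; assuming the keyword is passed as a spider argument as sketched above:

scrapy crawl jd -a keyword=手机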

Original post: https://www.cnblogs.com/eric_yi/p/8343721.html