Scrapy basics

The Scrapy framework

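Installing and starting a project
- pip install wheel : lets pip install Python packages from .whl files, which work much like .zip archives
- Download the Twisted wheel that matches your Python version (it can be found with a Baidu search)
- pip install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
- pip install scrapy
- pip install pywin32
- scrapy startproject ProName : create the project
- cd ProName : change into the project directory
- scrapy genspider spiderName www.xxx.com : create a new spider file
- scrapy crawl spiderName : run the spider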
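Data parsing
- [1] extract() and extract_first(): extract() returns a list of every matched string, while extract_first() returns the first match, or None if nothing matched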

Persistent storage

Persistent storage via the command line

The return value of the parse method is written to a file on the local disk:

scrapy crawl spiderName -o filepath

 def parse_item(self, response):
     all_data = []
     li_list = response.xpath('//*[@id="LBMBL"]/ul/li')
     for li in li_list:
         # extract_first() returns the first matching string, or None
         title_text = li.xpath('.//div[@class="title"]/a/text()').extract_first()
         title_url = li.xpath('.//div[@class="title"]/a/@href').extract_first()
         dic = {
             'title_text': title_text,
             'title_url': title_url,
         }
         all_data.append(dic)
     # return after the loop so that every <li> is collected, not just the first
     return all_data

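Persistent storage via pipelines

Coding workflow
1. Parse the data in the spider file
2. Define the relevant fields in items.py
3. Store the parsed data in an item object
4. Submit the item object to the pipeline
5. Receive the item in the pipeline class's process_item method, then persist it in whatever form is needed
6. Enable the pipeline class in settings
  - ```
   ITEM_PIPELINES = {
      'MoTe.pipelines.MotePipeline': 300, # priority: the lower the value, the earlier the pipeline runs
   }
```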

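Each pipeline class defines one way of persisting data.

Returning item from process_item passes it to the next pipeline class to run, which is what lets several pipeline classes all take effect.
Implement the open_spider and close_spider hooks so that opening and closing files or database connections happens only once, which is more efficient.
Note: when the stored data contains both single and double quotes, escape them with the escape_string() function from pymysql or MySQLdb; otherwise MySQL reports syntax error 1064.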

 ITEM_PIPELINES = {
    'MoTe.pipelines.MotePipeline': 300,
    'MoTe.pipelines.MysqlPipeline': 301,
 }
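To make the hand-off between pipelines concrete, here is a minimal simulation in plain Python. It is not Scrapy's real pipeline manager: `run_pipelines`, `UpperCasePipeline` and `CollectPipeline` are hypothetical names invented for this sketch, which only assumes that pipelines run in ascending priority order and that each one receives what the previous one returned.

```python
# Simplified stand-in for Scrapy's pipeline handling (illustration only).

class UpperCasePipeline:
    def process_item(self, item, spider):
        item['title_text'] = item['title_text'].upper()
        return item  # hand the item to the next pipeline


class CollectPipeline:
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(dict(item))
        return item


def run_pipelines(item, pipelines_with_priority, spider=None):
    """Pass item through the pipelines in ascending priority order."""
    ordered = sorted(pipelines_with_priority, key=lambda pair: pair[1])
    for pipeline, _priority in ordered:
        item = pipeline.process_item(item, spider)
        if item is None:
            # In this sketch, a pipeline that forgets to return the item
            # leaves nothing for the later pipelines to store.
            break
    return item


collector = CollectPipeline()
result = run_pipelines(
    {'title_text': 'hello', 'title_url': 'http://www.xxx.com/1'},
    [(collector, 301), (UpperCasePipeline(), 300)],  # 300 runs before 301
)
print(result)
print(collector.seen)
```

The collector is listed first but runs second because its priority value is larger, so it sees the already upper-cased title.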

Local file storage

 class MotePipeline(object):
     fp = None
 
     # Called once when the spider opens; after this the pipeline
     # can receive the items submitted by the spider.
     def open_spider(self, spider):
         print('start spider')
         self.fp = open('./a.txt', 'w', encoding='utf-8')
 
     def process_item(self, item, spider):
         title = item['title_text']
         url = item['title_url']
         # print(title, url)
         self.fp.write(title + url + '\n')  # one record per line
         return item  # pass the item to the next pipeline class to run
 
     # Called once when the spider closes
     def close_spider(self, spider):
         print('finish spider')
         self.fp.close()
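The same three hooks can be exercised without running a crawl. The sketch below is a variant that writes one JSON object per line, so titles and URLs stay separable when read back. `JsonLinesPipeline`, its `path` argument and the file name `items.jl` are assumptions made for this example. Scrapy creates pipelines without arguments, so in a real project the path would need a default (as here) or would come from settings.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Writes each item as one JSON object per line (hypothetical class)."""

    def __init__(self, path='./items.jl'):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Runs once: open the file before any item arrives.
        self.fp = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # keep the item flowing to later pipelines

    def close_spider(self, spider):
        # Runs once: close the file after the last item.
        self.fp.close()


# Drive the hooks by hand, in the order Scrapy would call them.
path = os.path.join(tempfile.mkdtemp(), 'items.jl')
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({'title_text': 'A "quoted" title', 'title_url': 'http://www.xxx.com/1'}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding='utf-8') as fp:
    rows = [json.loads(line) for line in fp]
print(rows)
```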

Database storage

 # MySQL
 import pymysql
 # pymysql.escape_string was removed in PyMySQL 1.0; the converters module still provides it
 from pymysql.converters import escape_string
 
 class MysqlPipeline(object):
     conn = None
     cursor = None
 
     def open_spider(self, spider):
         self.conn = pymysql.Connection(host='127.0.0.1', port=3306, user='root', password='123', db='mote', charset='utf8')
         self.cursor = self.conn.cursor()  # create the cursor once, not once per item
         print(self.conn)
 
     def process_item(self, item, spider):
         title = item['title_text']
         url = item['title_url']
 
         # escape quotes in the values before building the statement
         sql = 'insert into cmodel values("%s","%s")' % (escape_string(title), escape_string(url))
         try:
             self.cursor.execute(sql)
             self.conn.commit()
         except Exception as e:
             print(e)
             self.conn.rollback()
         return item
 
     def close_spider(self, spider):
         self.cursor.close()
         self.conn.close()
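An alternative to escaping quotes by hand is to let the driver bind the values. The sketch below uses the standard-library sqlite3 module as a stand-in so that it runs without a MySQL server. This is a swapped-in technique, not what the notes above do. With pymysql the call has the same shape, `cursor.execute(sql, params)`, but the placeholder is `%s` instead of `?`.

```python
import sqlite3

# In-memory database standing in for the MySQL table "cmodel".
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cmodel (title TEXT, url TEXT)')

title = 'He said "hi" and it\'s fine'  # contains both kinds of quotes
url = 'http://www.xxx.com/1'

# Placeholders keep the values out of the SQL text, so quotes need no escaping.
conn.execute('INSERT INTO cmodel VALUES (?, ?)', (title, url))
conn.commit()

rows = conn.execute('SELECT title, url FROM cmodel').fetchall()
print(rows)
conn.close()
```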
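 # Redis
 from redis import Redis
 
 class RedisPipeline(object):
     conn = None
 
     def open_spider(self, spider):
         self.conn = Redis(host='127.0.0.1', port=6379)
         print(self.conn)
 
     def process_item(self, item, spider):
         # pushing the item directly relies on redis-py 2.x accepting it as a value
         self.conn.lpush('news', item)
         return item  # pass the item on to any later pipeline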

redis-py 3.0 and later does not accept a dict as a value; downgrading to a 2.x release works around this:
pip install -U redis==2.10.6
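Another option, which avoids pinning an old client, is to serialize the item to a JSON string before pushing it. The sketch below is only an illustration: `FakeListClient` is a hypothetical stand-in that mimics the type check and records `lpush` calls so the example runs without a Redis server, and `JsonRedisPipeline` is an invented name.

```python
import json


class FakeListClient:
    """Hypothetical stand-in for a Redis client; it only records lpush calls."""

    def __init__(self):
        self.lists = {}

    def lpush(self, name, *values):
        for value in values:
            if not isinstance(value, (bytes, str, int, float)):
                raise TypeError('value must be bytes, str, int or float')
            self.lists.setdefault(name, []).insert(0, value)
        return len(self.lists.get(name, []))


class JsonRedisPipeline:
    def __init__(self, conn):
        self.conn = conn

    def process_item(self, item, spider):
        # A JSON string is an accepted value type, unlike a dict.
        self.conn.lpush('news', json.dumps(dict(item), ensure_ascii=False))
        return item


client = FakeListClient()
pipeline = JsonRedisPipeline(client)
pipeline.process_item({'title_text': 'a', 'title_url': 'http://www.xxx.com/1'}, spider=None)
stored = json.loads(client.lists['news'][0])
print(stored)
```

Whoever reads the list later has to call `json.loads` on each value, which is the price of this approach.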

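Scrapy ships a pipeline class, ImagesPipeline, that can be subclassed to request image resources and store them persistently.

Coding workflow

1. Parse the image URLs in the spider file
2. Wrap each image URL in an item and submit it to the pipeline class
3. Define a custom pipeline class in the pipelines file, with ImagesPipeline as its parent class

Override three methods: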

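 import scrapy
 from scrapy.pipelines.images import ImagesPipeline
 
 class ImgproPipeline(ImagesPipeline):
 
     # Issues the download request for each image
     def get_media_requests(self, item, info):
         # request the image URL stored in the item
         print(item)
         yield scrapy.Request(url=item['img_src'])
 
     # Returns the storage path relative to IMAGES_STORE; here just the file name
     # (newer Scrapy releases also pass the item as a keyword argument)
     def file_path(self, request, response=None, info=None, *, item=None):
         return request.url.split('/')[-1]
 
     # Passes the item to the next pipeline class to run
     def item_completed(self, results, item, info):
         return item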
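The `split('/')[-1]` approach in file_path keeps any query string in the file name. As a small sketch of a stricter variant, the helper below uses only the URL path. `image_name_from_url` is a hypothetical helper and the URL is made up for illustration.

```python
from posixpath import basename
from urllib.parse import urlsplit


def image_name_from_url(url):
    """Return the last path segment of url, ignoring query string and fragment."""
    return basename(urlsplit(url).path)


sample = 'https://www.xxx.com/imgs/cat.jpg?size=large'
print(image_name_from_url(sample))  # cat.jpg
print(sample.split('/')[-1])        # cat.jpg?size=large
```

Inside file_path the call would be `image_name_from_url(request.url)`.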

Enable the pipeline in the settings file and specify the storage directory with IMAGES_STORE = './image'
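Put together, the relevant part of settings.py might look like the fragment below. The project name `ImgPro` is an assumption made for illustration; use the module path of your own project. As far as I know, ImagesPipeline also needs the Pillow package to be installed.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # Hypothetical project name "ImgPro"; replace with your own module path.
    'ImgPro.pipelines.ImgproPipeline': 300,
}

# Directory where the images are saved; file_path() results are relative to it.
IMAGES_STORE = './image'
```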

Referer-based hotlink protection

 DEFAULT_REQUEST_HEADERS = {
     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
     'Accept-Language': 'en',
     'Referer':'https://image.baidu.com/' # set the referring page to the site's own page
 }

Sending requests manually

 yield scrapy.Request(new_url, callback=self.parse)
 # Written inside the parse callback so that parse calls itself; add a condition to end the recursion
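The stopping condition is usually just page bookkeeping. The sketch below isolates that part in plain Python. `URL_TEMPLATE`, `MAX_PAGE` and `next_page_url` are all hypothetical names chosen for this example.

```python
URL_TEMPLATE = 'https://www.xxx.com/list/page_%d.html'  # hypothetical URL pattern
MAX_PAGE = 5                                             # hypothetical last page


def next_page_url(current_page):
    """Return the URL of the following page, or None once MAX_PAGE is reached."""
    if current_page >= MAX_PAGE:
        return None
    return URL_TEMPLATE % (current_page + 1)


# Walk the pages the way repeated parse() callbacks would.
page = 1
visited = []
while True:
    url = next_page_url(page)
    if url is None:
        break  # the condition that ends the recursion
    visited.append(url)
    page += 1
print(visited)
```

Inside parse, the equivalent step would be to yield `scrapy.Request(url, callback=self.parse)` only when `next_page_url` returns a URL.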

Sending POST requests in Scrapy

 yield scrapy.FormRequest(url, callback=callback, formdata=formdata)  # formdata holds the request parameters

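Sending POST requests for the start URLs

Override the parent-class method:

 def start_requests(self):
     for url in self.start_urls:
         # pass the callback itself (self.parse), not the result of calling it;
         # fill formdata with the actual POST parameters
         yield scrapy.FormRequest(url, callback=self.parse, formdata={})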
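One detail worth checking when filling in formdata: as far as I know, FormRequest expects the keys and values to be strings, so numbers such as page indexes should be converted first. A minimal sketch of such a conversion follows. `to_formdata` is a hypothetical helper, not part of Scrapy, and dropping None values is a choice made for this example.

```python
def to_formdata(params):
    """Convert a dict of mixed-type values into a str-to-str mapping, dropping None values."""
    return {str(key): str(value) for key, value in params.items() if value is not None}


formdata = to_formdata({'kw': 'dog', 'page': 2, 'token': None})
print(formdata)
```

The result can then be passed as `formdata=formdata` in the FormRequest call.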

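How can crawl efficiency be improved in Scrapy?

 Increase concurrency:
         Scrapy allows 16 concurrent requests by default (CONCURRENT_REQUESTS), and this can be raised where appropriate. Setting CONCURRENT_REQUESTS = 100 in the settings file raises the limit to 100.
         
 Lower the log level:
     Scrapy prints a large amount of log output while running. To reduce CPU usage, limit the log output to INFO or ERROR. In the settings file write: LOG_LEVEL = 'ERROR'
 
 Disable cookies:
     If cookies are not really needed, disabling them while crawling reduces CPU usage and improves crawl efficiency. In the settings file write: COOKIES_ENABLED = False
 
 Disable retries:
     Re-requesting failed HTTP requests (retrying) slows the crawl down, so retries can be disabled. In the settings file write: RETRY_ENABLED = False
 
 Reduce the download timeout:
     When crawling a very slow link, a shorter download timeout lets a stuck request be abandoned quickly, which improves efficiency. In the settings file write: DOWNLOAD_TIMEOUT = 10 for a timeout of 10 s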
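Collected in one place, the options above would look roughly like this in settings.py. The values are the ones used in these notes rather than recommendations, and suitable numbers depend on the target site.

```python
# settings.py (fragment) combining the options discussed above
CONCURRENT_REQUESTS = 100   # allow more requests in flight
LOG_LEVEL = 'ERROR'         # only log errors
COOKIES_ENABLED = False     # skip cookie handling
RETRY_ENABLED = False       # do not re-request failed pages
DOWNLOAD_TIMEOUT = 10       # seconds before a slow download is abandoned
```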