Scrapy notes: persistence and using Feed exports

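The first thing to be clear about is that every FeedExporter is a class that wraps the usual I/O methods (Scrapy's own documentation calls them Item Exporters, and they live in scrapy.exporters). So how do you actually produce output? Technically, you could call an exporter to store data at any step that generates an item, but to fit Scrapy's architecture, exporters are normally used inside a Pipeline.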

Every exporter is used in much the same way:

Write the relevant configuration in settings.py,

then call the exporter in the pipeline:

    exporter.start_exporting()

    exporter.export_item(item)

    exporter.finish_exporting()

Scrapy takes care of everything else, so no further setup is needed.

Because items are usually exported continuously over the whole crawl, the calls that start and finish exporting can be placed in spider_opened and spider_closed.
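To make the three-call lifecycle concrete, here is a minimal sketch that uses a toy exporter written only with the standard library. ToyJsonArrayExporter is a hypothetical stand-in, not a Scrapy class; it only mimics the method names, to show why start and finish each need to run exactly once when the output is a single JSON array.

```python
import io
import json


class ToyJsonArrayExporter:
    """Hypothetical stand-in (not Scrapy code): writes all items as one JSON array."""

    def __init__(self, file):
        self.file = file          # a binary file-like object
        self.first_item = True

    def start_exporting(self):
        # Opening bracket is written once, before any item.
        self.file.write(b"[")

    def export_item(self, item):
        # Separate items with commas, but not before the first one.
        if not self.first_item:
            self.file.write(b",")
        self.first_item = False
        self.file.write(json.dumps(item).encode("utf-8"))

    def finish_exporting(self):
        # Closing bracket is written once, after the last item.
        self.file.write(b"]")


buf = io.BytesIO()
exporter = ToyJsonArrayExporter(buf)
exporter.start_exporting()
for item in ({"ip": "10.0.0.1"}, {"ip": "10.0.0.2"}):
    exporter.export_item(item)
exporter.finish_exporting()
exported = buf.getvalue()
print(exported)
```

If finish_exporting were skipped, the closing bracket would be missing and the file would not be valid JSON, which is why the finish call belongs in the spider's close handler.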

Taking the output of items to a JSON file as an example, the relevant configuration and code are as follows:

Configuration in settings.py:

FEED_FORMAT = 'json'  # output format
# Both entries below are already part of Scrapy's defaults
FEED_EXPORTERS_BASE = {
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
}

The pipeline code:

from scrapy import signals
from scrapy.exporters import JsonLinesItemExporter


class MyCustomPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):  # builds the pipeline instance
        pipeline = cls()
        # Connect spider_opened to its signal so it runs when the spider opens
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_ip.json' % spider.name, 'w+b')  # open the output file in binary mode
        self.files[spider] = file  # keep a reference to the file object
        self.exporter = JsonLinesItemExporter(file)  # instantiate the exporter (one JSON object per line)
        self.exporter.start_exporting()  # start exporting

    def spider_closed(self, spider):
        self.exporter.finish_exporting()  # finish exporting
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)  # export the item itself
        return item

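So how do you write items to a MySQL database?

    None of the exporters that ship with Scrapy writes to a relational database, so you have to define the handling yourself in pipelines. Because Scrapy is built on the asynchronous Twisted framework, calling a conventional MySQL client library such as MySQLdb directly will block. For this, Twisted provides an asynchronous database interface, twisted.enterprise.adbapi, which runs the blocking calls in a thread pool and talks to the database through a connection pool.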

from twisted.enterprise import adbapi
self.dbpool = adbapi.ConnectionPool(xxxx)  # create the connection pool object
yield self.dbpool.runInteraction(interaction_function, arg)  # returns a Deferred for the asynchronous database interaction

In practice:

Suppose the configuration file settings.py already defines

MYSQL_PIPELINE_URI = 'mysql://root:root@localhost/proxyip'  # MySQL URI

The code in pipelines.py:

import traceback

import dj_database_url
import MySQLdb
from scrapy.exceptions import NotConfigured
from twisted.enterprise import adbapi
from twisted.internet import defer


class MySQLPipeline(object):

    def __init__(self, mysql_url):
        '''Create the connection pool.'''
        # Keep the URL for later reference
        self.mysql_url = mysql_url
        # Report a connection error only once
        self.report_connection_error = True
        # Parse the MySQL URI and initialise dbpool
        conn_kwargs = MySQLPipeline.parse_mysql_url(mysql_url)
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
                                            charset='utf8',
                                            use_unicode=True,
                                            connect_timeout=5,
                                            **conn_kwargs)

    @classmethod
    def from_crawler(cls, crawler):
        '''Read the settings from the crawler.'''
        # Get url from settings
        mysql_url = crawler.settings.get('MYSQL_PIPELINE_URI', None)
        # If no URI is configured, disable this pipeline
        if not mysql_url:
            raise NotConfigured
        # Build the MySQLPipeline instance
        return cls(mysql_url)

    def close_spider(self, spider):
        '''Close the connection pool when the spider closes.'''
        self.dbpool.close()

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        '''Process the item and write it to the MySQL database.'''
        logger = spider.logger
        try:
            yield self.dbpool.runInteraction(MySQLPipeline._do_replace, item)
        except MySQLdb.OperationalError:
            if self.report_connection_error:
                logger.error('Can not connect to MySQL:%s' % self.mysql_url)
                self.report_connection_error = False
        except Exception:
            # Any other failure: log the traceback (the original used a bare
            # else branch here, which only ran when nothing had failed)
            logger.error(traceback.format_exc())
        # Hand the item on to the next stage
        return item

    @staticmethod
    def _do_replace(tx, item):
        '''Run the actual SQL statement (a plain INSERT, despite the method name).'''
        sql = '''INSERT INTO ips(ip, port, protocol, speed, auth_time, is_transparent) VALUES(%s, %s, %s, %s, %s, %s)'''
        args = (
            item['ip'],
            item['port'],
            item['protocol'],
            item['speed'],
            item['auth_time'],
            item['is_transparent'],
            )
        tx.execute(sql, args)

    @staticmethod
    def parse_mysql_url(mysql_url):
        '''Extract the connection parameters from the URL for adbapi's connection pool.'''
        params = dj_database_url.parse(mysql_url)
        conn_kwargs = {}
        conn_kwargs['host'] = params['HOST']
        conn_kwargs['user'] = params['USER']
        conn_kwargs['passwd'] = params['PASSWORD']
        conn_kwargs['db'] = params['NAME']
        conn_kwargs['port'] = params['PORT']
        # Drop empty values (items() replaces the Python 2 only iteritems())
        conn_kwargs = dict((k, v) for k, v in conn_kwargs.items() if v)

        return conn_kwargs
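If you would rather not depend on dj_database_url just to split the URI, the same keyword arguments can be derived with the standard library. This is a sketch of an alternative, not what the pipeline above does; it assumes a plain URI like the one in settings.py and does not decode percent-encoded characters in the user name or password.

```python
from urllib.parse import urlsplit


def parse_mysql_url(mysql_url):
    """Split a mysql:// URI into keyword arguments for the connection pool."""
    parts = urlsplit(mysql_url)
    conn_kwargs = {
        "host": parts.hostname,
        "user": parts.username,
        "passwd": parts.password,
        "db": parts.path.lstrip("/"),  # path is '/proxyip', the database name follows the slash
        "port": parts.port,            # None when the URI gives no port
    }
    # Drop empty values so the driver falls back to its own defaults
    return {k: v for k, v in conn_kwargs.items() if v}


kwargs = parse_mysql_url("mysql://root:root@localhost/proxyip")
print(kwargs)
```

With the URI from settings.py this leaves out the port entirely, so the driver's default port applies.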