elasticsearch

1. Install elasticsearch-rtf (a Chinese distribution of Elasticsearch that bundles plugins useful for Chinese text, making it convenient for beginners to learn and test).

  Search for elasticsearch-rtf on https://github.com/, download the latest release, then open cmd and run elasticsearch.bat in the bin folder.

2. Open 127.0.0.1:9200 in a browser; if it returns the following, the installation succeeded:

------------------------------------

{
  "name" : "ewadZmQ",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "-BfaRD5ETwuGxlEEPqJNqQ",
  "version" : {
    "number" : "5.1.1",
    "build_hash" : "5395e21",
    "build_date" : "2016-12-06T12:36:15.409Z",
    "build_snapshot" : false,
    "lucene_version" : "6.3.0"
  },
  "tagline" : "You Know, for Search"
}
---------------------------------------
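
If you prefer checking from a script instead of a browser, a minimal Python sketch (standard library only) that fetches the same status document could look like this:

import json
import urllib.request

# Fetch the node status document (the same JSON shown in the browser)
with urllib.request.urlopen("http://127.0.0.1:9200/") as resp:
    info = json.load(resp)

print(info["version"]["number"])  # e.g. "5.1.1"
print(info["cluster_name"])       # "elasticsearch" by default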
3. Install the head plugin

 1) Search for elasticsearch-head on GitHub and download the first result.

 2) Install Node.js (http://nodejs.cn/download/). After installation, run node -v; if it prints a version such as v6.10.3, Node.js is installed. Then run npm -v; if it prints a version such as 3.10.10, npm works as well (npm ships with Node.js).

 3) Install cnpm (http://npm.taobao.org/) by running in cmd: npm install -g cnpm --registry=https://registry.npm.taobao.org

 4) In cmd, change to the elasticsearch-head directory and run cnpm install; when it finishes, run cnpm run start.

 5) Open http://localhost:9100 in a browser; the head UI should load.

The head page reports that it cannot connect to http://127.0.0.1:9200/. Why? By default Elasticsearch does not allow cross-origin (CORS) requests from third-party front ends such as head, so the connection is refused.

 Fix: append the following settings to the end of elasticsearch.yml in the config folder of elasticsearch-rtf:

http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-methods: OPTIONS, HEAD, GET, POST, PUT, DELETE
http.cors.allow-headers: "X-Requested-With, Content-Type, Content-Length, X-User"

Restart Elasticsearch, and head should now connect successfully.
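
To verify that the CORS settings took effect, one option is to send a request that carries an Origin header, the way the head UI does, and inspect the response headers (a minimal sketch, standard library only; with http.cors enabled, Elasticsearch should answer with an Access-Control-Allow-Origin header):

import urllib.request

# Pretend to be the head UI running on localhost:9100
req = urllib.request.Request(
    "http://127.0.0.1:9200/",
    headers={"Origin": "http://localhost:9100"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.getheader("Access-Control-Allow-Origin"))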

6) Download Kibana 5.1.1 (matching the Elasticsearch version, 5.1.1) from https://www.elastic.co/downloads/past-releases. In cmd, run kibana.bat in the bin folder, then open http://127.0.0.1:5601/; if the page loads, Kibana is installed.

7) Write the Scrapy-crawled data into Elasticsearch:
  a. First, in cmd, activate the virtual environment and install elasticsearch-dsl (a high-level Python interface for working with Elasticsearch):
pip install elasticsearch-dsl
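
A quick way to confirm the package is installed and the local node is reachable is to create a connection and ping it (a minimal sketch; the host matches the default used in es_types.py below):

from elasticsearch_dsl.connections import connections

client = connections.create_connection(hosts=['localhost'])  # register the default connection
print(client.ping())  # True if the node at localhost:9200 answers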
  b. Create a models folder, add an es_types.py file in it that defines the field mappings, and run the file once to create the index:
from elasticsearch_dsl import DocType, Date, Keyword, Text, Integer
from elasticsearch_dsl.connections import connections

# Register a default connection to the local Elasticsearch node
connections.create_connection(hosts=['localhost'])


class ArticleType(DocType):
    # Mapping for crawled articles; ik_max_word is the Chinese analyzer
    # provided by the ik plugin bundled with elasticsearch-rtf
    title = Text(analyzer="ik_max_word")
    create_date = Date()
    praise_nums = Integer()
    fav_nums = Integer()
    comment_nums = Integer()
    tags = Text(analyzer="ik_max_word")
    front_image_url = Keyword()
    url_object_id = Keyword()
    front_image_path = Keyword()
    url = Keyword()
    content = Text(analyzer="ik_max_word")

    class Meta:
        index = 'jobbole'
        doc_type = 'article'


if __name__ == '__main__':
    ArticleType.init()  # create the index and put the mapping
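
Once the index exists, the same ArticleType class can also be used to query it; a small usage sketch (the "python" query string is only an illustrative example):

from models.es_types import ArticleType

# Full-text search against the jobbole index defined above
s = ArticleType.search().query("match", title="python")
for hit in s[:5]:
    print(hit.meta.id, hit.title)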

  c. Define a pipeline class in pipelines.py:
from w3lib.html import remove_tags          # strips HTML tags from text
from models.es_types import ArticleType


class ElasticsearchPipeline(object):
    # Write each item into Elasticsearch
    def process_item(self, item, spider):
        # Convert the item into an Elasticsearch document
        article = ArticleType()
        article.title = item['title']
        article.create_date = item['create_date']
        article.content = remove_tags(item['content'])  # strip HTML tags from the content
        article.front_image_url = item['front_image_url']
        article.front_image_path = item['front_image_path']
        article.praise_nums = item['praise_nums']
        article.fav_nums = item['fav_nums']
        article.comment_nums = item['comment_nums']
        article.url = item['url']
        article.tags = item['tags']
        article.meta.id = item['url_object_id']

        article.save()  # index the document
        return item

  d. Register the ElasticsearchPipeline class in settings.py:

ITEM_PIPELINES = {'spider.pipelines.ElasticsearchPipeline': 1}
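
The integer in ITEM_PIPELINES is the pipeline's priority (conventionally 0-1000; lower numbers run earlier). If the project also writes to MySQL, both pipelines can be registered side by side; the MysqlTwistedPipeline name below is only a placeholder for whatever MySQL pipeline the project actually defines:

ITEM_PIPELINES = {
    'spider.pipelines.MysqlTwistedPipeline': 1,   # hypothetical MySQL pipeline
    'spider.pipelines.ElasticsearchPipeline': 2,
}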

  e. Run the Scrapy spider; if the records show up under the Data Browser tab at http://127.0.0.1:9100/, the setup works.

 Refinement: so that different spiders can reuse the same pipeline class, move the Elasticsearch-writing logic into the corresponding item class in items.py:

import scrapy
from scrapy.loader.processors import MapCompose, Join
from w3lib.html import remove_tags
from models.es_types import ArticleType
# date_convert, number_convert, remove_comment_tags, returnValue and get_md5
# are helper functions defined elsewhere in the project


class JobboleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field(input_processor=MapCompose(date_convert))
    praise_nums = scrapy.Field(input_processor=MapCompose(number_convert))
    fav_nums = scrapy.Field(input_processor=MapCompose(number_convert))
    comment_nums = scrapy.Field(input_processor=MapCompose(number_convert))
    tags = scrapy.Field(input_processor=MapCompose(remove_comment_tags), output_processor=Join(','))
    front_image_url = scrapy.Field(output_processor=MapCompose(returnValue))
    url_object_id = scrapy.Field(input_processor=MapCompose(get_md5))
    front_image_path = scrapy.Field()
    url = scrapy.Field()
    content = scrapy.Field()

    def get_insert_mysql(self):
        # Build the SQL statement and parameters for writing into MySQL
        insert_sql = """
                    insert into jobbole(front_image_url,front_image_path,title,url,create_date,url_object_id,fav_nums,comment_nums,praise_nums,tags,content)
                    values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
                    ON DUPLICATE KEY UPDATE fav_nums=VALUES(fav_nums),comment_nums=VALUES(comment_nums),praise_nums=VALUES(praise_nums)
                    """
        params = (self['front_image_url'][0], self['front_image_path'], self['title'], self['url'], self['create_date'],
                  self['url_object_id'], self['fav_nums'], self['comment_nums'], self['praise_nums'], self['tags'],
                  self['content'])
        return insert_sql, params

    def save_to_elasticsearch(self):
        # Write this item into Elasticsearch
        article = ArticleType()
        article.title = self['title']
        article.create_date = self['create_date']
        article.content = remove_tags(self['content'])  # strip HTML tags from the content
        article.front_image_url = self['front_image_url']
        if 'front_image_path' in self:
            article.front_image_path = self['front_image_path']
        article.praise_nums = self['praise_nums']
        article.fav_nums = self['fav_nums']
        article.comment_nums = self['comment_nums']
        article.url = self['url']
        article.tags = self['tags']
        article.meta.id = self['url_object_id']

        article.save()  # index the document
        return

Then, in pipelines.py, the pipeline class only needs to call save_to_elasticsearch():

class ElasticsearchPipeline(object):
    # Write each item into Elasticsearch
    def process_item(self, item, spider):
        # The item converts and saves itself
        item.save_to_elasticsearch()
        return item
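
After a crawl finishes, a quick count from a Python shell can confirm that documents actually reached the index (a minimal sketch reusing the ArticleType class from es_types.py):

from models.es_types import ArticleType

# Number of documents currently stored in the jobbole index
print(ArticleType.search().count())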
 
Original article: https://www.cnblogs.com/jp-mao/p/6933480.html