Fetching Elasticsearch content with Python

Using the data added to Elasticsearch in the previous post as an example, this post walks through fetching that content with Python and extracting the useful information from it.

First, let's look at what is currently stored in Elasticsearch:

{
  "took": 88,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "2",
        "_score": 1,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [
            "music"
          ]
        }
      },
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "1",
        "_score": 1,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "3",
        "_score": 1,
        "_source": {
          "first_name": "Douglas",
          "last_name": "Fir",
          "age": 35,
          "about": "I like to build cabinets",
          "interests": [
            "forestry"
          ]
        }
      }
    ]
  }
}
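
If you no longer have the data from the previous post, the three employee documents shown above can be re-created with the same urllib approach used in the rest of this post. This is only a sketch of one way to do it (it assumes Elasticsearch is running on localhost:9200 and that your version still accepts the employee type):

import urllib.request as request
import json

# sample documents copied from the search result above
employees = {
    "1": {"first_name": "John", "last_name": "Smith", "age": 25,
          "about": "I love to go rock climbing", "interests": ["sports", "music"]},
    "2": {"first_name": "Jane", "last_name": "Smith", "age": 32,
          "about": "I like to collect rock albums", "interests": ["music"]},
    "3": {"first_name": "Douglas", "last_name": "Fir", "age": 35,
          "about": "I like to build cabinets", "interests": ["forestry"]},
}

for _id, doc in employees.items():
    req = request.Request(
        url="http://localhost:9200/megacorp/employee/" + _id,
        data=json.dumps(doc).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",  # PUT with an explicit id creates or replaces the document
    )
    print(request.urlopen(req).read().decode())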

1. In Python, we first need the urllib package; the response we read back will then be parsed as JSON with the json module.

import urllib.request as request
import json

2. Next, build a Request for the search URL and open it with urlopen:

if __name__ == '__main__':
    req = request.Request("http://localhost:9200/megacorp/employee/_search")
    resp = request.urlopen(req)

3. Read the resulting resp line by line into a string and parse it as JSON (note that the raw lines are bytes and must be decoded to str first):

jsonstr=""
    for line in resp:
        jsonstr+=line.decode()
    data=json.loads(jsonstr)
    print(data)
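
Side note: on Python 3.6 and later, json.load() can consume the response object directly, so the decode loop above is optional. A minimal equivalent sketch, reusing the same imports:

resp = request.urlopen("http://localhost:9200/megacorp/employee/_search")
data = json.load(resp)  # parses the response body in one call
print(data)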

4. The result we get back mixes metadata with the document contents. Since we only want the contents, we have to drill down through the keys level by level:

    employees = data['hits']['hits']

    for e in employees:
        _source = e['_source']
        full_name = _source['first_name'] + "." + _source['last_name']
        age = _source["age"]
        about = _source["about"]
        interests = _source["interests"]
        print(full_name, 'is', age, ",")
        print(full_name, "info is", about)
        print(full_name, 'likes', interests)

The output is:

Jane.Smith is 32 ,
Jane.Smith info is I like to collect rock albums
Jane.Smith likes ['music']

John.Smith is 25 ,
John.Smith info is I love to go rock climbing
John.Smith likes ['sports', 'music']

Douglas.Fir is 35 ,
Douglas.Fir info is I like to build cabinets
Douglas.Fir likes ['forestry']
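
The aggregation example below repeats the same request/decode/parse steps. If you prefer, they can be folded into a small helper first; this is only a sketch (the name fetch_json is mine, not part of the original code), reusing the imports from step 1:

def fetch_json(url, body=None, method="GET"):
    # send an optional JSON body and return the parsed JSON response
    payload = json.dumps(body).encode() if body is not None else None
    req = request.Request(url=url, data=payload,
                          headers={"Content-Type": "application/json"},
                          method=method)
    with request.urlopen(req) as resp:
        return json.load(resp)

With this helper, steps 2 and 3 above reduce to data = fetch_json("http://localhost:9200/megacorp/employee/_search").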

Content that needs to be aggregated can be fetched as follows:

1. Get the URL

url="http://localhost:9200/megacorp/employee/_search"

2. Write the aggregation query

data = '''
{
    "aggs": {
        "all_interests": {
            "terms": { "field": "interests" },
            "aggs": {
                "avg_age": {
                    "avg": { "field": "age" }
                }
            }
        }
    }
}
'''

3. Set the request headers

headers = {"Content-Type": "application/json"}

4. As before, send the request, read the response, and parse it into JSON:

# Elasticsearch accepts a request body on GET _search; urllib sends the body because data= is set
req = request.Request(url=url, data=data.encode(), headers=headers, method="GET")
resp = request.urlopen(req)
jsonstr = ""
for line in resp:
    jsonstr += line.decode()
rsdata = json.loads(jsonstr)

5. The useful aggregation data is again an array (the buckets), so we again iterate over it to print the results:

agg = rsdata['aggregations']
buckets = agg['all_interests']['buckets']

for b in buckets:
    key = b['key']
    doc_count = b['doc_count']
    avg_age = b['avg_age']['value']
    print('Interest', key, 'has', doc_count, 'people, average age is', avg_age)

The final output:

Interest music has 2 people, average age is 28.5

Interest forestry has 1 people, average age is 35.0

Interest sports has 1 people, average age is 25.0
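
For comparison, the official elasticsearch Python client (installed with pip install elasticsearch, choosing a release that matches your cluster version) hides the HTTP and JSON handling. The following is only a sketch of the same aggregation written against a 6.x/7.x-style client, not code from the original article:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # 6.x/7.x clients default to http://localhost:9200

result = es.search(index="megacorp", body={
    "aggs": {
        "all_interests": {
            "terms": {"field": "interests"},
            "aggs": {"avg_age": {"avg": {"field": "age"}}}
        }
    }
})

for b in result['aggregations']['all_interests']['buckets']:
    print('Interest', b['key'], 'has', b['doc_count'], 'people, average age is', b['avg_age']['value'])
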
Original article: https://www.cnblogs.com/qianshuixianyu/p/9287556.html