Cloud-Based Distributed Search Technology

http://www.elasticsearch.org/overview/

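Some Notable Elasticsearch Use Cases Outside China

Github: "Github uses Elasticsearch to search 20TB of data, including 1.3 billion files and 130 billion lines of code." This one needs no introduction; every programmer knows it. Github upgraded its code search in January 2013, switching from solr to elasticsearch. The cluster currently consists of 26 index storage nodes and 8 client nodes (which handle search requests). For details, see the official blog https://github.com/blog/1381...

2013-03-28 19:26 Views (1429) Comments (3)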

[Pinned] Blog Moved to a Standalone Site

From now on, articles about elasticsearch will be published first on the following site: http://www.searchtech.pro/...

2013-03-15 23:07 Views (740) Comments (0)

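Elasticsearch Java Virtual Machine Configuration Explained (Repost)

Introduction: Today it finally happened. Java 6 (Mustang), which was released back in 2006 and is still used in many production environments, has finally reached the end of the line. There is no longer any reason to hold off migrating to Java 7 (Dolphin). This also prompted me to write a post about the subtle differences between configuring Java 6 and Java 7 for ElasticSearch. Elasticsearch ships with a preconfigured Java virtual machine setup. Normally, because these configuration choices are still...

2013-01-03 20:52 Views (1011) Comments (0)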

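Distributed Search Elasticsearch Source Code Analysis, Part 2 ------ Overview of the Indexing Process Source Code

A brief analysis of elasticsearch's indexing logic. This only traces the main flow; some details may be covered in later articles. Suppose you call the es indexing interface through the java api. First a json string is built (represented in es as XContent, an abstraction over the content to be processed). The IndexRequest specifies which index the document goes into, its type, and the document id. If no document id is specified, es will use a UUID utility to automatically...

2012-12-29 13:56 Views (1958) Comments (2)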

When using lucene-based search engines (such as elasticsearch or solr), you may run into an error like the following: Caused by: java.io.EOFException: read past EOF: NIOFSIndexInput(path="/usr/local/sas/escluster/data/cluster/nodes/0/indices/index/5/index/_59ct...

2012-12-14 19:54 Views (1568) Comments (4)

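Distributed Search Elasticsearch Cluster Monitoring Tool: bigdesk

bigdesk is a cluster monitoring tool for elasticsearch. You can use it to view the state of an es cluster, such as cpu and memory usage, indexing and search activity, and the number of http connections. Project git address: https://github.com/lukas-vlcek/bigdesk. Like head, it is a standalone web application, and it is used the same way as head. Installing and running it as a plugin: 1. bin/plugin -install lukas-...

2012-11-21 14:43 Views (2016) Comments (6)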

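Distributed Search Elasticsearch Cluster Management Tool: head

elasticsearch-head is a cluster management tool for elasticsearch. It is a standalone web application written entirely in html5. You can integrate it into es as a plugin, or download the source directly and run it by opening index.html locally. The tool's git address is: https://github.com/Aconex/elasticsearch-head Plugin installation: 1. elasticsearch/bin/plu...

2012-11-17 14:38 Views (2788) Comments (1)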

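Problems Encountered Using Elasticsearch in Production and Their Solutions (Continuously Updated)

1. A node leaves the cluster because of gc. A gc pause stops the jvm, so if gc on a node takes too long, the master fails to ping it 3 times (zen discovery retries a failed ping 3 times by default) and then removes the node from the cluster, which causes the indices to be reallocated. Solutions: (1) Tune gc to reduce gc time. (2) Increase zen discovery's retry count (es parameter: ping_retries) and timeout (es parameter: ping_ti...

2012-11-17 02:23 Views (2764) Comments (1)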
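The gc scenario in the entry above comes down to simple arithmetic, sketched here in Python. This is a simplified model and not Elasticsearch's actual fault-detection code: it assumes a node is removed once a pause outlasts every ping attempt, with each attempt waiting the full timeout. The function names and the 30-second timeout are hypothetical; only the default of 3 retries comes from the text.

```python
def detection_window(retries: int, timeout_seconds: float) -> float:
    """Rough upper bound on how long a node may stay unresponsive
    before every ping attempt has failed (simplified model)."""
    if retries < 1 or timeout_seconds <= 0:
        raise ValueError("retries must be >= 1 and timeout_seconds must be > 0")
    return retries * timeout_seconds


def survives_gc_pause(pause_seconds: float, retries: int, timeout_seconds: float) -> bool:
    """True if a gc pause of the given length fits inside the window."""
    return pause_seconds < detection_window(retries, timeout_seconds)


# 3 retries is the default named in the text; the 30 s timeout is an assumed value.
short_pause_ok = survives_gc_pause(60, retries=3, timeout_seconds=30)
long_pause_ok = survives_gc_pause(120, retries=3, timeout_seconds=30)
# Raising the retry count widens the window, which is solution (2) in the text.
long_pause_ok_after_tuning = survives_gc_pause(120, retries=6, timeout_seconds=30)
```

Under this model, both remedies in the entry pull in the same direction: shorter pauses shrink `pause_seconds`, while more retries or a longer timeout grow the window.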

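Elasticsearch Source Code Analysis, Part 1 ------ Dependency Injection and the Module System with Guice

elasticsearch uses guice, Google's open-source dependency injection framework. The project claims to be 100 times faster than spring. I have not tested its actual performance, but since its code is fairly lean, being faster than spring is quite plausible; whether it is that much faster, I cannot say. First, an introduction to basic guice usage. elasticsearch puts the guice source code directly into its own packages (es bundles the code of many open-source projects directly into its own project, which avoids depending on a pile of jars and also makes the es jar...

2012-09-19 19:50 Views (1817) Comments (0)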

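Distributed Search Elasticsearch Advanced Configuration (2) ------ Thread Pool Settings

An Elasticsearch node has several thread pools, but the important ones are the following four. Index (index): mainly index and delete operations (cached type by default). Search (search): mainly get, count, and search operations (cached type by default). Bulk (bulk): mainly bulk operations on indices (cached type by default). Refresh (refresh): mainly refresh operations (cached type by default). You can set a parameter to...

2012-09-04 20:14 Views (2025) Comments (0)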

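Distributed Search Elasticsearch Java API (8) ------ Content-Based Recommendation with More like this

Content-based recommendation usually takes a given document and recommends to the user other documents similar to it. Lucene's api has an interface for querying document similarity, called MoreLikeThis. Elasticsearch wraps this interface, so with Elasticsearch's More like this query we can implement content-based recommendation very easily. First, a json example of a query request: { "more_like_this" : {...

2012-08-28 20:37 Views (2000) Comments (0)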
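As an illustration of the request shape the entry above starts to show, here is a small Python sketch that builds a `more_like_this` query body. It is not the example from the original post, which is truncated. The parameter names `fields`, `like_text`, `min_term_freq`, and `max_query_terms` follow the query as documented for Elasticsearch releases of that period (later versions changed them); the helper name, field names, and sample text are hypothetical.

```python
import json


def more_like_this_query(fields, like_text, min_term_freq=1, max_query_terms=12):
    """Build a more_like_this query body as a plain dict (illustrative helper)."""
    if not fields:
        raise ValueError("at least one field is required")
    return {
        "more_like_this": {
            "fields": list(fields),          # fields to compare against
            "like_text": like_text,          # text of the document the user is viewing
            "min_term_freq": min_term_freq,  # ignore terms rarer than this in like_text
            "max_query_terms": max_query_terms,
        }
    }


# Hypothetical field names and text, used only to show the resulting json.
query = more_like_this_query(["title", "content"], "distributed search engine")
print(json.dumps(query, indent=2))
```

The resulting json would be sent as the query part of a search request; the documents that come back are the recommendation candidates.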

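Distributed Search Elasticsearch Advanced Configuration (1) ------ Shard Allocation Rule Settings

Shard allocation is the process of distributing index shards across nodes. It takes place at initial cluster startup, during replica allocation and rebalancing, and when nodes are added or removed. Below are some settings related to shard allocation: cluster.routing.allocation.allow_rebalance controls rebalancing of shards according to the state of the machines in the cluster. It can be set to always, indices_primaries_active, or indices_all_active; the default is...

2012-07-29 22:21 Views (2919) Comments (6)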

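Distributed Search Elasticsearch Chinese Word Segmentation Integration

Officially, elasticsearch only provides smartcn as a Chinese word segmentation plugin, and its results are not very good. Fortunately, medcl (one of the earliest people in China to study es) has written two Chinese word segmentation plugins, one for ik and one for mmseg. Below I introduce how to use each; they are much the same. First install the plugin from the command line. Install the ik plugin: plugin -install medcl/elasticsearch-analysis-ik/1.1.0 Download the ik-related configuration...

2012-07-27 12:36 Views (4745) Comments (7)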

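Distributed Search Elasticsearch Java API (7) ------ Synchronizing Data with MongoDB

elasticsearch provides the river module for reading data from a data source into es. es officially provides a sync plugin for couchDB. Because our project uses mongodb, I looked for a mongodb sync plugin and found elasticsearch-river-mongodb on git. The plugin was originally written by aparo. Its initial functionality was to read the tables in mongodb, record the id of the last record, and, at a set interval, repeatedly access m...

2012-06-26 21:25 Views (3396) Comments (12)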

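Distributed Search Elasticsearch Java API (6) ------ Bulk Adding and Deleting Index Documents

elasticsearch supports adding or deleting index documents in bulk. In the java api this is done by constructing a BulkRequestBuilder, adding the batch of index/delete requests to it, and then executing the BulkRequestBuilder. Here is an example: import static org.elasticsearch.common.xcontent.XContentFactory....

2012-05-27 10:08 Views (2800) Comments (15)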

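Distributed Search Elasticsearch Java API (5) ------ Search

elasticsearch queries are run by executing query conditions in json format; in the java api this means constructing QueryBuilder objects. elasticsearch fully supports queryDSL-style queries. The factory class for QueryBuilder is QueryBuilders, and the factory class for filters is FilterBuilders. Here is an example of constructing a QueryBuilder: import static org.elasti...

2012-05-27 09:49 Views (4728) Comments (15)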

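Distributed Search Elasticsearch Java API (4) ------ Deleting Index Data

The delete api lets you delete json documents from a specific index. There are two ways: one is to delete by id, and the other is to delete by a Query condition, in which case all data matching the condition is deleted. 1. Delete by id. The example below deletes the document with id 1 and type tweet from the index named twitter: DeleteResponse response = client.prepareDelete("twitter", "tweet", "1")...

2012-04-14 14:50 Views (2571) Comments (0)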

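Distributed Search Elasticsearch Java API (3) ------ Indexing Data

Indexing data with es is very convenient: just build the data in json format and submit it to es. Here is a java api example: XContentBuilder doc = jsonBuilder() .startObject() .field("title", "this is a title!") .fiel...

2012-04-14 14:21 Views (3717) Comments (12)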

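Distributed Search Elasticsearch Java API (2) ------ put Mapping: Defining Index Field Properties

A mapping defines the names and data types of the fields in an index, much as you define column names and data types when creating a table in a relational database. An es mapping is far more flexible than a database schema, though, because fields can be added dynamically. Usually you do not need to specify a mapping at all, because es automatically determines a field's type from the data format. If you need special properties on certain fields (for example, a different analyzer, whether the field is analyzed, or whether it is stored), you must add the mapping manually. There are two ways to add a mapping...

2012-04-14 13:53 Views (4559) Comments (8)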

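Distributed Search Elasticsearch Configuration Files Explained

The elasticsearch config folder contains two configuration files: elasticsearch.yml and logging.yml. The first is the basic es configuration file and the second is the logging configuration file. es uses log4j for logging, so the settings in logging.yml follow ordinary log4j configuration. The following mainly explains what can be configured in elasticsearch.yml. cluster.name: elasti...

2012-04-02 10:26 Views (5229) Comments (5)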

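Distributed Search Elasticsearch: A Few Concepts Explained

An introduction to a few es concepts: cluster represents a cluster. A cluster has multiple nodes, one of which is the master node, and the master is chosen by election. The distinction between master and other nodes applies inside the cluster. One es concept is decentralization, which literally means there is no central node. This applies from outside the cluster: viewed from outside, an es cluster is logically a single whole, and communicating with any one node is equivalent to communicating with the entire es cluster. shards represents index shards; es can...

2012-04-02 02:16 Views (7914) Comments (7)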

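Distributed Search Elasticsearch: Setting Up Single-Machine and Server Environments

First download the latest elasticsearch package from http://www.elasticsearch.org/download/. At the time of writing the latest version is 0.19.1. The author is very diligent: es is updated frequently and bugs are fixed quickly. After unpacking there are three folders: bin holds the run scripts, config holds the settings files, and lib holds the dependency packages. If you want to install plugins, create an additional plugins folder and put the plugins in it. 1. Single-machine environment: ...

2012-03-31 14:20 Views (4988) Comments (8)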

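Distributed Search Elasticsearch Java API (1) ------ Interacting with the Cluster

This is the first tutorial on the elasticsearch java api; I will gradually write up what I have learned about es. There are two ways to connect to an elasticsearch (es for short) cluster. The first is to create an embedded es node (Node) in your program, making it part of the es cluster, and then communicate with the cluster through this node. The second is to communicate with the es cluster through the TransportClient interface. Node approach: an embedded node is created as follows: ...

2012-03-30 21:01 Views (5257) Comments (9)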

what is elasticsearch?

distributed restful search and analytics

real time data

Data flows into your system all the time. The question is … how quickly can that data become an insight? With Elasticsearch, real-time is the only time.

distributed

Elasticsearch allows you to start small, but will grow with your business. It is built to scale horizontally out of the box. As you need more capacity, just add more nodes, and let the cluster reorganize itself to take advantage of the extra hardware.

multi-tenancy

A cluster can host multiple indices which can be queried independently or as a group. Index aliases allow you to add indexes on the fly, while being transparent to your application.

document oriented

Store complex real world entities in Elasticsearch as structured JSON documents. All fields are indexed by default, and all the indices can be used in a single query, to return results at breathtaking speed.

schema free

Elasticsearch allows you to get started easily. Toss it a JSON document and it will try to detect the data structure, index the data and make it searchable. Later, apply your domain specific knowledge of your data to customize how your data is indexed.
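To make the idea of detecting a data structure from a JSON document concrete, here is a minimal Python sketch. It is not Elasticsearch's detection logic; the function names and the type labels (`string`, `long`, `double`, `boolean`, `object`) are illustrative assumptions, and arrays simply take the type of their first element.

```python
import json


def detect_type(value):
    """Map a parsed JSON value to an illustrative field type label."""
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    return "unknown"  # null, nested arrays, or an empty array


def infer_mapping(document):
    """Walk a JSON object and return a field -> type description."""
    mapping = {}
    for name, value in document.items():
        # Arrays take the type of their first element in this sketch.
        if isinstance(value, list):
            value = value[0] if value else None
        kind = detect_type(value)
        if kind == "object":
            mapping[name] = {"type": "object", "properties": infer_mapping(value)}
        else:
            mapping[name] = {"type": kind}
    return mapping


# Hypothetical document, used only to exercise the sketch.
doc = json.loads(
    '{"user": "kimchy", "age": 30, "score": 1.5, "active": true,'
    ' "tags": ["search", "lucene"], "address": {"city": "Amsterdam"}}'
)
mapping = infer_mapping(doc)
print(json.dumps(mapping, indent=2))
```

A detected structure like this is a starting point; the "later, apply your domain specific knowledge" step corresponds to overriding individual entries by hand.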

per-operation persistence

Elasticsearch puts your data safety first. Document changes are recorded in transaction logs on multiple nodes in the cluster to minimise the chance of any data loss.

built on top of apache lucene™

Apache Lucene is a high performance, full-featured Information Retrieval library, written in Java. Elasticsearch uses Lucene internally to build its state of the art distributed search and analytics capabilities.

real time analytics

Search isn’t just free text search anymore – it’s about exploring your data. Understanding it. Gaining insights that will make your business better or improve your product.

high availability

Elasticsearch clusters are resilient – they will detect and remove failed nodes, and reorganize themselves to ensure that your data is safe and accessible.

full text search

Elasticsearch uses Lucene under the covers to provide the most powerful full text search capabilities available in any open source product. Search comes with multi-language support, a powerful query language, support for geolocation, context aware did-you-mean suggestions, autocomplete and search snippets.

conflict management

Optimistic version control can be used where needed to ensure that data is never lost due to conflicting changes from multiple processes.
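The general technique behind optimistic version control can be sketched with an in-memory store in Python. This illustrates the idea only and is not Elasticsearch code: `VersionedStore`, `VersionConflict`, and the `expected_version` parameter are hypothetical names. Each write states the version it was based on, and the store rejects the write if another process has changed the document since.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        """Return (version, source) for a stored document."""
        return self._docs[doc_id]

    def put(self, doc_id, source, expected_version=None):
        """Store a document; if expected_version is given, it must match."""
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, dict(source))
        return new_version


store = VersionedStore()
store.put("1", {"stock": 10})             # creates version 1

# Two processes read the same version, then both try to write.
version_a, doc_a = store.get("1")
version_b, doc_b = store.get("1")

store.put("1", {"stock": doc_a["stock"] - 1}, expected_version=version_a)  # succeeds
try:
    store.put("1", {"stock": doc_b["stock"] - 1}, expected_version=version_b)
    conflict_detected = False
except VersionConflict:
    conflict_detected = True              # the second writer must re-read and retry
```

The approach is called optimistic because no lock is held between the read and the write; a conflict is detected at write time, and the losing process re-reads and retries instead of silently overwriting the other change.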

restful api

Elasticsearch is API driven. Almost any action can be performed using a simple RESTful API using JSON over HTTP. An API already exists in the language of your choice.
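As a small illustration of JSON over HTTP, the Python sketch below builds (but does not send) a request that would index one document. It assumes a node listening at `http://localhost:9200` and the `/{index}/{type}/{id}` URL layout used by Elasticsearch releases of that period; the helper name and the sample document are hypothetical.

```python
import json
from urllib.request import Request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build a PUT request that would index `document`; nothing is sent here."""
    url = f"{base_url.rstrip('/')}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumed local node address; index, type, and id are example values.
request = build_index_request(
    "http://localhost:9200", "twitter", "tweet", "1", {"title": "this is a title!"}
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it to a running node; building it separately keeps the sketch runnable without a cluster.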

apache 2 open source license

Elasticsearch can be downloaded, used and modified free of charge. It is available under the Apache 2 license, one of the most flexible open source licenses available.