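ElasticSearch: Jest in Practice

In the earlier article "The Best Access-Control Solution for Java ElasticSearch", I recommended using Jest for CRUD operations against ES. So what exactly is Jest?

Jest is a Java HTTP REST client for Elasticsearch.

Elasticsearch already ships a Java API that is used inside Elasticsearch itself, but it has no client for the Elasticsearch HTTP REST interface. Jest fills that gap.

Judging by its interface, Jest has the following advantages:

1) It provides a RESTful API, which the native ES API does not;

2) If the nodes of an ES cluster run different ES versions, the native ES API runs into problems, whereas Jest does not;

3) It is more secure (security handling can be added at the HTTP layer).

Is there an authoritative, worked example of Jest in use? There is: IBM's technical forum published complete, hands-on Jest code that is concise, easy to follow, and well worth reading.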

Scalable searching with ElasticSearch

Distributed search for Java enterprise applications

Andrew Glover
Published on November 27, 2012

When I was in high school, google was just a noun representing an incredibly large number. Today, we sometimes use google as a verb synonymous with online browsing and searching, and we also use it to refer to the eponymous company. It is common to invoke "Papa Google" as an answer for almost any question: "Just google it!" It follows that application users expect to be able to search the data (files, logs, articles, images, and so on) that an application stores. For software developers, the challenge is to enable search functionality quickly and easily, without losing too much sleep, or cash, to do it.

About this series

The Java development landscape has changed radically since Java technology first emerged. Thanks to mature open source frameworks and reliable for-rent deployment infrastructures, it's now possible to assemble, test, run, and maintain Java applications quickly and inexpensively. In this series, Andrew Glover explores the spectrum of technologies and tools that make this new Java development paradigm possible.

User queries are becoming more complex and personalized over time, and much of the data required to deliver an appropriate response is inherently unstructured. Where once an SQL LIKE clause was good enough, today's usage sometimes calls for sophisticated algorithms. Fortunately, a number of open source and commercial platforms address the need for pluggable search technology, including Lucene, Sphinx, Solr, Amazon's CloudSearch, and Xapian. This installment of Java development 2.0 introduces ElasticSearch, a newer player in the field of open source search platforms.

First, I will show you how to install and configure ElasticSearch quickly. Then, I'll show you how to define a search infrastructure, add searchable content, and search through that content. The examples are based on an existing application (the USA Today Music Reviews feed and API) but could work just as well for an app that you're building. We'll use ElasticSearch along with a couple of other open source tools: cURL is a platform-agnostic command-line tool for working with HTTP URLs, and Jest is a Java library built for ElasticSearch, which we'll use to capture, store, and manipulate our data.

Distributed searching with ElasticSearch

ElasticSearch is one of a number of open source search platforms. Its service is to offer an additional component (a searchable repository) to an application that already has a database and web front-end. ElasticSearch provides the search algorithms and related infrastructure for your application. You simply upload application data into the ElasticSearch datastore and interact with it via RESTful URLs. You can do this either directly or indirectly via a library like cURL or Jest.

ElasticSearch is a downloadable application. Some cloud-based platforms have begun to offer it as a service. In this article, we'll use ElasticSearch as an embeddable tool.

The architecture of ElasticSearch is distinctly different from its predecessors in that it is expressly built with horizontal scaling in mind. Unlike some other search platforms, ElasticSearch is designed to be distributed. This feature dovetails quite nicely with the rise of cloud and big data technologies. ElasticSearch is built on top of one of the more stable open source search engines, Lucene, and it works similarly to a schema-less JSON document datastore. Its singular purpose is to enable text-based searching.

ElasticSearch is easy to install and integrate into your application. You can use a RESTful API to interact with ElasticSearch in the language of your choice. It also comes with a plethora of language adaptors produced by a vibrant and growing open source community.

Ask the oracle

How warm was it at this time last year in Paris? How many people voted in the 2008 U.S. presidential election? Is it a good idea to pop a blister on my toe? These are just a few samples of the types of questions millions of users post to web browsers every day around the world. Not only do we feel less need to keep factual information on hand (in our brains or in books, for example), but we have access to a much vaster and more random supply of it — a veritable google of information, in fact. Naturally, this societal shift puts some new demands on our applications and related search technology.

Installing and configuring ElasticSearch

Because ElasticSearch is built on top of Lucene, everything in it boils down to Java code. To get started, simply download the latest release of ElasticSearch, un-archive it, and fire it up by invoking your target platform's start script. You'll note that ElasticSearch offers an array of configurations, but for the purpose of this article, we will stick with the defaults provided. Rather than enabling nodes to auto-discover one another and create a cluster (an exciting feature, by the way), our examples will be based on a single node that will act as a database of documents.

Show me what I like

As I mentioned earlier, users expect to be able to search for most any kind of data that an application stores and manipulates. So the first thing we need for our working example is some data. To make things interesting, we'll use data from USA Today, which is freely available via the site's API. I'm going to grab a feed of USA Today music reviews and upload it into ElasticSearch. This process is commonly known as indexing.

USA Today's music reviews aren't currently categorized by a particular genre or artist. That poses a challenge if I want to do an associative search; that is, if I want to find positive reviews for artists similar to other artists whom I like. As an example, I might search for blues artists who sound like Buddy Guy.

If you want to follow along with me as I pull data from USA Today, you will need to register for a free developer key on the site. Once you've done that, you can access the API via RESTful URLs. Listing 1 shows a sample call to obtain a single music review (note that you'll have to use your own developer key in your code):

Listing 1. An API call to the USA Today music review service

curl -XGET 'http://api.usatoday.com/open/reviews/music/recent?count=1&api_key=your_key'

Listing 2 shows what the corresponding JSON response looks like:

Listing 2. Response from the service

{"APIParameters":
 {"Count":"1","MinimumRating":"","MaximumRating":"","Artist":"",
   "ArtistSearch":true,"Album":"",
   "AlbumSearch":true,"Year":""},
  "Found":1,"Albums":null,"Artists":null,
  "MusicReviews":[
      {"AlbumName":"Away From the World",
       "ArtistName":"Dave Matthews Band",
       "ReleaseDate":"",
       "Rating":"3",
       "DownloadSongs":"Mercy, Snow Outside, Drunken Soldier",
       "ConsiderSongs":"",
       "Reviewer":"Brian Mansfield",
       "ReviewDate":"9/11/2012 10:11:00 AM",
       "Brief":"...",
       "WebUrl":"..."
       }
  ]
}

Because I'm searching for music I'm likely to enjoy, I want to capture at least three parts of the review: the brief (which is the heart of the music review), the rating, and the WebUrl. This lets me see personal reviews, numerical ratings, and a URL where I can check out the music for myself.

Setting up the ElasticSearch index

ElasticSearch uses a RESTful web interface for interaction. I'm going to use the command-line tool cURL to access that interface. Before putting any documents into ElasticSearch, I need to create an index, which is something similar to a database table. I'll store searchable documents (in this case music reviews) in the ElasticSearch index. Listing 3 demonstrates how easy it is to create an ElasticSearch index using cURL. (By default, ElasticSearch captures and indexes every document you give it.)

Listing 3. Creating an ElasticSearch index using cURL

curl -XPUT 'http://localhost:9200/music_reviews/'

Next, I can specify mappings for particular attributes of a document. By default, ElasticSearch infers attribute types automatically. For instance, if the document contains a value like name:'test', ElasticSearch will infer that the name attribute is a String. Or if a document has the attribute score:1, ElasticSearch will rightly guess that score is a number.

Occasionally, ElasticSearch does guess incorrectly — for instance, for a date formatted as a String. In these cases, you can instruct ElasticSearch about how to map a particular value. In Listing 4, I instruct ElasticSearch to treat a music review's reviewDate as a Date rather than a String:

Listing 4. Mapping in the music_reviews index

curl -XPUT 'http://localhost:9200/music_reviews/_mapping' -d \
  '{"review": { "properties": {
     "reviewDate":
      {"type":"date", "format":"MM/dd/YY HH:mm:ss aaa", "store":"yes"} } } }'

Listing 4 demonstrates how easy it is to interact with ElasticSearch's RESTful API via cURL.
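A quick way to sanity-check a date pattern against the sample ReviewDate value in Listing 2 is to parse that value locally before relying on a mapping. The sketch below uses java.time from the JDK rather than the date handling inside ElasticSearch, and the pattern M/d/yyyy h:mm:ss a is an assumption chosen to fit that one sample. It is not the format string from Listing 4, and pattern letters can differ between date libraries. The class name ReviewDateCheck is hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class ReviewDateCheck {
    // Assumed pattern: 1-2 digit month and day, 4-digit year, 12-hour clock, AM/PM marker.
    // It was picked to match "9/11/2012 10:11:00 AM" from Listing 2.
    static final DateTimeFormatter REVIEW_DATE =
        DateTimeFormatter.ofPattern("M/d/yyyy h:mm:ss a", Locale.US);

    // Parses a raw ReviewDate string; throws DateTimeParseException if it does not fit.
    static LocalDateTime parse(String raw) {
        return LocalDateTime.parse(raw, REVIEW_DATE);
    }

    public static void main(String[] args) {
        System.out.println(parse("9/11/2012 10:11:00 AM"));
    }
}
```

If a value fails to parse here, it is worth re-checking the pattern before blaming the index.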

Capturing data as POJOs

We've defined an ElasticSearch index and mapped a particular attribute, so now it's time to insert some music reviews. For this, I'm going to use a Java API dubbed Jest that handles Java object serialization quite nicely. With Jest, you can take normal Java objects and index them into ElasticSearch. Then, using ElasticSearch's search API, you can convert the results of a search back into Java objects. Automatic POJO serialization can be handy in that you don't have to deal with the underlying JSON document structure that ElasticSearch requires.

I'll create a simple Java object that represents a music review, then I'll index it using Jest. Because I'm ultimately receiving a JSON representation of a music review from USA Today's API, I'm going to code up a factory method that will convert a JSON document into my object. I could easily omit the entire POJO step (and just index the straight JSON from USA Today) but later I'd like to show you how to automatically convert a search result into a POJO.

Listing 5. A simple POJO representing a music review result

import io.searchbox.annotations.JestId;
import net.sf.json.JSONObject;

public class MusicReview {
  private String albumName;
  private String artistName;
  private String rating;
  private String brief;
  private String reviewDate;
  private String url;

  @JestId
  private Long id;

  public static MusicReview fromJSON(JSONObject json) {
   return new MusicReview(
    json.getString("Id"),
    json.getString("AlbumName"),
    json.getString("ArtistName"),
    json.getString("Rating"),
    json.getString("Brief"),
    json.getString("ReviewDate"),
    json.getString("WebUrl"));
  }

  public MusicReview(String id, String albumName, String artistName, String rating, 
    String brief,
   String reviewDate, String url) {
    this.id = Long.valueOf(id);
    this.albumName = albumName;
    this.artistName = artistName;
    this.rating = rating;
    this.brief = brief;
    this.reviewDate = reviewDate;
    this.url = url;
  }

  //...setters and getters omitted

}

Note that in ElasticSearch each indexed document has an id, which you can think of as a primary key. You can always get a particular document by its corresponding id. So in the Jest API, I associate the ElasticSearch document id with my object using the @JestId annotation, as shown in Listing 5. In this case, I've used the ID provided by the USA Today API.

The JestClient

Next, I will invoke the USA Today API to return a collection of reviews, convert those JSON documents into MusicReview objects, and use Jest to index each one into my locally running ElasticSearch application.

As you see from Jest's API call in Listing 6, ElasticSearch is designed to work in a cluster. In this case, we have only one server node to connect to, but it's worth noting that a connection can take a list of server addresses.

Listing 6. Creating a connection to an ElasticSearch instance with Jest

ClientConfig clientConfig = new ClientConfig();
Set<String> servers = new LinkedHashSet<String>();
servers.add("http://localhost:9200");
clientConfig.getServerProperties().put(ClientConstants.SERVER_LIST, servers);

Once I have a ClientConfig object fully initialized, I can create an instance of a JestClient like what you see in Listing 7:

Listing 7. Creating a client object

JestClientFactory factory = new JestClientFactory();
factory.setClientConfig(clientConfig);
JestClient client = factory.getObject();

With the connection pointing to my locally running ElasticSearch instance, I'm ready to grab some (let's say 300) music reviews from the USA Today service and index them.

Listing 8. Capture and index music reviews in a local ElasticSearch instance

URL url = 
  new URL("http://api.usatoday.com/open/reviews/music/recent?count=300&api_key=_key_");
String jsonTxt = IOUtils.toString(url.openConnection().getInputStream());
JSONObject json = (JSONObject) JSONSerializer.toJSON(jsonTxt);
JSONArray reviews = (JSONArray) json.getJSONArray("MusicReviews");
for (Object jsonReview : reviews) {
  MusicReview review = MusicReview.fromJSON((JSONObject) jsonReview);
  client.execute(new Index.Builder(review).index("music_reviews")
   .type("review").build());
}

Notice the final line of the for loop in Listing 8. This code takes my MusicReview POJO and indexes it into ElasticSearch; that is, it places the POJO in a music_reviews index as a review type. ElasticSearch will then take this document and work some serious magic on it, so that we can search aspects of it later.
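Because Jest is an HTTP client, an index action like the one in Listing 8 ultimately becomes a REST call of the same kind as the cURL listings. The sketch below builds such a request with the JDK's java.net.http API (Java 11 or later) without sending it. It is not Jest's internal code: the class name IndexRequestSketch, the document id 42, and the JSON body are made up for illustration, and the URL follows the index/type/id layout used elsewhere in this article.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class IndexRequestSketch {
    // Builds (but does not send) a PUT request that would index one JSON document
    // under the given index, type, and id.
    static HttpRequest indexRequest(String baseUrl, String index, String type,
                                    long id, String jsonBody) {
        URI uri = URI.create(baseUrl + "/" + index + "/" + type + "/" + id);
        return HttpRequest.newBuilder(uri)
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(jsonBody))
            .build();
    }

    public static void main(String[] args) {
        // Hypothetical id and body, used only to show the shape of the call.
        HttpRequest request = indexRequest("http://localhost:9200", "music_reviews", "review",
            42L, "{\"artistName\":\"Dave Matthews Band\",\"rating\":\"3\"}");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Seeing the request in this form also makes it clearer where HTTP-level concerns such as authentication headers could be attached.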

Searching unstructured data

The power of ElasticSearch is that it enables you to search unstructured data. An example of unstructured data is the brief part of a music review: a paragraph of text describing some music. That brief has a lot of data in it, but what we need are keywords that could indicate affinity. It's those keyword associations that help a search engine return just the results that a user is looking for. In this case, I'm looking for music that I might be interested in hearing, based on music that I already like. So I'll search for music that has been described using the same keywords that were used to describe some of my favorite music.

So for instance, I might search the brief attribute of my indexed collection for the word jazz (note that this search is case-insensitive). I have to do a few things before I can run a search with Jest. First, I have to create a term query via the QueryBuilder type. I then add that to a Search, which points to an index and type. Also note that Jest takes the JSON response from ElasticSearch and turns it into a collection of MusicReviews.

Listing 9. Searching with Jest

QueryBuilder queryBuilder = QueryBuilders.termQuery("brief", "jazz");
Search search = new Search(queryBuilder);
search.addIndex("music_reviews");
search.addType("review");
JestResult result = client.execute(search);

List<MusicReview> reviewList = result.getSourceAsObjectList(MusicReview.class);
for(MusicReview review: reviewList){
  System.out.println("search result is " + review);
}

The search operation in Listing 9 should be very familiar to a Java developer. Working with POJOs via Jest is an easy process. Note, however, that ElasticSearch is entirely RESTfully driven, so we could easily do the same search using cURL, like so:

Listing 10. Searching with cURL

curl -XGET 'http://localhost:9200/music_reviews/_search?pretty=true' -d \
 ' {"explain": true, "query" : { "term" : { "brief" : "jazz" } }}'

JSON can be hard to read, so you can always pass in the pretty=true option to any search request. In Listing 10, I've also specified that ElasticSearch return an explain plan for how the search was executed. I did this by adding the "explain":true phrase to the JSON document when I passed it in.

Explain plan?

An explain plan simply explains what ElasticSearch did under-the-hood to find your document. This information can be helpful if you want to fine-tune some queries or specify particular index options. Many RDBMSs offer this feature as well.

My searches in Listings 9 and 10 yielded 10 results (your results will vary depending on how many documents you have indexed). So this simple search pared 300 reviews down to just 10 that might be of interest to me. Note, though, that the ratings range from 3.0 to 4.0. A more complex query should get me even closer to the top-rated music that I want to hear.
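It can help to see why the lowercase term jazz matches a brief that was written as "Jazz". With default settings, text is lowercased and split into tokens when it is indexed, while a term query looks up the term exactly as given. The toy inverted index below imitates that behavior in plain Java. It is a simplified illustration, not ElasticSearch's or Lucene's implementation, and the class name ToyInvertedIndex and the sample briefs are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ToyInvertedIndex {
    // Maps each term to the ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Analysis step: lowercase the text, then split on anything that is not a letter or digit.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Indexing runs the analysis step and records each resulting token.
    void index(int docId, String brief) {
        for (String token : analyze(brief)) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Term lookup: the term is used exactly as given, with no analysis applied.
    Set<Integer> term(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.index(1, "Smooth Jazz with a modern edge");
        idx.index(2, "Straight-ahead rock");
        System.out.println(idx.term("jazz")); // prints [1]
        System.out.println(idx.term("Jazz")); // prints []
    }
}
```

In this model the match is case-insensitive only with respect to the stored text; a capitalized search term finds nothing, which is worth keeping in mind when passing user input into a term query.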

Adding ranges and filters

In Listing 11, I've imported some handy static methods that make building complex queries a bit easier. Ultimately what I'm doing is fashioning a query that finds any documents whose brief contains the word jazz and whose rating is between 3.5 and 4.0. This will trim down the earlier search result and increase my chances of finding quality music that suits my preference for jazz.

Listing 11. Searching with ranges and filters using Jest

import static org.elasticsearch.index.query.FilterBuilders.rangeFilter;
import static org.elasticsearch.index.query.QueryBuilders.filteredQuery;
import static org.elasticsearch.index.query.QueryBuilders.termQuery;

//later in the code

QueryBuilder queryBuilder = filteredQuery(termQuery("brief", "jazz"), 
  rangeFilter("rating").from(3.5).to(4.0));

Search search = new Search(queryBuilder);
search.addIndex("music_reviews");
search.addType("review");
JestResult result = client.execute(search);

List<MusicReview> reviewList = result.getSourceAsObjectList(MusicReview.class);
for(MusicReview review: reviewList){
  System.out.println("search result is " + review);
}

Remember that I can do the same exact search using cURL:

Listing 12. Searching with ranges and filters using cURL

curl -XGET 'http://localhost:9200/music_reviews/_search?pretty=true' -d \
  '{"query": { "filtered" : { "filter" : {  "range" : { "rating" : 
     {"from": 3.5, "to":4.0} } },
     "query" : { "term" : { "brief" : "jazz" } } } }}'

This most recent search further trimmed my results and left me with some promising albums to listen to. But what if I want to get even more specific? Earlier, I mentioned that I'm a fan of Buddy Guy, who is a blues guitarist. So let's see what happens if I add his name to my search as a wildcard, shown in Listing 13:

Listing 13. Searching with wild cards

import static org.elasticsearch.index.query.QueryBuilders.wildcardQuery;
//later in the code
QueryBuilder queryBuilder = filteredQuery(wildcardQuery("brief", "buddy*"), 
  rangeFilter("rating").from(3.5).to(4.0));
//see Listing 11 for the template search and response

In Listing 13, I'm looking for any review whose rating is between 3.5 and 4.0 and whose brief contains a word starting with buddy. I might get one or two reviews that reference Buddy Guy, in which case I'd be almost certain to like what I heard. On the other hand, I could get some more random documents that contain the word buddy; that's the downside of a generic wildcard search.

In this case, my wildcard paid off: I retrieved two documents whose reviews indicate blues-style music influenced by my favorite guitarist. Not bad for a day's work!
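The combined effect of the range filter and the wildcard can be pictured as two plain predicates applied to each review: the rating must fall inside an inclusive range, and at least one lowercased word of the brief must start with the given prefix. The sketch below expresses that over an in-memory list. It only models the semantics and says nothing about how ElasticSearch evaluates the query; the class name FilteredSearchSketch, the placeholder artists, and the sample briefs are all invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class FilteredSearchSketch {
    // Minimal stand-in for a review; not the MusicReview class from Listing 5.
    static final class Review {
        final String artist;
        final double rating;
        final String brief;

        Review(String artist, double rating, String brief) {
            this.artist = artist;
            this.rating = rating;
            this.brief = brief;
        }
    }

    // Mirrors a trailing-wildcard match such as buddy*: some token starts with the prefix.
    static boolean anyTokenStartsWith(String brief, String prefix) {
        for (String token : brief.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Keeps reviews whose rating lies in [from, to] and whose brief matches the prefix.
    static List<String> search(List<Review> reviews, String prefix, double from, double to) {
        List<String> hits = new ArrayList<>();
        for (Review r : reviews) {
            boolean inRange = r.rating >= from && r.rating <= to;
            if (inRange && anyTokenStartsWith(r.brief, prefix)) {
                hits.add(r.artist);
            }
        }
        return hits;
    }

    // Invented sample data with placeholder artist names.
    static List<Review> sampleReviews() {
        List<Review> reviews = new ArrayList<>();
        reviews.add(new Review("Artist A", 4.0, "Blues guitar in the spirit of Buddy Guy"));
        reviews.add(new Review("Artist B", 3.0, "A buddy-cop soundtrack"));
        reviews.add(new Review("Artist C", 3.5, "Polished jazz trio"));
        return reviews;
    }

    public static void main(String[] args) {
        System.out.println(search(sampleReviews(), "buddy", 3.5, 4.0)); // prints [Artist A]
    }
}
```

Artist B shows the downside mentioned above: its brief matches the wildcard for an unrelated reason, and only the rating range keeps it out of the results.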

Working with token analyzers

For this article, I've kept things simple with respect to ElasticSearch's configurations; we haven't configured a cluster or really altered any of its default indexing strategies. Much greater sophistication is possible with ElasticSearch than I have shown. For example, when defining an index mapping, it's possible to configure how a particular field is indexed. Various tokenizer strategies will help you build very powerful and complex searches if you need to. In the case of the USA Today brief element, for instance, we could have specified a snowball analyzer or a keyword one. Snowball is a token algorithm that converts words to their base, thus expanding the field of the search. (Reducing the word jazzy to jazz, for instance.) Working with different analyzers is an excellent way to fine-tune your application's search capability. And using a search platform like ElasticSearch puts those options at your fingertips, without requiring you to roll your own.

In conclusion

Search is no longer optional: it's an expected feature of most any application that consumes, produces, or stores data. Not everyone wants to be a search technology specialist, however, especially given the range of sophisticated algorithms underlying today's complex searches. Knowing about existing, open source search platforms could save you a lot of time and money and allow you to spend your time fine-tuning your software's main functionality.

In this article, I introduced ElasticSearch, a distributed search platform that is easy to get started with and vastly extendable. ElasticSearch's sophistication and ease-of-use are impressive, and its support for horizontal scalability offers a world of options should your data requirements need to scale. (Whose don't, these days?)