I’ve spent the last few months working on a project that depended in large part on Elasticsearch. I’m a fan and so are lots of other people — Elasticsearch is the biggest name in open source search (with Solr as a close second).

What Elasticsearch and similar technologies do, on a basic level, is tokenize your documents into individual terms and put them into a data structure called an “inverted index.” In the inverted index those terms are sorted and mapped to individual documents. The fact that terms are sorted allows for super fast retrieval of search results, even when there are a huge number of documents. That’s a simplification, but it gives you a general idea of the primary advantage Elasticsearch offers: it can perform full text searches really fast.
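
If you want to see that tokenization in action, Elasticsearch exposes it through the _analyze API. A quick sketch (the analyzer and sample text here are just illustrative):

GET /_analyze
{
  "analyzer": "standard",
  "text": "The quick brown fox"
}

The response lists the individual tokens (“the”, “quick”, “brown”, “fox”) that would end up in the inverted index.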

Elasticsearch is popular in part because it can be used as an almost turnkey solution. With a minimum of configuration, ES is performant, scalable, and easy to integrate with any application that’s friendly to REST and JSON.

But if the data you’re indexing has any degree of complexity, using out-of-the-box Elasticsearch might be a missed opportunity.

The search engine really begins to feel powerful when you dig into all the ways you can tweak and optimize your search results. Elasticsearch offers a huge array of configuration options to customize how your search index responds to queries. APIs for monitoring and analytics make it easy to look behind the curtain and see why a query returned a particular result, or how a document is stored in the index. You can do cool things like:

  • Add extra weight to high-priority documents.
  • Make one field of a document more important than another.
  • Create your own custom algorithm to determine how results are ordered.
  • Ignore or give extra weight to specific words in a document.

Unsurprisingly, the documentation for Elasticsearch is vast and can be difficult to parse. As a starting place, here’s a quick guide to concepts that are useful to understand when tuning an Elasticsearch instance:

Relevancy

People use the term “relevance” to mean all kinds of things when they talk about search, but in Elasticsearch-land relevancy has a very specific meaning. When your index is queried, Elasticsearch uses a scoring algorithm (classically TF/IDF; newer versions default to BM25) to calculate a relevance score for each document, which determines both which documents to return and how to order the results. The algorithm considers three different factors:

  • Term frequency: How often the search term appears in the text of the document. More is better.
  • Inverse document frequency: How often each search term shows up in the entire search index. If it shows up all over the place, we aren’t going to attach as much weight to it.
  • Field length: How long the text in the document is. If it’s really short, it’s more likely that a match with the search term is significant.
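
If you’re ever puzzled by why a document scored the way it did, you can ask Elasticsearch to show its work by setting the explain flag on a search. A minimal sketch (the title field in the match is hypothetical):

GET /_search
{
  "explain": true,
  "query": {
    "match": { "title": "goblet of fire" }
  }
}

Each hit in the response will carry an _explanation object that breaks the relevance score down into these component factors.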

Function Score

Depending on the nature of your data, ordering by relevance score alone may not feel natural. For instance, if your application has “like” functionality, you might want to boost items that are more popular. A function score query allows you to calculate a new score by combining the original score with your own metric in a few different ways: multiply, sum, min, max, etc.

GET /_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "name": "Severus Snape" }
      },
      "field_value_factor": { "field": "likes" },
      "boost_mode": "sum"
    }
  }
}
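
With boost_mode set to sum, the final score for each document is its original relevance score plus the raw value of its likes field. If raw values would swamp the relevance score, field_value_factor also accepts factor and modifier params (a log1p modifier, for example) to scale things down.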

Fuzziness

Setting a fuzziness value on a query allows you to match documents based on a measure known as the Levenshtein distance: the minimum number of single-character edits necessary to turn one word into another. So if you set fuzziness to 1, a search for “fast” might also match “east”, “fest”, or “fat”. (Elasticsearch caps fuzziness at an edit distance of 2.) To avoid a fast/east type of scenario, you can use a param called prefix_length to specify how many letters need to match exactly at the beginning of a word.

GET /_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quidditch through the Ages",
        "fuzziness": 3,
        "prefix_length": 2
      }  
    }
  }
}
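
With these settings, a typo like “Quiddich” would still match “Quidditch” (one missing letter is well within the edit distance of 2), but the first two characters of each term have to line up exactly.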

Bool Query

When a simple query doesn’t provide enough granularity, you might need to consider a multi-part bool query. A bool query may include a “must” clause: all documents returned need to fulfill its criteria. It might also include a “should” clause: documents don’t necessarily need to fulfill its criteria, but will be bumped toward the top of the results if they do. I used this strategy to prioritize buildings over rooms in a location search. The other clause types, “filter” and “must_not”, are covered in the bool query documentation.

GET /_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": { "query": "Hedwig" }
        }
      },
      "should": {
        "match": {
          "animal": {
            "query": "owl",
            "boost": 3
          }
        }
      }
    }
  }
}
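
In this example, every result must match “Hedwig” in its title; documents whose animal field also matches “owl” aren’t required, but the boost of 3 pushes them toward the top.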

Field Boosting

If your documents have more than one field (for instance, title and author), there’s a good chance you care about one field more than another. Elasticsearch allows you to “boost” a particular field by a numerical value that indicates how strong a preference you’d like to give it. Boosting is typically applied at query time, as below; index-time boosting exists in older versions but is deprecated.

GET /_search
{
  "query": {
    "multi_match": {
      "query": "Nimbus 3000",
      "fields": ["model", "brand^2"]
    }
  }
}
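
The caret syntax sets the boost inline: here a match in the brand field counts twice as much toward the score as a match in model.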

N-Grams

In linguistic terms, an n-gram refers to a grouping of one or more contiguous words or letters in a piece of text. You might have heard of a bigram, which could refer to a two-word phrase, or two letters that appear together in a word. An n-gram is the same concept, but with any number of tokens (words/letters). Elasticsearch’s edge n-gram tokenizer splits text into words, then creates n-grams anchored to the beginning of each word (so for “jazz” you’d get the n-grams “j”, “ja”, “jaz”, and “jazz”). We care about the edge n-gram tokenizer because it provides a handy means to create “search-as-you-type” functionality; see the edge n-gram tokenizer documentation for more.

PUT /lindas_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer" : "autocomplete",
          "filter": [
            "lowercase",
            "stop",
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [ "letter" ]
        }
      }
    }
  }
}
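
Defining the analyzers is only half the story; you also have to tell Elasticsearch which fields use them. A minimal sketch, assuming a recent (7.x+) cluster and a hypothetical title field: the autocomplete analyzer is applied when documents are indexed, while the simpler autocomplete_search analyzer handles the query text.

PUT /lindas_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "autocomplete",
      "search_analyzer": "autocomplete_search"
    }
  }
}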

Stop Words

Stop words are words that we want to filter out, because they are so common as to be meaningless for search. For instance, if you searched “harry potter and the sorcerers stone” it would be pretty absurd to return every document containing the words “and” and “the”; anything that’s not wizard-related is just noise. When you create a new index, you can specify which stop words should be filtered out — either a default list of common English stop words, or your own custom list.

PUT /lindas_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lindas_custom_analyzer": {
          "tokenizer" : "standard",
          "filter": [
            "lowercase",
            "stop",
            "lindas_custom_stop"
          ]
        }
      },
      "filter": {
        "lindas_custom_stop" : {
          "type': "stop",
          "stopwords": ["muggle", "wizard"]
         }
       }
    }
  }
}
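
To sanity-check the custom analyzer, you can run a phrase through it with the _analyze API and confirm the stop words vanish:

GET /lindas_index/_analyze
{
  "analyzer": "lindas_custom_analyzer",
  "text": "Harry is a wizard"
}

The response should contain only the token “harry”: “is” and “a” are caught by the default English stop list, and “wizard” by lindas_custom_stop.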

Index Time vs Query Time (Performance Implications)

Many of the strategies described above deal with parameters passed at query time, when you’re making a query against the index in real time. However, in some cases it’s possible to analyze data at index time, when the data is initially posted to the index. If you query more often than you index (this tends to be the case), there may be a performance gain to be had by analyzing documents at index time rather than query time.

A good example is the n-gram analyzer described above. It’s possible to get autocomplete functionality at query time by using a match_phrase_prefix query, but creating edge n-grams at index time is likely to be more performant. There’s a caveat here: depending on the size of your index and the number of queries, the performance gain may not be all that significant. If your primary goal is to get something running quickly and simply, query-time analysis is probably a-ok.
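
For comparison, here’s what the query-time approach looks like with match_phrase_prefix, again against a hypothetical title field:

GET /_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "quidditch thr"
    }
  }
}

This matches documents whose title contains “quidditch” followed by a word starting with “thr”, with no special index configuration required.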


And that’s just the tip of the iceberg! These were all capabilities that came in handy for my most recent project, but your application may have different needs entirely. Check out the docs for a full reference.