Sep 8, 2018


Elasticsearch inverted index (analysis): How an inverted index is created and how it is stored in the segments of a shard (Elasticsearch and Kibana Dev Tools)

Elasticsearch uses a data structure called an "inverted index" for very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears. The inverted index is built from the documents created in Elasticsearch, using a process called analysis (tokenisation followed by filtering).

In this post we will see how an inverted index is created and how it is stored in shards, where it is later used for searching documents.

Document to Searchable Index: When a document is created in Elasticsearch (ELS), it goes through several phases before it becomes eligible for searching. The diagram below gives an overview.
Complete cycle of Inverted index creation from document : An immutable shard's segment 
  • The client issues a command to create a document in ELS. Refer to the linked posts for more details on how to create a document in ELS.
  • Once the document is created in ELS, it goes through the analysis phase, where its text is tokenised and normalised (stemming, synonym detection and stop-word removal). Refer to the linked post for more details about inverted index creation.
  • The inverted index entries for each document are held in a temporary in-memory buffer until the buffer is full. Once the buffer is full, its contents are flushed into a segment.
  • A segment is the smallest logical unit of storage. A shard can be viewed as a collection of segments, each filled with flushed inverted index data.
  • Once a segment has been written, the documents it contains become eligible for searching. Segments are immutable: each one is a fixed collection of inverted index data that is never modified after it is created.
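The cycle above can be sketched as a toy model in Python. This is only an illustration of the buffer-then-segment idea, not how Elasticsearch or Lucene is implemented: the `ToyShard` class, its `buffer_limit` threshold (counted in documents) and the whitespace tokenisation are all assumptions made for the example.

```python
from collections import defaultdict


class ToyShard:
    """Toy model of a shard: an in-memory buffer plus a list of segments."""

    def __init__(self, buffer_limit=2):
        self.buffer_limit = buffer_limit  # hypothetical threshold, in documents
        self.buffer = []                  # (doc_id, text) pairs, not yet searchable
        self.segments = []                # each segment is never modified once added

    def index(self, doc_id, text):
        """Buffer a document and flush when the buffer is full."""
        self.buffer.append((doc_id, text))
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        """Turn the buffered documents into one new segment."""
        if not self.buffer:
            return
        postings = defaultdict(set)
        for doc_id, text in self.buffer:
            for token in text.lower().split():
                postings[token].add(doc_id)
        # Freeze the postings: term -> sorted tuple of document ids.
        segment = {term: tuple(sorted(ids)) for term, ids in postings.items()}
        self.segments.append(segment)
        self.buffer = []

    def search(self, term):
        """Look the term up in every segment; buffered documents are invisible."""
        hits = set()
        for segment in self.segments:
            hits.update(segment.get(term.lower(), ()))
        return sorted(hits)


shard = ToyShard(buffer_limit=2)
shard.index(1, "The thin lifeguard was swimming in the lake")
print(shard.search("lifeguard"))  # [] because document 1 is still in the buffer
shard.index(2, "Swimmers race with the skinny lifeguard in lake")
print(shard.search("lifeguard"))  # [1, 2] now that the buffer was flushed
```

The point of the sketch is that a search only ever reads segments, so a document is invisible until the buffer holding it has been turned into a segment.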

Text analysis for indexing and searching (inverted index creation):
Analysis is the key step in creating the inverted index in a shard. Analysis is not only performed when a document is created; it is also performed on the query text when documents are searched. The diagram below shows how analysis is performed while indexing.
Analysis phase : Indexing of document 
The document text is tokenised and filtered by the analyzer (the analyzer is set up when the index structure is defined). The result of this processing is an inverted index, which is stored in a segment of the shard. Consider the two documents below for analysis.
{
    "name" : "Nikhil",
    "id": "zytham",
    "comment" : "The thin lifeguard was swimming in the lake",
    "date" : "2018-02-12"
}
 
{
    "name" : "Ranjan",
    "id": "nranjan",
    "comment" : "Swimmers race with the skinny lifeguard in lake",
    "date" : "2018-02-12"
}

Let's assume we are interested in the comment field of each document. That gives us two texts to analyse.
1. The thin lifeguard was swimming in the lake
2. Swimmers race with the skinny lifeguard in lake

Tokenisation: To create an inverted index, we first split the comment of each document into separate words (which we call terms, or tokens), create a sorted list of all the unique terms, and then list the documents in which each term appears.
That is, splitting each comment on whitespace gives the following tokens and the documents that contain them. At this stage "The" and "the" are still distinct tokens.
Token        Present in document
Swimmers     2
The          1
in           1,2
lake         1,2
lifeguard    1,2
race         2
skinny       2
swimming     1
the          1,2
thin         1
was          1
with         2
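As a minimal sketch, the tokenisation step for the two comments can be reproduced in a few lines of Python. The `build_inverted_index` helper is a name made up for this example, and plain `str.split()` stands in for a real tokenizer.

```python
from collections import defaultdict

docs = {
    1: "The thin lifeguard was swimming in the lake",
    2: "Swimmers race with the skinny lifeguard in lake",
}


def build_inverted_index(documents):
    """Map each whitespace-separated token to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():  # no lowercasing or filtering yet
            index[token].add(doc_id)
    # Sort the tokens; uppercase letters sort before lowercase ones.
    return {token: sorted(ids) for token, ids in sorted(index.items())}


inverted = build_inverted_index(docs)
for token, doc_ids in inverted.items():
    print(f"{token:<12} {','.join(map(str, doc_ids))}")
```

Because no filtering has been applied yet, a search for "Thin" or "swim" would find nothing in this index, which is what the filtering step addresses.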
Filtering: After tokenisation, filters are applied to the tokens. Typical filters are:
  • Removing stop words (common English words such as "the" and "in")
  • Lowercasing (to make search case-insensitive)
  • Stemming (reducing "swimming" to "swim")
  • Synonyms (treating "thin" and "skinny" as the same term)
Elasticsearch provides a wide range of built-in analyzers which can be used in any index without further configuration. Here is the list of Elasticsearch built-in analyzers.
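The filters listed above can be illustrated with a small Python sketch. It does not use any Elasticsearch API: the `analyze` function, the tiny stop-word set, and the lookup tables standing in for a real stemmer and synonym filter are all assumptions chosen to fit the two example comments.

```python
STOP_WORDS = {"the", "in", "was", "with"}  # tiny illustrative stop-word list
STEMS = {"swimming": "swim"}               # lookup table standing in for a real stemmer
SYNONYMS = {"skinny": "thin"}              # map each synonym to one canonical term


def analyze(text):
    """Tokenise on whitespace, then lowercase, drop stop words, stem and map synonyms."""
    terms = []
    for token in text.split():
        term = token.lower()
        if term in STOP_WORDS:
            continue
        term = STEMS.get(term, term)
        term = SYNONYMS.get(term, term)
        terms.append(term)
    return terms


print(analyze("The thin lifeguard was swimming in the lake"))
# ['thin', 'lifeguard', 'swim', 'lake']
print(analyze("Swimmers race with the skinny lifeguard in lake"))
# ['swimmers', 'race', 'thin', 'lifeguard', 'lake']
```

After filtering, both comments contain the term "thin", so a search for either "thin" or "skinny" could match both documents if the same filters are applied to the query.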

Text analysis while querying documents: When a full-text query (for example, a match query) is executed, the query string is by default passed through the same analyzer that was used while indexing (described above). The diagram below shows the match string "the thin" being passed through the analyzer; the stop word "the" is removed and the search is performed on "thin".
Reference : Index time analysis and search time analysis details
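To make the search-time step concrete, here is a hedged Python sketch in which the same toy `analyze` function is applied both when building the index and when handling a query. The function names, the stop-word list and the "any term matches" rule are assumptions for illustration, not Elasticsearch internals.

```python
from collections import defaultdict

STOP_WORDS = {"the", "in", "was", "with"}  # illustrative stop-word list


def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


docs = {
    1: "The thin lifeguard was swimming in the lake",
    2: "Swimmers race with the skinny lifeguard in lake",
}

# Index time: analyze each comment and record which documents hold each term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in analyze(text):
        index[term].add(doc_id)


def match(query):
    """Search time: analyze the query the same way, then look up each term (OR semantics)."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)


print(analyze("the thin"))  # ['thin']
print(match("the thin"))    # [1]
print(match("The Lake"))    # [1, 2]
```

Using one analyzer on both sides is what lets "The Lake" find documents that were indexed under the lowercase term "lake".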
