Jan 13, 2018

Metric and Bucketing aggregation: aggregation queries in elastic search

In previous posts Boolean compound queriesFull Text QueryQuery term and Source filtering, Elasticsearch in filter context we performed search query and filter context and retrieved documents and specific fields of our interest. For analytics purpose we want result oriented response instead of verbose document. The man agenda of this post is to use Elasticsearch aggregation to execute analytical queries and get appropriate concise response.

Prerequisite :-
1. Elasticsearch instance is up and running.
2. Prepare dataset for analytical queries.
  • Open https://www.json-generator.com/ in browser.
  • Copy and paste below sample template to generate compact form of JSON & save it in file "employee.json"
  • [
      '{{repeat(1000, 1000)}}',
      {
        name: '{{firstName()}} {{surname()}}',
        age: '{{integer(18, 75)}}',
        salary:'{{integer(10000,80000)}}',
        gender: '{{gender()}}',
        email: '{{email()}}',
        phone: '+1 {{phone()}}',
        street: '{{integer(100, 999)}} {{street()}}',
        city: '{{city()}}',
        state: '{{state()}}, {{integer(100, 10000)}}'
      }
    ]
3. Clean and format JSON file : (Using sublime text regex model)
  --> Remove [ and ] from file.
  --> Using regex replace method replace "},{" with "}\n{"
  --> Append index info before each payload:  Find-Replace {"name" with {"index" : {}}\n{"name"
If you have processed file then Download cleaned and processed file.
4. Update json fie with new line at end of file.
5. Using "_bulk" API update all document with indices employee.
curl -H "Content-Type: application/x-ndjson" -XPOST 'localhost:9200/employees/personal/_bulk?pretty&refresh' --data-binary @"employees.json"
6. Validate 1000 documents are in place in index employees with type name personal
➜  Desktop curl -XGET 'localhost:9200/_cat/indices?v&pretty'                                                                                           
health status index     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   employee  LU8xvoyMRwi-0o5K2JCyMg   5   1        100            0     86.1kb         86.1kb
yellow open   employees 3MQomR4CSRCLYYzywkZ9vg   5   1       1000            0     94.1kb         94.1kb
yellow open   customers KSbOq8eySwGgvJdH7VfQWQ   5   1       1000            0    485.1kb        485.1kb
yellow open   products  Lf8I7-H1QPeU6DrDyHWb1A   5   1          8            0     17.5kb         17.5kb

Metrics aggregations

Find average salary of all employees:  Below query is an example of aggregation query. It finds average salary of all employee. Here "aggs" indicates that query is of type aggregation, "avg_age" is filed name where result is assigned and "avg" is aggregate functions to compute average. Finally, field name indicates that - average is computed on which field. Since aggregation query is not interested in document so size= 0.
➜  Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d'
{
   "size" : 0,
    "aggs" : {
        "avg_age" : {
             "avg" : {
                 "field" : "salary"
             }
         }
    }
}
' -H 'Content-Type: application/json'
{
  "took" : 17,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_age" : {
      "value" : 44966.073
    }
  }
}

Aggregate query(metric aggregation)with search query
: Below query finds average salary for all employee - either she is female employee or  whose state value contains "Mississippi".
➜  Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d'
{
    "size" : 0,
     "query" : {
          "bool" : {
                "should": [
                    { "match": { "state": "Mississippi" } },
                    { "match": { "gender": "female" } }
                  ]
           }
     },
    "aggs" : {
        "avg_age" : {
             "avg" : {
                 "field" : "salary"
             }
         }
    }
}
' -H 'Content-Type: application/json'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 483,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_age" : {
      "value" : 44222.351966873706
    }
  }
}

Elasticsearch stats aggregation
-  Wide range of statistics in one query. Below query uses "stats" aggregation to find various statistics on salary fields. it gives min, max, average and sum of salary in one go.
➜  Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d'
{
    "size" : 0,
    "aggs" : {
        "age_stats" : {
             "stats" : {
                 "field" : "salary"
             }
         }
    }
}
' -H 'Content-Type: application/json'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_stats" : {
      "count" : 1000,
      "min" : 10026.0,
      "max" : 79968.0,
      "avg" : 44966.073,
      "sum" : 4.4966073E7
    }
  }
}

Number of Unique values of given filed (cardinality of field) : Below query find number of unique age value in entire document. "cardinality" aggregation gives number of unique values for  age field.
➜  Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d'
{
   "size" : 0,
    "aggs" : {
        "age_count" : {
             "cardinality" : {
                 "field" : "age"
             }
         }
    }
}
' -H 'Content-Type: application/json'
{
  "took" : 18,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_count" : {
      "value" : 58
    }
  }
}
Note:- Since Fielddata is disabled on text fields by default, if we replace field name as "gender", elastic search throws exception - "Fielddata is disabled on text fields by default. Set fielddata=true on [gender] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."

What is "fielddata" in context of Elasticsearch(ELS) ?
In ELS Text field values are stored in an in-memory data structure called fielddata. fielddata is built on demand when a field is used for aggregations, sorting etc. Since fielddata on text fields take up lots of heap space so fielddata is disabled by default on text fields.

How to enable fielddata on textfield in Elasticsearch ?

Using "_mapping" API of elasticsearch we can enable fielddata on textfield. Below sample query enable fielddata on gender.
➜  Desktop curl -XPUT 'localhost:9200/employees/_mapping/personal?pretty' -d'
{
  "properties": {
    "gender": {
      "type":     "text",
      "fielddata": true
    }
  }
}
' -H 'Content-Type: application/json'
{
  "acknowledged" : true
}

Once fielddata is enabled on gender we can execute query associated with cardinality aggregation with textfield . Below query display cardinality of gender as 2.
➜  Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d'
{
   "size" : 0,
    "aggs" : {
        "age_count" : {
             "cardinality" : {
                 "field" : "gender"
             }
         }
    }
}
' -H 'Content-Type: application/json'
{
  "took" : 23,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_count" : {
      "value" : 2
    }
  }
}

Bucketing aggregations

In elastic search each document is indexed and associated with some type name(logical grouping). For analytics purpose, we can logically group these indexed documents into buckets and each bucket satisfies some criterion, it is termed as Bucketing aggregation. Bucketing aggregation is perfomed using "_search" API.

Bucket aggregations by field values : Below query execute term aggregation query and divide all documents in index employee into two category. All 1000 documents is split into two blocks (524 and 476) documents.
➜  Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d'
{
   "size" : 0,
   "aggs" : {
        "gender_bucket" : {
             "terms" : {
                 "field" : "gender"
             }
         }
    }
}
' -H 'Content-Type: application/json'
{
  "took" : 16,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "gender_bucket" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "male",
          "doc_count" : 524
        },
        {
          "key" : "female",
          "doc_count" : 476
        }
      ]
    }
  }
}

Range aggregation with key/value response:  With <"keyed" : true>  we indicate that response of this query will gives in key/value pair. Below query displays response as key/value pair.
➜  Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d'
{
   "size" : 0,
   "aggs" : {
       "age_ranges" : {
           "range" : {
               "field" : "age",
               "keyed" : true,
               "ranges" : [
                   { "to" : 30 },
                   { "from" : 30, "to" : 40 },
                   { "from" : 40, "to" : 55 },
                   { "from" : 55 }
                ]
            }
        }
     }
}
' -H 'Content-Type: application/json'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_ranges" : {
      "buckets" : {
        "*-30.0" : {
          "to" : 30.0,
          "doc_count" : 216
        },
        "30.0-40.0" : {
          "from" : 30.0,
          "to" : 40.0,
          "doc_count" : 174
        },
        "40.0-55.0" : {
          "from" : 40.0,
          "to" : 55.0,
          "doc_count" : 248
        },
        "55.0-*" : {
          "from" : 55.0,
          "doc_count" : 362
        }
      }
    }
  }
} 
Range aggregation with custom key value : Here we are passing key values in as input and same is displayed as key in output like "young", "quarter-aged", etc. instead of  "30, "to" : 40" as displayed above.
➜  Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d'
{
   "size" : 0,
   "aggs" : {
       "age_ranges" : {
           "range" : {
               "field" : "age",
               "keyed" : true,
               "ranges" : [
                   { "key": "young", "to" : 35 },
                   { "key": "quarter-aged", "from" : 35, "to" : 45 },
                   { "key": "middle-aged", "from" : 45, "to" : 65 },
                   { "key": "senior", "from" : 55 }
                ]
            }
        }
     }
}
' -H 'Content-Type: application/json'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_ranges" : {
      "buckets" : {
        "young" : {
          "to" : 35.0,
          "doc_count" : 302
        },
        "quarter-aged" : {
          "from" : 35.0,
          "to" : 45.0,
          "doc_count" : 169
        },
        "middle-aged" : {
          "from" : 45.0,
          "to" : 65.0,
          "doc_count" : 324
        },
        "senior" : {
          "from" : 55.0,
          "doc_count" : 362
        }
      }
    }
  }
}

Nested aggregation - Multi-level aggregation 

Metric aggregation inside Bucketing aggregation (Two level nesting) : Find average age of each gender. Below query first perform bucketing aggregation and then average age is computed. Outer "aggs" keyword specifies a gender bucket and inner "aggs" does average computation.
➜  Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d'
{
   "size" : 0,
   "aggs" : {
        "gender_bucket" : {
             "terms" : {
                 "field" : "gender"
             },
             "aggs": {
                 "average_age": {
                      "avg": {
                          "field": "age"
                      }
                 }
              }
         }
    }
}
' -H 'Content-Type: application/json'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "gender_bucket" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "male",
          "doc_count" : 524,
          "average_age" : {
            "value" : 47.333969465648856
          }
        },
        {
          "key" : "female",
          "doc_count" : 476,
          "average_age" : {
            "value" : 45.71848739495798
          }
        }
      ]
    }
  }
}

Find average age of male and female within range of age (3 Level nesting):  Outermost "aggs" specifies bucketing aggregation which divides documents into two bucket (male and Female). Second "aggs" split buckets into range of age and finally inner "aggs" find average age of each rage of age.
➜  Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d'
{
   "size" : 0,
   "aggs" : {
        "gender_bucket" : {
             "terms" : {
                 "field" : "gender"
             },
             "aggs" : {
                 "age_ranges" : {
                     "range" : {
                         "field" : "age",
                         "keyed" : true,
                         "ranges" : [
                             { "key": "young", "to" : 35 },
                             { "key": "middle-aged", "from" : 35, "to" : 50 },
                             { "key": "senior", "from" : 55 }
                          ]
                      },
                      "aggs": {
                          "average_age": {
                               "avg": {
                                   "field": "age"
                               }
                          }
                       }
                  }
               }
         }
    }
}
' -H 'Content-Type: application/json'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "gender_bucket" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "male",
          "doc_count" : 524,
          "age_ranges" : {
            "buckets" : {
              "young" : {
                "to" : 35.0,
                "doc_count" : 145,
                "average_age" : {
                  "value" : 26.048275862068966
                }
              },
              "middle-aged" : {
                "from" : 35.0,
                "to" : 50.0,
                "doc_count" : 135,
                "average_age" : {
                  "value" : 42.19259259259259
                }
              },
              "senior" : {
                "from" : 55.0,
                "doc_count" : 192,
                "average_age" : {
                  "value" : 65.734375
                }
              }
            }
          }
        },
        {
          "key" : "female",
          "doc_count" : 476,
          "age_ranges" : {
            "buckets" : {
              "young" : {
                "to" : 35.0,
                "doc_count" : 157,
                "average_age" : {
                  "value" : 25.840764331210192
                }
              },
              "middle-aged" : {
                "from" : 35.0,
                "to" : 50.0,
                "doc_count" : 112,
                "average_age" : {
                  "value" : 41.517857142857146
                }
              },
              "senior" : {
                "from" : 55.0,
                "doc_count" : 170,
                "average_age" : {
                  "value" : 65.5
                }
              }
            }
          }
        }
      ]
    }
  }
}

Filter aggregation  

Find average salary of all employees who belongs to minnesota. It to also an nested aggregation where first filtering is applied followed by aggregation is applied to find average salary.
➜  Desktop curl -XPOST 'localhost:9200/employees/_search?size=0&pretty' -d'
{
    "aggs" : {
        "state" : {
            "filter" : { "term": { "state": "minnesota" } },
            "aggs" : {
                "avg_age" : { "avg" : { "field" : "salary" } }
            }
        }
    }
}
' -H 'Content-Type: application/json'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "state" : {
      "doc_count" : 17,
      "avg_age" : {
        "value" : 43855.35294117647
      }
    }
  }
}

====******======
Location: Bengaluru, Karnataka, India