
  • Elasticsearch Analyzer


    Elasticsearch ships with a wide range of built-in analyzers that can be used with any index directly, without any further configuration, for example the Standard, Simple, Whitespace, and Keyword analyzers.

    By default, Elasticsearch sorts string values by their ASCII (code point) order, so results come back with most special characters first, followed by digits, uppercase letters, and then lowercase letters.
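    To make that ordering concrete, here is a minimal sketch in Python, whose string comparison also works by code point. The titles are made-up sample data, and this only approximates what Elasticsearch does on an unanalyzed string field.

```python
# Python compares strings code point by code point, which matches ASCII
# order for these characters. The titles below are invented sample data.
titles = ["banana", "Apple", "123abc", "#tag", "apple"]

default_order = sorted(titles)
print(default_order)
# ['#tag', '123abc', 'Apple', 'apple', 'banana']
```

    Note how "#tag" and "123abc" come before every word, and "Apple" sorts ahead of "apple" only because of its capital letter.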

     

    Problem

     

    Sort results alphabetically, ignoring special characters and numbers.

     

    Solution

     

    In Elasticsearch 6, this can be achieved with a custom analyzer, which is the tool to reach for when the built-in analyzers do not meet your needs.

    The approach is to write a custom analyzer that ignores non-alphabetical characters, and then sort on a field that uses that analyzer.

     

    Step 1: Create a custom analyzer using a pattern replace character filter

     

    In the index settings, define a pattern replace character filter that removes every non-alphabetical character:

     

    "char_filter": {
      "alphabets_char_filter": {
        "type": "pattern_replace",
        "pattern": "[^a-zA-Z]",
        "replacement": ""
      }
    }

     

    Then reference the “alphabets_char_filter” defined above in a custom analyzer:

    "analysis": {
      "analyzer": {
        "alphabetsStringAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "alphabets_char_filter"
          ],
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "alphabets_char_filter": {
          "type": "pattern_replace",
          "pattern": "[^a-zA-Z]",
          "replacement": ""
        }
      }
    }
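    A rough way to see what this analyzer produces is to mimic it outside Elasticsearch. The sketch below is an approximation in Python, not the analyzer itself: the function name simulate_alphabets_analyzer and the sample titles are hypothetical, and it assumes the input is short enough that the standard tokenizer leaves the filtered text as a single token.

```python
import re

# Same character class as the "pattern" in alphabets_char_filter.
NON_ALPHA = re.compile(r"[^a-zA-Z]")


def simulate_alphabets_analyzer(text: str) -> str:
    """Approximate alphabetsStringAnalyzer: char filter, then lowercase."""
    # Character filter: drop everything that is not an ASCII letter.
    # Spaces are dropped too, so a typical title collapses into one token.
    letters_only = NON_ALPHA.sub("", text)
    # Token filter: lowercase, so "Apple" and "apple" yield the same term.
    return letters_only.lower()


# Invented sample titles, purely for illustration.
for title in ["#1 Apple Pie!", "2 fast 2 furious", "Zebra-Crossing"]:
    print(repr(title), "->", repr(simulate_alphabets_analyzer(title)))
```

    Because the character filter runs before the tokenizer and strips whitespace as well, each title ends up as one lowercase term, which is what makes the field usable as a sort key. On a real cluster, the _analyze API can be used to check the actual output.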
    


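    Step 2: Define the field mapping using the custom analyzer


    The next step is to define a field mapping that uses the new “alphabetsStringAnalyzer” analyzer. Here it is added as a “raw” sub-field of “title”, with “fielddata” enabled, which sorting on a text field requires: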

    "title": {
      "type": "text",
      "fields": {
        "raw": {
          "type": "text",
          "analyzer": "alphabetsStringAnalyzer",
          "fielddata": true
        }
      }
    }
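    The fragments from Steps 1 and 2 belong in a single index-creation request. As a sketch of how they might fit together, the Python snippet below assembles the full request body and prints it as JSON. Two things here are assumptions rather than part of the original setup: the mapping type name "_doc" (Elasticsearch 6 expects a type name under "mappings") and the idea of sending the body with a PUT request to an index of your choosing.

```python
import json

# Analysis settings from Step 1.
analysis = {
    "analyzer": {
        "alphabetsStringAnalyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "char_filter": ["alphabets_char_filter"],
            "filter": ["lowercase"],
        }
    },
    "char_filter": {
        "alphabets_char_filter": {
            "type": "pattern_replace",
            "pattern": "[^a-zA-Z]",
            "replacement": "",
        }
    },
}

# Field mapping from Step 2.
title_mapping = {
    "type": "text",
    "fields": {
        "raw": {
            "type": "text",
            "analyzer": "alphabetsStringAnalyzer",
            "fielddata": True,
        }
    },
}

index_body = {
    "settings": {"analysis": analysis},
    # Assumption: "_doc" is used as the mapping type name for Elasticsearch 6.
    "mappings": {"_doc": {"properties": {"title": title_mapping}}},
}

# Print the body that would be sent when creating the index.
print(json.dumps(index_body, indent=2))
```

    Keeping the settings and the mapping in one body ensures the analyzer exists by the time the “title.raw” field refers to it.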
    

     

    Step 3: Run a query that sorts on the new field

     

    {
      "sort": {
        "title.raw": "asc"
      },
      "query": {
        "term": {
          "title": "random"
        }
      }
    } 


    This returns the results in alphabetical order, ignoring non-alphabetical characters, which is the expected result.
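    To illustrate the difference in ordering, the sketch below compares the default code point order with the order you would expect from sorting on “title.raw”. It is a simulation in Python under stated assumptions: sort_key is a hypothetical stand-in for the analyzer, the titles are invented, and only the sort clause is modelled, not the term query.

```python
import re


def sort_key(title: str) -> str:
    """Hypothetical stand-in for the term indexed in title.raw."""
    # Strip non-letters (as the char filter does), then lowercase.
    return re.sub(r"[^a-zA-Z]", "", title).lower()


# Invented sample titles.
titles = ["#2 banana bread", "apple pie", "10 Cherry tarts", "Date squares"]

default_order = sorted(titles)                # plain code point order
analyzed_order = sorted(titles, key=sort_key)  # order by the analyzed term

print(default_order)
# ['#2 banana bread', '10 Cherry tarts', 'Date squares', 'apple pie']
print(analyzed_order)
# ['apple pie', '#2 banana bread', '10 Cherry tarts', 'Date squares']
```

    In the second list the leading "#2" and "10" no longer influence the position of a title, and capitalization is ignored as well.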


    Hope that helps.

     

     
