Elasticsearch custom analyzer to ignore special characters

Posted By Mohd Adnan | 31-Dec-2018

Elasticsearch ships with a wide range of built-in analyzers that can be used on any index without further configuration, for example the Standard, Simple, Whitespace, and Keyword analyzers.

By default, Elasticsearch sorts string values by the code points of their characters (ASCII order for plain English text), so most special characters come first, followed by digits, uppercase letters, and then lowercase letters.
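As a rough illustration of code-point ordering, the Python sketch below sorts a few made-up titles the same way. Python's `sorted` compares strings by code point, so it is only an analogy for how an unanalyzed string field is ordered, not a call into Elasticsearch, and the sample titles are hypothetical.

```python
# Hypothetical sample titles, not taken from any real index.
titles = ["apple", "Banana", "#hashtag", "42 ways", "cherry"]

# Plain code-point ordering: '#' (35) < '4' (52) < 'B' (66) < 'a' (97) < 'c' (99).
default_order = sorted(titles)

print(default_order)
# → ['#hashtag', '42 ways', 'Banana', 'apple', 'cherry']
```

Note that "Banana" lands before "apple" because uppercase letters have lower code points than lowercase ones.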

 

Problem

 

Achieve alphabetical sorting that ignores special characters and numbers.

 

Solution

 

In Elasticsearch 6, this can be achieved with a custom analyzer when the built-in analyzers do not meet your needs.

The approach is to write a custom analyzer that strips non-alphabetical characters, apply it to a field, and then sort on that field when querying.

 

Step 1: Create a custom analyzer using a pattern replace character filter

 

In the index settings, define a pattern replace character filter that removes every non-alphabetical character:

 

"char_filter": {
  "alphabets_char_filter": {
    "type": "pattern_replace",
    "pattern": "[^a-zA-Z]",
    "replacement": ""
  }
}

 

Then create a custom analyzer in the index settings that uses the “alphabets_char_filter” defined above:

"analysis": {
  "analyzer": {
    "alphabetsStringAnalyzer": {
      "tokenizer": "standard",
      "filter": "lowercase",
      "type":"custom",
      "char_filter": [
        "alphabets_char_filter"
      ]
    }
  },
  "char_filter": {
    "alphabets_char_filter": {
      "type": "pattern_replace",
      "pattern": "[^a-zA-Z]",
      "replacement": ""
    }
  }
}
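To get a feel for what this analyzer should produce, here is a minimal Python sketch that imitates the chain outside Elasticsearch: the same regular expression strips non-letters, then the result is lowercased. The function name `simulate_sort_key` and the sample titles are hypothetical. The sketch also assumes that, because the pattern removes whitespace too, the standard tokenizer sees one unbroken run of letters and emits a single token for typical title lengths.

```python
import re

# Same pattern as the "alphabets_char_filter" above.
NON_ALPHA = re.compile(r"[^a-zA-Z]")


def simulate_sort_key(text: str) -> str:
    """Approximate the analyzer: strip non-letters, then lowercase."""
    stripped = NON_ALPHA.sub("", text)  # char_filter step
    return stripped.lower()             # lowercase token filter step


# Hypothetical sample titles.
titles = ["apple", "Banana", "#hashtag", "42 ways", "cherry"]

ordered = sorted(titles, key=simulate_sort_key)

print(simulate_sort_key("42 ways"))  # → ways
print(ordered)
# → ['apple', 'Banana', 'cherry', '#hashtag', '42 ways']
```

This only approximates the analysis chain in plain Python; the actual tokens in an index can be checked with the `_analyze` API.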


Step 2: Define a field mapping that uses the custom analyzer


The next step is to add a field mapping that uses the new “alphabetsStringAnalyzer” analyzer. Because the sub-field is of type text, fielddata must be enabled on it before it can be used for sorting:

"title": {
  "type": "text",
  "fields": {
    "raw": {
      "type": "text",
      "analyzer": "alphabetsStringAnalyzer",
      "fielddata" : true
    }
  }
}
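The settings from Step 1 and the mapping from Step 2 are fragments of one create-index request. The sketch below assembles them into a single body in Python and prints it as JSON, which could then be sent in a `PUT` request when creating the index. The mapping type name `_doc` is an assumption (Elasticsearch 6 expects a single mapping type per index), and the token filter is written in list form here.

```python
import json

ANALYZER_NAME = "alphabetsStringAnalyzer"
CHAR_FILTER_NAME = "alphabets_char_filter"

create_index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                ANALYZER_NAME: {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],  # list form of "lowercase"
                    "char_filter": [CHAR_FILTER_NAME],
                }
            },
            "char_filter": {
                CHAR_FILTER_NAME: {
                    "type": "pattern_replace",
                    "pattern": "[^a-zA-Z]",
                    "replacement": "",
                }
            },
        }
    },
    "mappings": {
        "_doc": {  # assumed mapping type name, adjust to your index
            "properties": {
                "title": {
                    "type": "text",
                    "fields": {
                        "raw": {
                            "type": "text",
                            "analyzer": ANALYZER_NAME,
                            "fielddata": True,
                        }
                    },
                }
            }
        }
    },
}

# Serialize to the JSON that would be used as the request body.
request_json = json.dumps(create_index_body, indent=2)
print(request_json)
```

Keeping the analyzer and char filter names in constants makes it harder for the mapping to reference an analyzer that the settings do not define.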

 

Step 3: Run a query that sorts on the new field

 

{
  "sort": {
    "title.raw": "asc"
  },
  "query": {
    "term": {
      "title": "random"
    }
  }
} 


This returns results sorted alphabetically, ignoring non-alphabetical characters, which is the expected result.


Hope that helps.

 

 
