Elasticsearch custom analyzer to ignore special characters

Posted By : Mohd Adnan | 31-Dec-2018


Elasticsearch ships with a wide range of built-in analyzers that can be used with any index directly, without further configuration: for example, the Standard Analyzer, Simple Analyzer, Whitespace Analyzer, and Keyword Analyzer.

By default, sorting on a string field in Elasticsearch follows character code order (ASCII order for basic Latin text), so most special characters come first, followed by digits, uppercase letters, and then lowercase letters.
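
To make that ordering concrete, here is a small sketch in plain Python (not Elasticsearch itself). Python compares strings by code point, which matches ASCII order for these characters. The sample titles are made up for illustration.

```python
# Hypothetical sample titles, not taken from a real index.
titles = ["banana", "Apple", "#hashtag", "42 rules", "cherry"]

# Strings are compared by code point:
# '#' (35) < '4' (52) < 'A' (65) < 'b' (98) < 'c' (99)
ordered = sorted(titles)
print(ordered)
# ['#hashtag', '42 rules', 'Apple', 'banana', 'cherry']
```

The title starting with a special character sorts first and the lowercase titles sort last, which is rarely what a user expects from an alphabetical list.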

 

Problem

 

To achieve alphabetical sorting that ignores special characters and numbers.

 

Solution

 

In Elasticsearch 6, this can be achieved with a custom analyzer, which is the tool to reach for when the built-in analyzers do not meet your needs.

The approach is to write a custom analyzer that ignores non-alphabetical characters and then sort on a field that uses it.

 

Step 1: Create a custom analyzer using a pattern replace character filter

 

In the index settings, define a pattern replace character filter that removes all non-alphabetical characters:

 

"char_filter": {
  "alphabets_char_filter": {
    "type": "pattern_replace",
    "pattern": "[^a-zA-Z]",
    "replacement": ""
  }
}
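
As a rough illustration of what this pattern does, the sketch below applies the same character class with Python's `re` module. It only approximates the character filter, and `strip_non_alpha` is a hypothetical helper name. Note that the pattern also removes spaces and any letter outside A-Z, such as accented characters.

```python
import re

# Same character class as the "pattern" value in the char filter.
NON_ALPHA = re.compile(r"[^a-zA-Z]")

def strip_non_alpha(text: str) -> str:
    """Remove every character that is not an ASCII letter (hypothetical helper)."""
    return NON_ALPHA.sub("", text)

print(strip_non_alpha("#1 Hello, World!"))  # HelloWorld
```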

 

Then use the “alphabets_char_filter” character filter to create a custom analyzer in the index settings:

"analysis": {
  "analyzer": {
    "alphabetsStringAnalyzer": {
      "tokenizer": "standard",
      "filter": ["lowercase"],
      "type": "custom",
      "char_filter": [
        "alphabets_char_filter"
      ]
    }
  },
  "char_filter": {
    "alphabets_char_filter": {
      "type": "pattern_replace",
      "pattern": "[^a-zA-Z]",
      "replacement": ""
    }
  }
}
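
One way to check the analyzer once the index exists is the `_analyze` API. The sketch below only builds the request path and body in Python and does not send anything. The index name `articles` and the sample text are assumptions made for illustration.

```python
import json

# Hypothetical index name, not from the article.
INDEX_NAME = "articles"

# Request path and body for the _analyze API of that index.
analyze_path = f"/{INDEX_NAME}/_analyze"
analyze_body = {
    "analyzer": "alphabetsStringAnalyzer",
    "text": "#1 Hello, World!",
}

print("GET", analyze_path)
print(json.dumps(analyze_body, indent=2))
```

With the settings above, one would expect the response to contain a single token, `helloworld`, since the character filter strips everything except letters before tokenization and the lowercase filter runs afterwards.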


Step 2: Define field mapping of the index using the custom analyzer


The next step is to define a field mapping that uses the new “alphabetsStringAnalyzer” analyzer. Because text fields are not sortable by default, fielddata must be enabled on the sub-field:

"title": {
  "type": "text",
  "fields": {
    "raw": {
      "type": "text",
      "analyzer": "alphabetsStringAnalyzer",
      "fielddata" : true
    }
  }
}
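
For reference, the fragments from Steps 1 and 2 could be combined into a single index creation body along the following lines. This is a sketch under assumptions: the index name `articles` is hypothetical, and the mapping type `_doc` is assumed because Elasticsearch 6 still expects a type name in mappings.

```python
import json

# Hypothetical index name; "_doc" is an assumed mapping type for Elasticsearch 6.
INDEX_NAME = "articles"

create_index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "alphabetsStringAnalyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                    "char_filter": ["alphabets_char_filter"],
                }
            },
            "char_filter": {
                "alphabets_char_filter": {
                    "type": "pattern_replace",
                    "pattern": "[^a-zA-Z]",
                    "replacement": "",
                }
            },
        }
    },
    "mappings": {
        "_doc": {
            "properties": {
                "title": {
                    "type": "text",
                    "fields": {
                        "raw": {
                            "type": "text",
                            "analyzer": "alphabetsStringAnalyzer",
                            "fielddata": True,  # needed to sort on a text field
                        }
                    },
                }
            }
        }
    },
}

# Body for: PUT /articles
print(json.dumps(create_index_body, indent=2))
```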

 

Step 3: Run a query that sorts on the new field

 

{
  "sort": {
    "title.raw": "asc"
  },
  "query": {
    "term": {
      "title": "random"
    }
  }
} 


This returns the results sorted alphabetically, ignoring non-alphabetical characters, which is the expected result.
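
The effect on ordering can be approximated outside Elasticsearch with a short Python sketch. The sample titles and the `sort_key` helper are hypothetical. The key mimics the analyzer by stripping non-letters and lowercasing, which leaves one term per title.

```python
import re

# Hypothetical sample titles, not taken from a real index.
titles = ["#zebra", "123 apple", "Mango!", "banana"]

def sort_key(title: str) -> str:
    """Approximate the analyzer: strip non-letters, then lowercase."""
    return re.sub(r"[^a-zA-Z]", "", title).lower()

ordered = sorted(titles, key=sort_key)
print(ordered)
# ['123 apple', 'banana', 'Mango!', '#zebra']
```

The original titles are returned unchanged. Only the value used for comparison is normalized, which mirrors how sorting on `title.raw` leaves the stored `title` intact.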


Hope that helps.

 

 

