<?xml version="1.0" encoding="utf-8"?>
    <feed xmlns="http://www.w3.org/2005/Atom">
     <title>BigBinary Blog</title>
     <link href="https://www.bigbinary.com/feed.xml" rel="self"/>
     <link href="https://www.bigbinary.com/"/>
     <updated>2026-06-08T05:36:56+00:00</updated>
     <id>https://www.bigbinary.com/</id>
     <entry>
       <title><![CDATA[Improving search experience using Elasticsearch]]></title>
       <author><name>Sayooj Surendran</name></author>
      <link href="https://www.bigbinary.com/blog/elasticsearch-improvements"/>
      <updated>2024-10-29T12:00:00+00:00</updated>
      <id>https://www.bigbinary.com/blog/elasticsearch-improvements</id>
      <content type="html"><![CDATA[<p>We use Elasticsearch in <a href="https://neeto.com/neetocourse">NeetoCourse</a> for our search needs. Recently, we made some changes to the Elasticsearch config to improve the search experience. In this blog, we will share the changes we made and what we learned during the process.</p>
<h2>Definitions</h2>
<p>These are some of the Elasticsearch terms we use in this blog.</p>
<ul>
<li><p><strong>Document:</strong> A document in Elasticsearch is similar to a row in a database table. It is a collection of key-value pairs.</p></li>
<li><p><strong>Index:</strong> An index is a collection of documents. It is similar to a database table. Indexing is the process of adding documents to an index, and we can configure each step of this process.</p></li>
<li><p><strong>Analyzer:</strong> An analyzer converts a string into a list of searchable tokens. An analyzer is made up of three kinds of functions: character filters, a tokenizer, and token filters.</p></li>
<li><p><strong>Character Filter:</strong> A character filter is a function that filters out certain characters from the input string. For example, it can strip HTML tags so that only the text content is kept.</p></li>
<li><p><strong>Tokenizer:</strong> A tokenizer is a function that splits the input string into tokens. The split can be based on whitespace, punctuation, or any other character.</p></li>
<li><p><strong>Token Filter:</strong> A token filter is a function that filters out or modifies the tokens produced by the tokenizer. For example, it can be used to remove stop words (a, the, and, etc.)
which serve no purpose in the search.</p></li>
</ul>
<h2>Analyzer</h2>
<p>This is our analyzer setup:</p>
<pre><code class="language-js">{
  default: {
    tokenizer: &quot;whitespace&quot;,
    filter: [&quot;lowercase&quot;, &quot;autocomplete&quot;]
  },
  search: {
    tokenizer: &quot;whitespace&quot;,
    filter: [&quot;lowercase&quot;]
  },
  english_exact: {
    tokenizer: &quot;whitespace&quot;,
    filter: [
      &quot;lowercase&quot;
    ]
  }
}</code></pre>
<p><code>default</code> is the analyzer used for indexing and searching. The search terms from the user are passed through the <code>search</code> analyzer. <code>english_exact</code> is the analyzer used for exact matches.</p>
<h3>Tokenizer</h3>
<p>By default, Elasticsearch uses the <code>standard</code> tokenizer, which splits the input string on word boundaries and drops most punctuation. Since our content is mostly about technical concepts and programming, where punctuation and symbols are often part of a search term, we cannot use the <code>standard</code> tokenizer. The <code>whitespace</code> tokenizer splits the input string into tokens only on whitespace, which is suitable for our use case. Hence, we use the <code>whitespace</code> tokenizer for all the analyzers.</p>
<h3>Filter</h3>
<p>The <code>lowercase</code> filter converts all the tokens to lowercase before they are stored in the index. This is a common requirement as we want the search to be case-insensitive. We also use the custom <code>autocomplete</code> filter. Let's see its definition:</p>
<pre><code class="language-js">{
  autocomplete: {
    &quot;type&quot;: &quot;edge_ngram&quot;,
    &quot;min_gram&quot;: 3,
    &quot;max_gram&quot;: 20,
    &quot;preserve_original&quot;: true
  }
}</code></pre>
<p>The custom <code>autocomplete</code> filter is a customized <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenfilter.html">edge_ngram</a> token filter.
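</p>
<p>As a rough illustration of what an edge n-gram filter does with these settings, here is a minimal JavaScript sketch. <code>edgeNgrams</code> is a hypothetical helper, not an Elasticsearch API, and the lowercasing and whitespace split only approximate the <code>default</code> analyzer. The real filter runs inside Elasticsearch at index time, and the order in which it emits tokens may differ.</p>
<pre><code class="language-js">// Hypothetical helper (not an Elasticsearch API): mimics an edge_ngram
// token filter for a single token, using the settings from this post.
function edgeNgrams(token, minGram = 3, maxGram = 20, preserveOriginal = true) {
  const grams = [];
  const longest = Math.min(maxGram, token.length);

  // Collect every prefix whose length is between minGram and longest.
  for (let size = minGram; longest >= size; size += 1) {
    grams.push(token.slice(0, size));
  }

  // Assumption: preserve_original keeps tokens that fall outside the
  // min/max range, for example the two-letter word 'in'.
  if (preserveOriginal) {
    if (!grams.includes(token)) {
      grams.push(token);
    }
  }
  return grams;
}

// Lowercase and split on whitespace first, roughly like the default analyzer.
const tokens = 'Elephant in the room'
  .toLowerCase()
  .split(/\s+/)
  .flatMap((token) => edgeNgrams(token));

console.log(tokens);
// [ 'ele', 'elep', 'eleph', 'elepha', 'elephan', 'elephant',
//   'in', 'the', 'roo', 'room' ]
</code></pre>
<p>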
Let's see the result of this filter when applied on the phrase &quot;Elephant in the room&quot;.</p>
<pre><code class="language-js">{
  &quot;tokens&quot;: [
    {
      &quot;token&quot;: &quot;Ele&quot;,
      &quot;start_offset&quot;: 0,
      &quot;end_offset&quot;: 8,
      &quot;type&quot;: &quot;&lt;ALPHANUM&gt;&quot;,
      &quot;position&quot;: 0
    },
    {
      &quot;token&quot;: &quot;Elep&quot;,
      &quot;start_offset&quot;: 0,
      &quot;end_offset&quot;: 8,
      &quot;type&quot;: &quot;&lt;ALPHANUM&gt;&quot;,
      &quot;position&quot;: 0
    },
    {
      &quot;token&quot;: &quot;Eleph&quot;,
      &quot;start_offset&quot;: 0,
      &quot;end_offset&quot;: 8,
      &quot;type&quot;: &quot;&lt;ALPHANUM&gt;&quot;,
      &quot;position&quot;: 0
    },
    {
      &quot;token&quot;: &quot;Elepha&quot;,
      &quot;start_offset&quot;: 0,
      &quot;end_offset&quot;: 8,
      &quot;type&quot;: &quot;&lt;ALPHANUM&gt;&quot;,
      &quot;position&quot;: 0
    },
    {
      &quot;token&quot;: &quot;Elephan&quot;,
      &quot;start_offset&quot;: 0,
      &quot;end_offset&quot;: 8,
      &quot;type&quot;: &quot;&lt;ALPHANUM&gt;&quot;,
      &quot;position&quot;: 0
    },
    {
      &quot;token&quot;: &quot;Elephant&quot;,
      &quot;start_offset&quot;: 0,
      &quot;end_offset&quot;: 8,
      &quot;type&quot;: &quot;&lt;ALPHANUM&gt;&quot;,
      &quot;position&quot;: 0
    },
    {
      &quot;token&quot;: &quot;in&quot;,
      &quot;start_offset&quot;: 9,
      &quot;end_offset&quot;: 11,
      &quot;type&quot;: &quot;&lt;ALPHANUM&gt;&quot;,
      &quot;position&quot;: 1
    },
    {
      &quot;token&quot;: &quot;the&quot;,
      &quot;start_offset&quot;: 12,
      &quot;end_offset&quot;: 15,
      &quot;type&quot;: &quot;&lt;ALPHANUM&gt;&quot;,
      &quot;position&quot;: 2
    },
    {
      &quot;token&quot;: &quot;roo&quot;,
      &quot;start_offset&quot;: 16,
      &quot;end_offset&quot;: 20,
      &quot;type&quot;: &quot;&lt;ALPHANUM&gt;&quot;,
      &quot;position&quot;: 3
    },
    {
      &quot;token&quot;: &quot;room&quot;,
      &quot;start_offset&quot;: 16,
      &quot;end_offset&quot;: 20,
      &quot;type&quot;: &quot;&lt;ALPHANUM&gt;&quot;,
      &quot;position&quot;: 3
    }
  ]
}</code></pre>
<p>The <code>edge_ngram</code> filter creates <a href="https://en.wikipedia.org/wiki/N-gram">n-grams</a> starting from the first character of each token. Since we specified <code>min_gram</code> as 3 and <code>max_gram</code> as 20, the filter emits every prefix of a token that is between 3 and 20 characters long. Because <code>preserve_original</code> is true, tokens shorter than 3 characters, such as &quot;in&quot;, are kept as they are. This means that the index contains a token for every 3 to 20 letter prefix of a word. If we start typing &quot;ELEP&quot;, the lowercased search term matches the &quot;elep&quot; token in the index and we will get the result &quot;Elephant in the room&quot;.</p>
<h2>The Query</h2>
<p>The query object in the search request is equally important for getting relevant results. Elasticsearch sorts matching search results by <em>relevance score</em>, which measures how well each document matches a query. The query clause contains the criteria for calculating the relevance score.
This is our query object:</p>
<pre><code class="language-js">{
  bool: {
    should: [
      {
        simple_query_string: {
          query: requestBody.searchTerm,
          fields: [&quot;content&quot;, &quot;meta.pageTitle&quot;, &quot;meta.chapterTitle&quot;],
          quote_field_suffix: &quot;.exact&quot;,
        },
      },
      {
        match: {
          content: {
            query: requestBody.searchTerm,
            ...baseFuzzyQueryConfig(),
          },
        },
      },
      {
        match: {
          &quot;meta.pageTitle&quot;: {
            query: requestBody.searchTerm,
            ...baseFuzzyQueryConfig(),
            boost: 1.5,
          },
        },
      },
      {
        match: {
          &quot;meta.chapterTitle&quot;: {
            query: requestBody.searchTerm,
            ...baseFuzzyQueryConfig(),
            boost: 1.5,
          },
        },
      },
    ],
    minimum_should_match: 1,
  },
};</code></pre>
<p>The <code>bool</code> and <code>should</code> clauses are used to create a compound query. The <code>should</code> clause, together with <code>minimum_should_match: 1</code>, means that at least one of the queries must match. Here, <code>match</code> is the standard query used for full-text search, and <code>content</code>, <code>meta.pageTitle</code> and <code>meta.chapterTitle</code> are the fields that we created while indexing the data.</p>
<p>We have provided a <code>boost</code> value of 1.5 for the page title and chapter title. This makes a match in a page or chapter title score higher than a match in the body content of a page.</p>
<p>The <code>simple_query_string</code> query is used for exact matches. When the search term contains double quotes, the quoted text is searched in the <code>.exact</code> variant of each field (set through <code>quote_field_suffix</code>), and the <code>english_exact</code> analyzer is used.
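</p>
<p>The mapping is not shown in this post, so the following is only a sketch of how an <code>.exact</code> sub-field could be declared so that <code>quote_field_suffix</code> has a field to target. It assumes <code>content</code> is a <code>text</code> field. The actual NeetoCourse mapping may differ.</p>
<pre><code class="language-js">{
  properties: {
    content: {
      type: &quot;text&quot;,
      fields: {
        // Assumed sub-field: quoted search terms are matched against content.exact
        exact: {
          type: &quot;text&quot;,
          analyzer: &quot;english_exact&quot;
        }
      }
    }
  }
}</code></pre>
<p>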
The double quotes operator ( <code>&quot;</code> ) is one of the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html#simple-query-string-syntax">several operators</a> that can be used in the <code>simple_query_string</code> query.</p>
<h3>Fuzzy searching</h3>
<p>We also use fuzzy searching in the <code>match</code> query. Fuzzy searching helps in giving proper results even if there are typos in the search term. Elasticsearch uses <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> to calculate the similarity between the search term and the indexed data. Previously, we used the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html">fuzzy query</a> to implement fuzzy searching.</p>
<pre><code class="language-js">{
  content: {
    value: requestBody.searchTerm,
    ...extendedFuzzyQueryConfig(),
  },
},</code></pre>
<p>But this caused several issues like:</p>
<ul>
<li>Fuzzy results being prioritized over exact matches. For example, searching for &quot;five&quot; returned results for &quot;dive&quot;, even when &quot;five&quot; was present in the content.</li>
<li>Fuzzy results not being returned when the search term contained multiple words.</li>
</ul>
<p>Upon investigation, we found that the <code>fuzzy</code> query does not perform analysis on the search term. It is a term-level query, so a multi-word search term is compared as a single term against the indexed tokens. Instead, we now use the <code>match</code> query with the <code>fuzziness</code> parameter.</p>
<pre><code class="language-js">const baseFuzzyQueryConfig = () =&gt; ({
  prefix_length: 0,
  fuzziness: &quot;AUTO&quot;,
});
...
{
  content: {
    query: requestBody.searchTerm,
    ...baseFuzzyQueryConfig(),
  },
},</code></pre>
<h2>Conclusion</h2>
<p>There is no silver bullet in the case of Elasticsearch configuration, and there is no single metric to determine if the search is giving quality results. We have tweaked our configuration based on trial and error and on user feedback.
We hope these techniques are useful for anyone who is looking to improve their search experience.</p>]]></content>
    </entry>
     </feed>