2016-02-21
Software Development

ELASTICSEARCH GOTCHAS - PART 1

Michal Rogowski


Elasticsearch is a search engine based on a trusted and mature library - Apache Lucene. Huge activity in the project's git repository and implementations in projects such as GitHub, SoundCloud, Stack Overflow and LinkedIn bear testimony to its great popularity. The "Elastic" part says it all about the nature of the system, whose capabilities are enormous: from a simple file search on a small scale, through knowledge discovery, to big data analysis in real time.

What makes Elastic more powerful than the competition is its set of default configurations and behaviors, which let you create a cluster and start adding documents to the index in a couple of minutes. Elastic will configure the cluster for you, define an index and infer the field types from the first document it receives, and when you add another server, it will automatically take care of dividing the index data between the servers.

Unfortunately, this automation obscures what the default settings imply, and it often turns out to be misleading. This article starts a series in which I will deal with the most popular gotchas you might encounter while building an Elastic-based app.

The number of shards cannot be changed

Let's index the first document using the index API:

$ curl -XPUT 'http://localhost:9200/myindex/employee/1' -d '{
    "first_name" :   "Jane",
    "last_name" :    "Smith",
    "street_number": 12
  }'

At this moment, Elastic creates the index myindex for us. What is not visible here is the number of shards assigned to the index. Shards can be understood as individual processes responsible for indexing, storing and searching a portion of the documents of the whole index. During document indexing, Elastic decides in which shard a document should be placed, based on the following formula:

shard = hash(document_id) % number_of_primary_shards

It is now clear why the number of primary shards cannot be changed for an index that already contains documents: with a different shard count, the formula would route the same document ids to different shards, and the existing documents could no longer be found. So, before indexing the first document, always create the index manually, specifying a number of shards that you consider sufficient for the volume of indexed data:

$ curl -XPUT 'http://localhost:9200/myindex/' -d '{
    "settings" : {
      "number_of_shards" : 10
    }
  }'

The default value of number_of_shards is 5. This means that the index can be scaled out to at most 5 servers collecting data during indexing. For a production environment, the number of shards should be set depending on the expected indexing frequency and the size of the documents. For development and testing environments, I recommend setting the value to 1 - why? That will be explained in the next paragraph of this article.
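
To check which settings Elastic chose for an automatically created index (including the shard count), you can query the index settings - a quick sanity check, assuming Elastic is running locally on the default port:

$ curl -XGET 'http://localhost:9200/myindex/_settings?pretty'

The response lists index.number_of_shards and index.number_of_replicas for the index.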

Sorting the text search results with a relatively small number of documents

When we search for a document with a phrase:

$ curl -XGET 'http://localhost:9200/myindex/my_type/_search' -d '{
    "query": {
      "match": {
        "title": "The quick brown fox"
      }
    }
  }'

Simply speaking, Elastic processes a text search in a few steps:

  1. the phrase from the request is converted into the same form the documents were indexed in; in our case it will be the set of terms ["quick", "brown", "fox"] ("the" is removed because it's insignificant),
  2. the index is browsed to find documents that contain at least one of the searched terms,
  3. every matching document is evaluated for relevance to the search phrase,
  4. the results are sorted by the calculated relevance and the first page of results is returned to the user.

In the third step, the following values (among others) are taken into account:

  1. how many words from the search phrase appear in the document,
  2. how often a given word occurs in the document (TF - term frequency),
  3. whether and how often the matching words occur in other documents (IDF - inverse document frequency) - the more popular a word is in other documents, the less significant it is,
  4. how long the document is.

The functioning of IDF is what matters to us here. For performance reasons, Elastic does not calculate this value across every document in the index - instead, each shard (index worker) calculates its local IDF and uses it for scoring. Therefore, when searching an index with a low number of documents, we may obtain substantially different results depending on the number of shards in the index and the distribution of documents among them.
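
To see exactly how a given score was calculated on a shard (including the local document frequency that feeds into IDF), you can ask Elastic for a scoring explanation; a minimal sketch reusing the query from the example above:

$ curl -XGET 'http://localhost:9200/myindex/my_type/_search' -d '{
    "explain": true,
    "query": {
      "match": {
        "title": "The quick brown fox"
      }
    }
  }'

Each hit in the response then carries an _explanation section breaking the score down into its TF/IDF components.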

Let's imagine that we have 2 shards in an index: in the first one there are 8 documents indexed with the word "fox", and in the second one only 2 documents with the same word. As a result, the IDF of the word "fox" will differ significantly between the two shards, and this may produce incorrectly ordered results. Therefore, for development purposes, an index consisting of only one primary shard should be created:

$ curl -XPUT 'http://localhost:9200/myindex/' -d '{ "settings" : { "number_of_shards" : 1 } }'

Viewing the results of “far” search pages kills your cluster

As I mentioned in the previous paragraphs, the documents in an index are divided between completely independent index processes - shards. Every shard deals only with the documents assigned to it.

When we search an index with millions of documents and expect the top 10 results, every shard must return its 10 best-matched results to the cluster node that initiated the search. The responses from all shards are then merged, and the top 10 results (within the whole index) are chosen. Such an approach allows the search process to be distributed efficiently across many servers.

Let's imagine that our app allows viewing 50 results per page, with no restriction on the number of pages a user can view. Remember that our index consists of 10 primary shards (1 per server).

Let's see what acquiring the search results looks like for the 1st and the 100th page:

Page No. 1 of search results:

  1. The node which receives a query (controller) passes it on to 10 shards.
  2. Every shard returns its 50 best matching documents sorted by relevance.
  3. After the responses have been received from every shard, the controller merges the results (500 documents).
  4. Our results are the top 50 documents from the previous step.

Page No. 100 of search results:

  1. The node which receives a query (controller) passes it on to 10 shards.
  2. Every shard returns its 5000 best matching documents sorted by relevance.
  3. After receiving responses from every shard, the controller merges the results (50000 documents).
  4. Our results are the documents from the previous step positioned 4951 - 5000.

Assuming that one document is 1 KB in size, in the second case around 50 MB of data must be sent and processed around the cluster just to show one user the 100th page of results.
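
For reference, this is roughly what the request for the 100th page looks like with the standard from/size pagination (the title field and the query term are just placeholders):

$ curl -XGET 'http://localhost:9200/myindex/my_type/_search' -d '{
    "from" : 4950,
    "size" : 50,
    "query": {
      "match": {
        "title": "fox"
      }
    }
  }'

from is the number of results to skip, so page n with 50 results per page translates to from = (n - 1) * 50.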

It's not hard to notice that network traffic and index load increase significantly with each successive result page. That's why it is not recommended to make the "far" search pages available to the user. If our index is well configured, the user should find the results they are interested in on the first few pages, and we protect our cluster from unnecessary load. To confirm this rule, check up to what page number the most popular web search engines let you browse the results.

It is also interesting to observe the response time for successive search result pages. For example, below are the response times for individual search result pages in Google Search (the search term was "search engine"):

Search result page (10 documents per page)    Response time
1                                             250 ms
10                                            290 ms
20                                            350 ms
30                                            380 ms
38 (last one available)

In the next part, I will look closer into the problems regarding document indexing.
