The Codest
  • Sobre nós
  • Serviços
    • Desenvolvimento de software
      • Desenvolvimento de front-end
      • Desenvolvimento backend
    • Staff Augmentation
      • Programadores Frontend
      • Programadores de back-end
      • Engenheiros de dados
      • Engenheiros de nuvem
      • Engenheiros de GQ
      • Outros
    • Aconselhamento
      • Auditoria e consultoria
  • Indústrias
    • Fintech e Banca
    • E-commerce
    • Adtech
    • Tecnologia da saúde
    • Fabrico
    • Logística
    • Automóvel
    • IOT
  • Valor para
    • CEO
    • CTO
    • Gestor de entregas
  • A nossa equipa
  • Case Studies
  • Saber como
    • Blogue
    • Encontros
    • Webinars
    • Recursos
Carreiras Entrar em contacto
  • Sobre nós
  • Serviços
    • Desenvolvimento de software
      • Desenvolvimento de front-end
      • Desenvolvimento backend
    • Staff Augmentation
      • Programadores Frontend
      • Programadores de back-end
      • Engenheiros de dados
      • Engenheiros de nuvem
      • Engenheiros de GQ
      • Outros
    • Aconselhamento
      • Auditoria e consultoria
  • Valor para
    • CEO
    • CTO
    • Gestor de entregas
  • A nossa equipa
  • Case Studies
  • Saber como
    • Blogue
    • Encontros
    • Webinars
    • Recursos
Carreiras Entrar em contacto
Seta para trás VOLTAR
2016-02-21
Desenvolvimento de software

ELASTICSEARCH GOTCHAS – PART 1

Michal Rogowski

Hello friends! There are people with different levels of experience contributing to our forums – and that’s great! But right now, I’m looking for true Ruby titans!

Elasticsearch is a search engine that is based on a trusted and mature library – Apache Lucene. Huge activity in git projeto repository and the implementation in such projects as GitHub, SoundCloud, Stack Overflow and LinkedIn bear testimony to its great popularity. The part “Elastic” says it all about the nature of the system, whose capabilities are enormous: from a simple file search on a small scale, through knowledge discovery, to big data analysis in real time.What makes Elastic a more powerful than the competition is the set of default configurations and behaviors, which allow to create a cluster and start adding documents to the index in a couple of minutes. Elastic will configure a cluster for you, will define an index and define the types of fields for the first document obtained, and when you add another server, it will automatically deal with dividing index data between servers.Unfortunately, the above mentioned automation makes it unclear to nós what the default settings implicate, and it often turns out to be misleading. This article starts a series where I will be dealing with the most popular gotchas, which you might encounter during the process of Elastic-based app creation.

The number of shards cannot be changed

Let’s index the first document using index API:

 $ curl -XPUT 'http://localhost:9200/myindex/employee/1' -d '{
     "first_name" :   "Jane",
     "last_name" :    "Smith",
     "steet_number":  12
   }'

In this moment, Elastic creates an index for us, titled myindex. What is not visible here is the number of shards assigned to the index. Shards can be understood as individual processes responsible for indexing, storing and searching of some part of documents of a whole index. During the process of document indexing, elastic decides in which shard a document should be found. That is based on the following formula:

shard = hash(document_id) % number_of_primary_shards

It is now clear that the number of primary shards cannot be changed for an index that contains documents. So, before indexing the first document, always create an index manually, giving the number of shards, which you think is sufficient for a volume of indexed data:

 $ curl -XPUT 'http://localhost:9200/myindex/' -d '{
     "settings" : {
       "number_of_shards" : 10
     }
   }'

Default value for number_of_shards is 5. This means that the index can be scaled to up to 5 servers, which collect data during indexation. For the production environment, the value of shards should be set depending on the expected frequency of indexation and the size of documents. For development and testing environments, I recommend setting the value to 1 – why so? It will be explained in the next paragraph of this article.

Sorting the text search results with a relatively small number of documents

When we search for a document with a phrase:

 $ curl -XGET 'http://localhost:9200/myindex/my_type/_search' -d
   '{
     "query": {
       "match": {
         "title": "The quick brown fox"
       }
     }
   }'

Elastic processes text search in few steps, simply speaking:

  1. phrase from request is converted into the same identical form as the document was indexed in, in our case it will be set of terms: [“quick”, “brown”, “fox”] (“the” is removed because it’s insignificant),
  2. the index is being browsed to search the documents that contain at least one of the searched words,
  3. every document that is a match, is evaluated in terms of being relevant to the search phrase,
  4. the results are sorted by the calculated relevance and the first page of results is returned to the user.

In the third step, the following values (among others) are taken into account:

  1. how many words from the search phrase are in the document
  2. how often a given word occurs in a document (TF – term frequency)
  3. whether and how often the matching words occur in other documents (IDF – inverse document frequency) – the more popular the word in other documents, the less significant
  4. how long is the document

The functioning of IDF is important to us. Elastic for performance reasons does not calculate this value regarding every document in the index – instead, every shard (index worker) calculates its local IDF and uses it for sorting. Therefore, during the index search with low number of documents we may obtain substantially different results depending on the number of shards in an index and document distribution.

Let’s imagine that we have 2 shards in an index; in the first one there are 8 documents indexed with the word “fox”, and in the second one only 2 documents with the same word. As a result, the word “fox” will differ significantly in both shards, and this may produce incorrect results. Therefore, an index consisting of only one primary shard should be created for development purposes:

 $ curl -XPUT 'http://localhost:9200/myindex/' -d
   '{"settings" : { "number_of_shards" : 1 } }'

Viewing the results of “far” search pages kills your cluster

As I’ve written before in previous paragraphs, documents in an index are shared between totally individual index processes – shards. Every process is completely independent and deals only with the documents, which are assigned to it.

When we search an index with millions of documents and wait to obtain top 10 results, every shard must return its 10 best-matched results to the cluster’s nó, which initiated the search. Then the responses from every shard are joined together and the top 10 search results are chosen (within the whole index). Such approach allows to efficiently distribute the search process between many servers.

Let’s imagine that our app allows viewing 50 results per page, without the restrictions regarding the number of pages that can be viewed by a user. Remember that our index consists of 10 primary shards (1 per server).

Let’s see how the acquiring of search results will look like for the 1st and the 100th page:

Page No. 1 of search results:

  1. The node which receives a query (controller) passes it on to 10 shards.
  2. Every shard returns its 50 best matching documents sorted by relevance.
  3. After the responses has been received from every shard, the controller merges the results (500 documents).
  4. Our results are the top 50 documents from the previous step.

Page No. 100 of search results:

  1. The node which receives a query (controller) passes it on to 10 shards.
  2. Every shard returns its 5000 best matching documents sorted by relevance.
  3. After receiving responses from every shard, the controller merges the results (50000 documents).
  4. Our results are the documents from the previous step positioned 4901 – 5000.

Assuming that one document is 1KB in size, in the second case it means that ~50MB of data must be sent and processed around the cluster, in order to view 100 results for one user.

It’s not hard to notice, that network traffic and index load increases significantly with each successive result page. That’s why it is not recommended to make the “far” search pages available to the user. If our index is well configured, than the user should find the result he’s interested in on the first search pages, and we’ll protect ourselves from unnecessary load of our cluster. To prove this rule, check, up to what number of search result pages do the most popular web search engines allow viewing.

What’s also interesting is the observation of browser response time for successive search result pages. For example, below you can find response times for individual search result pages in Google Search (the search term was “search engine”):

| Search result page (10 documents per page) | Response time |
|——————————————–|—————|
| 1 | 250ms |
| 10 | 290ms |
| 20 | 350ms |
| 30 | 380ms |
| 38 (last one available) | |

In the next part, I will look closer into the problems regarding document indexing.

Artigos relacionados

Ilustração de uma aplicação de cuidados de saúde para smartphone com um ícone de coração e um gráfico de saúde em ascensão, com o logótipo The Codest, representando soluções digitais de saúde e HealthTech.
Desenvolvimento de software

Softwares para o setor de saúde: Tipos, casos de uso

As ferramentas em que as organizações de cuidados de saúde confiam atualmente não se assemelham em nada às fichas de papel de há décadas atrás. O software de cuidados de saúde apoia agora os sistemas de saúde, os cuidados aos doentes e a prestação de cuidados de saúde modernos em...

OCODEST
Ilustração abstrata de um gráfico de barras em declínio com uma seta ascendente e uma moeda de ouro que simboliza a eficiência ou a poupança de custos. O logótipo The Codest aparece no canto superior esquerdo com o slogan "In Code We Trust" sobre um fundo cinzento claro
Desenvolvimento de software

Como dimensionar a sua equipa de desenvolvimento sem perder a qualidade do produto

Aumentar a sua equipa de desenvolvimento? Saiba como crescer sem sacrificar a qualidade do produto. Este guia cobre sinais de que é hora de escalar, estrutura da equipe, contratação, liderança e ferramentas - além de como o The Codest pode...

OCODEST
Desenvolvimento de software

Construir aplicações Web preparadas para o futuro: ideias da equipa de especialistas do The Codest

Descubra como o The Codest se destaca na criação de aplicações web escaláveis e interactivas com tecnologias de ponta, proporcionando experiências de utilizador perfeitas em todas as plataformas. Saiba como a nossa experiência impulsiona a transformação digital e o negócio...

OCODEST
Desenvolvimento de software

As 10 principais empresas de desenvolvimento de software sediadas na Letónia

Saiba mais sobre as principais empresas de desenvolvimento de software da Letónia e as suas soluções inovadoras no nosso último artigo. Descubra como estes líderes tecnológicos podem ajudar a elevar o seu negócio.

thecodest
Soluções para empresas e escalas

Fundamentos do desenvolvimento de software Java: Um Guia para Terceirizar com Sucesso

Explore este guia essencial sobre o desenvolvimento de software Java outsourcing com sucesso para aumentar a eficiência, aceder a conhecimentos especializados e impulsionar o sucesso do projeto com The Codest.

thecodest

Subscreva a nossa base de conhecimentos e mantenha-se atualizado sobre os conhecimentos do sector das TI.

    Sobre nós

    The Codest - Empresa internacional de desenvolvimento de software com centros tecnológicos na Polónia.

    Reino Unido - Sede

    • Office 303B, 182-184 High Street North E6 2JA
      Londres, Inglaterra

    Polónia - Pólos tecnológicos locais

    • Parque de escritórios Fabryczna, Aleja
      Pokoju 18, 31-564 Cracóvia
    • Embaixada do Cérebro, Konstruktorska
      11, 02-673 Varsóvia, Polónia

      The Codest

    • Início
    • Sobre nós
    • Serviços
    • Case Studies
    • Saber como
    • Carreiras
    • Dicionário

      Serviços

    • Aconselhamento
    • Desenvolvimento de software
    • Desenvolvimento backend
    • Desenvolvimento de front-end
    • Staff Augmentation
    • Programadores de back-end
    • Engenheiros de nuvem
    • Engenheiros de dados
    • Outros
    • Engenheiros de GQ

      Recursos

    • Factos e mitos sobre a cooperação com um parceiro externo de desenvolvimento de software
    • Dos EUA para a Europa: Porque é que as empresas americanas decidem mudar-se para a Europa?
    • Comparação dos centros de desenvolvimento da Tech Offshore: Tech Offshore Europa (Polónia), ASEAN (Filipinas), Eurásia (Turquia)
    • Quais são os principais desafios dos CTOs e dos CIOs?
    • The Codest
    • The Codest
    • The Codest
    • Privacy policy
    • Website terms of use

    Direitos de autor © 2026 por The Codest. Todos os direitos reservados.

    pt_PTPortuguese
    en_USEnglish de_DEGerman sv_SESwedish da_DKDanish nb_NONorwegian fiFinnish fr_FRFrench pl_PLPolish arArabic it_ITItalian jaJapanese es_ESSpanish nl_NLDutch etEstonian elGreek cs_CZCzech pt_PTPortuguese