Elasticsearch is a search engine built on a trusted and mature library – Apache Lucene. Heavy activity in the project's git repository and deployments at companies such as GitHub, SoundCloud, Stack Overflow and LinkedIn testify to its great popularity. The "Elastic" part says it all about the nature of the system, whose capabilities are enormous: from simple file search on a small scale, through knowledge discovery, to real-time big data analysis.

What makes Elastic more powerful than the competition is its set of default configurations and behaviors, which let you create a cluster and start adding documents to an index within a couple of minutes. Elastic will configure the cluster for you, define an index, infer field types from the first document it receives, and when you add another server, it will automatically take care of dividing the index data between servers.

Unfortunately, all this automation hides what the default settings actually imply, and they often turn out to be misleading. This article starts a series in which I will cover the most popular gotchas you might encounter when building an Elastic-based app.
The number of shards cannot be changed
Let’s index the first document using index API:
$ curl -XPUT 'http://localhost:9200/myindex/employee/1' -d '{
"first_name" : "Jane",
"last_name" : "Smith",
"steet_number": 12
}'
At this moment, Elastic creates an index for us named myindex. What is not visible here is the number of shards assigned to the index. A shard can be understood as an individual process responsible for indexing, storing and searching some portion of the documents of the whole index. During indexing, Elastic decides which shard a document should end up in, based on the following formula:
shard = hash(document_id) % number_of_primary_shards
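For intuition, here is a toy illustration of that routing rule in the shell. This is a sketch only – Elastic uses its own internal hash function rather than cksum – but the modulo principle is the same:
# toy example: hash the document id and take it modulo the shard count
document_id=1
number_of_primary_shards=5
hash=$(printf '%s' "$document_id" | cksum | cut -d' ' -f1)
echo $(( hash % number_of_primary_shards ))   # the shard this document would land in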
It is now clear why the number of primary shards cannot be changed for an index that already contains documents. So, before indexing the first document, always create the index manually, specifying the number of shards you consider sufficient for the volume of data to be indexed:
$ curl -XPUT 'http://localhost:9200/myindex/' -d '{
"settings" : {
"number_of_shards" : 10
}
}'
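To double-check what was actually applied, the index settings API returns the stored settings for the index created above:
$ curl -XGET 'http://localhost:9200/myindex/_settings'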
The default value for number_of_shards is 5. This means the index can be scaled out to up to 5 servers that collect data during indexing. For a production environment, the number of shards should be set according to the expected indexing rate and document size. For development and testing environments, I recommend setting the value to 1 – why? That will be explained in the next paragraph of this article.
Sorting the text search results with a relatively small number of documents
When we search for documents matching a phrase:
$ curl -XGET 'http://localhost:9200/myindex/my_type/_search' -d '{
"query": {
"match": {
"title": "The quick brown fox"
}
}
}'
Put simply, Elastic processes a text search in a few steps:
- the phrase from the request is converted into the same form the documents were indexed in; in our case this will be the set of terms [“quick”, “brown”, “fox”] (“the” is dropped as insignificant) – see the _analyze sketch below this list,
- the index is scanned for documents that contain at least one of the searched terms,
- every matching document is scored for its relevance to the search phrase,
- the results are sorted by the calculated relevance and the first page of results is returned to the user.
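To see the first step in action, you can ask Elastic how it tokenizes a phrase with the _analyze API. The sketch below assumes the english analyzer, which removes stopwords such as “the”; your index may be configured with a different analyzer, and newer Elasticsearch versions expect a JSON body instead of query parameters:
$ curl -XGET 'http://localhost:9200/_analyze?analyzer=english' -d 'The quick brown fox'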
In the third step, the following values (among others) are taken into account:
- how many words from the search phrase are in the document
- how often a given word occurs in a document (TF – term frequency)
- whether and how often the matching words occur in other documents (IDF – inverse document frequency) – the more popular the word in other documents, the less significant
- how long the document is
The behavior of IDF is important to us here. For performance reasons, Elastic does not calculate this value across every document in the index – instead, each shard (index worker) calculates its local IDF and uses it for sorting. Therefore, when searching an index with a low number of documents, we may obtain substantially different results depending on the number of shards in the index and the distribution of documents among them.
Let's imagine that we have 2 shards in an index; the first one holds 8 documents indexed with the word "fox", and the second only 2 documents with the same word. As a result, the local IDF of "fox" will differ significantly between the two shards, and this may produce incorrect relevance ordering. Therefore, for development purposes, create an index consisting of only one primary shard:
$ curl -XPUT 'http://localhost:9200/myindex/' -d '{"settings" : { "number_of_shards" : 1 } }'
Viewing the results of “far” search pages kills your cluster
As I wrote in the previous paragraphs, documents in an index are divided between completely independent index processes – shards. Each process is fully independent and deals only with the documents assigned to it.
When we search an index containing millions of documents and want to obtain the top 10 results, every shard must return its 10 best-matched results to the cluster node that initiated the search. The responses from all shards are then merged and the top 10 results (across the whole index) are selected. This approach allows the search process to be distributed efficiently across many servers.
Let's imagine that our app displays 50 results per page, with no restriction on the number of pages a user can view. Remember that our index consists of 10 primary shards (1 per server).
Let's see what acquiring the search results looks like for the 1st and the 100th page:
Page No. 1 of search results:
- The node which receives a query (controller) passes it on to 10 shards.
- Every shard returns its 50 best matching documents sorted by relevance.
- After the responses have been received from every shard, the controller merges the results (500 documents).
- Our results are the top 50 documents from the previous step.
Page No. 100 of search results:
- The node which receives a query (controller) passes it on to 10 shards.
- Every shard returns its 5000 best matching documents sorted by relevance.
- After receiving responses from every shard, the controller merges the results (50000 documents).
- Our results are the documents from the previous step at positions 4951 – 5000.
Assuming that one document is 1KB in size, in the second case it means that ~50MB of data must be sent and processed around the cluster, in order to view 100 results for one user.
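For reference, the page-100 request above is expressed with the from and size parameters – a sketch assuming 50 results per page and the same index and type names as before:
$ curl -XGET 'http://localhost:9200/myindex/my_type/_search' -d '{
"from": 4950,
"size": 50,
"query": {
"match": {
"title": "The quick brown fox"
}
}
}'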
It's not hard to notice that network traffic and index load increase significantly with each successive result page. That's why it is not recommended to expose the "far" search pages to the user. If our index is well configured, the user should find the result they're interested in on the first few pages, and we protect our cluster from unnecessary load. To see this rule in practice, check how many search result pages the most popular web search engines allow you to view.
It is also interesting to observe how browser response time grows for successive search result pages. For example, below are response times for individual result pages in Google Search (the search term was "search engine"):
| Search result page (10 documents per page) | Response time |
|---------------------------------------------|---------------|
| 1 | 250ms |
| 10 | 290ms |
| 20 | 350ms |
| 30 | 380ms |
| 38 (last one available) | |
In the next part, I will look closer into the problems regarding document indexing.