Elasticsearch is a search engine based on a trusted and mature library: Apache Lucene. Heavy activity in the project's git repository and its adoption by projects such as GitHub, SoundCloud, Stack Overflow and LinkedIn testify to its popularity. The “Elastic” part says it all about the nature of the system, whose capabilities are enormous: from simple file search on a small scale, through knowledge discovery, to big data analysis in real time.

What makes Elastic more powerful than the competition is its set of default configurations and behaviors, which let you create a cluster and start adding documents to the index in a couple of minutes. Elastic will configure the cluster for you, create an index and define the field types based on the first document it receives, and when you add another server, it will automatically distribute the index data between servers.

Unfortunately, this automation hides what the default settings imply, and that is often misleading. This article starts a series in which I will deal with the most popular gotchas you might encounter when building an Elastic-based app.
Let’s index the first document using the index API:
$ curl -XPUT 'http://localhost:9200/myindex/employee/1' -d '{
"first_name" : "Jane",
"last_name" : "Smith",
"street_number": 12
}'
At this point, Elastic creates an index for us, named myindex. What is not visible here is the number of shards assigned to the index. Shards can be understood as individual processes responsible for indexing, storing and searching a part of the documents of the whole index. When indexing a document, Elastic decides which shard the document should go to, based on the following formula:
shard = hash(document_id) % number_of_primary_shards
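It is now clear that the number of primary shards cannot be changed for an index that already contains documents: with a different divisor, the formula would point to a different shard than the one in which a document was stored. So, before indexing the first document, always create the index manually, specifying a number of shards that you think is sufficient for the expected volume of indexed data: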
$ curl -XPUT 'http://localhost:9200/myindex/' -d '{
"settings" : {
"number_of_shards" : 10
}
}'
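To illustrate why the routing formula above ties an index to its shard count, here is a minimal Python sketch. It assumes `zlib.crc32` as a stand-in hash (Elastic's actual hash function is different) and uses made-up document ids `1` to `1000`; the helper name `shard_for` is hypothetical.

```python
import zlib


def shard_for(document_id: str, number_of_primary_shards: int) -> int:
    """Apply shard = hash(document_id) % number_of_primary_shards."""
    # zlib.crc32 is only a stand-in here; it is not Elastic's real hash function.
    return zlib.crc32(document_id.encode("utf-8")) % number_of_primary_shards


# Made-up document ids, for illustration only.
document_ids = [str(i) for i in range(1, 1001)]

# Documents whose computed shard changes when 5 shards become 10.
moved = [d for d in document_ids if shard_for(d, 5) != shard_for(d, 10)]

print(f"{len(moved)} of {len(document_ids)} documents map to a different shard")
```

Every document in `moved` would be looked up in a shard that never stored it, which is why the shard count has to be chosen before the first document is indexed.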
The default value for number_of_shards is 5. This means that the index can be scaled out to at most 5 servers that collect data during indexing. For the production environment, set the number of shards depending on the expected indexing frequency and the size of documents. For development and testing environments, I recommend setting the value to 1. Why? It is explained in the next section of this article.
When we search for a document with a phrase:
$ curl -XGET 'http://localhost:9200/myindex/my_type/_search' -d '{
"query": {
"match": {
"title": "The quick brown fox"
}
}
}'
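Simply put, Elastic processes a text search in a few steps. As part of this, the phrase is converted into tokens:
[“quick”, “brown”, “fox”]
(“the” is removed because it’s insignificant). In the third step, when the results are sorted, several values are taken into account, among them IDF (inverse document frequency).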
The way IDF works is important to us. For performance reasons, Elastic does not calculate this value across all documents in the index. Instead, every shard (index worker) calculates its own local IDF and uses it for sorting. Therefore, when searching an index with a small number of documents, we may get substantially different results depending on the number of shards in the index and the distribution of documents.
Let’s imagine that we have 2 shards in an index; in the first one there are 8 documents indexed with the word “fox”, and in the second one only 2 documents with the same word. As a result, the local IDF of the word “fox” will differ significantly between the two shards, and this may produce incorrect results. Therefore, for development purposes, create an index consisting of only one primary shard:
$ curl -XPUT 'http://localhost:9200/myindex/' -d '{"settings" : { "number_of_shards" : 1 } }'
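The two-shard example can be put into numbers with a short Python sketch. Two things here are assumptions and not taken from the article: each shard is assumed to hold 10 documents in total, and the IDF formula is the classic Lucene TF-IDF one, idf = 1 + ln(numDocs / (docFreq + 1)). The similarity your cluster actually uses depends on its version and configuration.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Assumed shard size (hypothetical): 10 documents in each shard.
DOCS_PER_SHARD = 10

idf_shard_1 = classic_idf(doc_freq=8, num_docs=DOCS_PER_SHARD)  # "fox" is common here
idf_shard_2 = classic_idf(doc_freq=2, num_docs=DOCS_PER_SHARD)  # "fox" is rare here
idf_global = classic_idf(doc_freq=8 + 2, num_docs=2 * DOCS_PER_SHARD)

print(f"shard 1: {idf_shard_1:.3f}")
print(f"shard 2: {idf_shard_2:.3f}")
print(f"whole index: {idf_global:.3f}")
```

Under these assumptions, the same word weighs about twice as much in the second shard as in the first, so documents from the second shard tend to be ranked higher than they would be with a single, index-wide IDF.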
As I wrote in the previous paragraphs, the documents in an index are split between fully separate index processes: shards. Every process is completely independent and deals only with the documents assigned to it.
When we search an index with millions of documents and want the top 10 results, every shard must return its 10 best-matched results to the cluster node that initiated the search. Then the responses from every shard are merged and the top 10 search results (within the whole index) are chosen. This approach makes it possible to distribute the search process efficiently between many servers.
Let’s imagine that our app shows 50 results per page, with no restriction on the number of pages a user can view. Remember that our index consists of 10 primary shards (1 per server).
Let’s see what fetching the search results looks like for the 1st and the 100th page:
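Page No. 1 of search results: every shard returns its 50 best-matched documents, so the node that initiated the search receives 500 documents and keeps the top 50.
Page No. 100 of search results: every shard must return its 5,000 best-matched documents, so the node that initiated the search receives 50,000 documents and keeps only 50 of them.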
Assuming that one document is 1KB in size, in the second case ~50MB of data must be sent and processed around the cluster just to show one page of 50 results to one user.
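The arithmetic behind these figures can be sketched in a few lines of Python. The function name `deep_paging_cost` is hypothetical, and the model is deliberately simplified: it assumes that every shard returns full documents for all of its top `page * page_size` hits, and it counts 1 MB as 1000 KB.

```python
def deep_paging_cost(page, page_size, shards, doc_size_kb=1.0):
    """Return (documents per shard, documents gathered, approximate MB moved)."""
    # To serve page N, each shard has to return its top N * page_size hits.
    docs_per_shard = page * page_size
    # The node that initiated the search gathers the hits from all shards.
    docs_gathered = docs_per_shard * shards
    megabytes = docs_gathered * doc_size_kb / 1000  # 1 MB counted as 1000 KB
    return docs_per_shard, docs_gathered, megabytes


for page in (1, 100):
    per_shard, gathered, mb = deep_paging_cost(page, page_size=50, shards=10)
    print(f"page {page}: {per_shard} docs per shard, {gathered} gathered, ~{mb:g} MB")
```

The cost grows linearly with the page number, while the number of results shown to the user stays at 50.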
It’s easy to see that network traffic and index load increase significantly with each successive result page. That’s why it is not recommended to make the “far” search pages available to the user. If our index is well configured, then users should find the result they are interested in on the first search pages, and we protect our cluster from unnecessary load. To confirm this rule, check how many search result pages the most popular web search engines allow you to view.
It is also interesting to observe how the response time in the browser changes for successive search result pages. For example, below are the response times for individual search result pages in Google Search (the search term was “search engine”):
| Search result page (10 documents per page) | Response time |
|--------------------------------------------|---------------|
| 1 | 250ms |
| 10 | 290ms |
| 20 | 350ms |
| 30 | 380ms |
| 38 (last one available) | |
In the next part, I will look closer into the problems regarding document indexing.