When working with the Ruby on Rails framework, we usually deal with relational databases such as MySQL or PostgreSQL. When defining migrations with Active Record Migrations, we come across so-called indexes, but beginners often do not quite understand what indexes are and what benefits they bring.
In this post, I would like to explain what indexes are, what they are used for and present some good practices on how to employ them.
Database
There are many database engines; some of the most popular are the previously mentioned MySQL and PostgreSQL, as well as Oracle and Microsoft SQL Server. They are all relational databases, which means that pieces of data are related to each other and stored in tables. Each table row is called a record, and each record has its own unique identifier (id). You can check the ranking of the most popular database engines at https://db-engines.com/en/ranking. You will also find some non-relational databases there, such as MongoDB.
Creating an Index
The tables in our databases can have from just a few to several dozen (in extreme cases, up to several hundred) columns. Keep in mind that each table can have an unlimited number of rows; this number does not follow from the structure of the database, and we should always assume that the number of records will grow steadily and, as a consequence, so will our database. Initial assumptions and the queries written into an existing application may work well for a small or medium number of records, but over time, as more data arrives, the application's communication with the database stops being efficient.
The programmer’s role is to write queries that retrieve data from one or more tables, but the optimal way of processing a query depends on the database engine. Remember that database engines load data from disk into memory and then scan it. This means that if many users perform complex operations at the same time, some of them will have to wait their turn because there are not enough resources to carry out their searches. That is why well-chosen indexes are so important.
Wiki: Index – a data structure that increases the speed of performing search operations on a table.
For each index, we need to define keys (on one or many columns) that will be used to search for records in the table. The data in the index is sorted by the previously defined key, which significantly speeds up the search for data in the table. The simplest everyday example is a phone book, in which people are sorted by first and last name. It can be said that in this case our index key is the first and last name.
How do you choose the best index key? It is not difficult – just remember a few rules. Create an index based on columns that:
– will often be used in our queries (WHERE clauses),
– in combination with each other give a unique value (i.e. one that will indicate exactly one row),
– will be used as so-called connecting columns (JOIN),
– give the most selective keys, i.e. those that match the smallest number of rows for a query.
If we already know which keys will be optimal for our table, we can also ask ourselves how many indexes we need. In this case, it is best to know at the design stage which queries will refer to our table.
Let’s create indexes for specific queries that will appear, but do not write them for each column. Indexes, like tables, need to be stored somewhere, so when we create tables with an index for each column, we must take into account that the amount of space used can increase significantly.
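To make these rules concrete, here is a minimal sketch of a migration that adds one composite index instead of several single-column ones. The orders table and its user_id and status columns are hypothetical, standing in for columns that appear together in your WHERE clauses and joins:

# frozen_string_literal: true

class AddUserStatusIndexToOrders < ActiveRecord::Migration[6.0]
  def change
    # A single composite index serves queries that filter on both columns,
    # instead of indexing every column separately.
    add_index :orders, [:user_id, :status]
  end
end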
Create Unique Index
Another issue that we need to think about is uniqueness. It is worth spending the extra five minutes thinking about whether our index is really unique. In this way, we tell the query optimizer that it does not have to expect duplicates in the results. For example, email addresses:
# frozen_string_literal: true

class CreateUsers < ActiveRecord::Migration[6.0]
  def change
    create_table :users do |t|
      t.string :email, null: false
    end

    add_index :users, :email, unique: true
  end
end
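In a typical Rails application, the database-level unique index is usually paired with a model-level uniqueness validation: the validation produces readable error messages, while the index remains the final guarantee against duplicates (for example under concurrent requests). A minimal sketch of such a model, matching the migration above:

class User < ApplicationRecord
  # Application-level check for readable validation errors;
  # the unique index from the migration still guards against
  # duplicates slipping through concurrent requests.
  validates :email, presence: true, uniqueness: true
end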
Using the PostgreSQL engine as an example, I will show the difference in query speed on the email column with a unique index and without an index.
1. You can run the sample code snippets on your own database to test the example below. First, let’s create an empty table with one column:
CREATE TABLE users (
email varchar
);
2. Let’s generate 10,000 records for the test:
DO $$
BEGIN
  FOR i IN 1..10000 LOOP
    INSERT INTO users VALUES ('user' || i || '@example.com');
  END LOOP;
END;
$$;
3. We will use EXPLAIN ANALYZE to check how fast our query will be processed when we want to find a specific user in the database.
EXPLAIN ANALYZE SELECT email FROM users WHERE email = 'user890@example.com';
Our query forced an iteration over the entire table in search of the record that interests us. This process is called sequential scanning. Since there is no index yet, reading the entire table and filtering out particular rows is the best way to do the job: PostgreSQL filters out the unneeded rows and simply returns the ones that interest us. Sequential scanning is not always bad; there are cases where it is the ideal strategy.
4. Now it is time to check the same query on a table that has a UNIQUE INDEX. Let’s create the index and execute the query.
CREATE UNIQUE INDEX index_email ON users(email);
EXPLAIN ANALYZE SELECT email FROM users WHERE email = 'user890@example.com';
This time PostgreSQL took advantage of index scanning because all the columns needed are already in the index.
Selecting only a few rows will be very efficient when using the index. However, if more data is selected, scanning both the index and the table will be too time-consuming.
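If you prefer to check query plans without leaving Rails, Active Record relations respond to explain, which asks the database for its EXPLAIN output. A minimal sketch from the Rails console, assuming the User model from the earlier migration:

# Prints PostgreSQL's query plan for this relation, so you can verify
# that the unique index on email is actually being used.
puts User.where(email: "user890@example.com").explain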
Summary
As you can see, the execution time for a query on a column with an index is much shorter (in the example shown it drops from 1.267 ms to 0.111 ms, as much as 91.24%!). The most important difference is the way PostgreSQL searches for the record that interests us. In the first case, the database engine had to search the entire table for the record we needed. In the second, the index structure is sorted and unique, so the engine knew where the record was located, which significantly sped up query processing.
In the case of large databases and very complex queries, correctly chosen indexes can significantly speed up your application without having to upgrade the machine the database runs on.
It is worth remembering that creating indexes on every column is not a good practice. Established indexes will speed up the optimizer’s work when searching for the data of interest, but at the same time they slow down inserting new records and updating existing ones.
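Conversely, if you find an index that none of your queries actually use, it is worth dropping it in a later migration, since it only costs disk space and slows down writes. A minimal sketch, with a hypothetical index on a users column:

# frozen_string_literal: true

class RemoveUnusedIndexFromUsers < ActiveRecord::Migration[6.0]
  def change
    # Hypothetical example: drop an index that no query benefits from.
    remove_index :users, column: :nickname
  end
end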