Exploring the mystery of the database from the perspective of a programmer, taking MySQL as an example

Basic Principles of Database

My understanding of DB

First: The composition of the database: storage + instance

Needless to say, data has to be stored, but storage alone is not enough. A program is needed to encapsulate the storage operations and expose APIs for inserting, deleting, updating, and querying data. That program is the instance.

One storage can correspond to multiple instances, which improves load capacity and availability; multiple storages can be distributed across different data centers and regions, which provides disaster recovery.

Second: Data is read by Block or Page

A moment's thought tells you that the database cannot read data row by row (Why?? ^_^). In fact, databases such as Oracle and MySQL schedule and manage data in physical blocks (Block or Page, I will not distinguish them here) of a fixed size (such as 16K). A Block is a database concept, so how does it map to the file system? Obviously, the database has to know the address of the block. Once the address is found, reading a fixed number of bytes from it completes the block read.

The database is very clever. It does not read only the block that was asked for; it also reads the nearby blocks and loads them into memory. This reduces the number of IOs and raises the hit rate, because the blocks near a requested block are likely to be hot data too, so this kind of processing is worthwhile!
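To make the idea concrete, here is a minimal sketch of block reads with read-ahead. It is a toy, not how MySQL actually does it: the `PageReader` class, the `READ_AHEAD` value of 4, and the in-memory "file" are all assumptions made up for illustration.

```python
import io

PAGE_SIZE = 16 * 1024   # 16K pages, as in the text
READ_AHEAD = 4          # hypothetical: how many pages one I/O brings in


class PageReader:
    """Toy page cache: one I/O loads the requested page and its neighbors."""

    def __init__(self, data_file):
        self.data_file = data_file
        self.cache = {}      # page number -> page bytes
        self.io_count = 0

    def read_page(self, page_no):
        if page_no not in self.cache:
            # Block address in the file = page number * page size.
            self.data_file.seek(page_no * PAGE_SIZE)
            chunk = self.data_file.read(PAGE_SIZE * READ_AHEAD)
            self.io_count += 1
            for i in range(0, len(chunk), PAGE_SIZE):
                self.cache[page_no + i // PAGE_SIZE] = chunk[i:i + PAGE_SIZE]
        return self.cache[page_no]


# An in-memory "file" of 8 pages; page n is filled with the byte value n.
data_file = io.BytesIO(b"".join(bytes([n]) * PAGE_SIZE for n in range(8)))
reader = PageReader(data_file)
for page_no in range(8):
    reader.read_page(page_no)
print(reader.io_count)  # 8 sequential page reads cost 2 I/Os
```

Without the read-ahead, the same loop would cost 8 I/Os, one per page.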

Third: Disk IO is the performance bottleneck of the database

There is no doubt that the data lives on disk, so disk IO cannot be avoided. I won't go into platter rotation, head seeking, and addressing. We are programmers and cannot control these. However, this process is very time-consuming, orders of magnitude slower than reading from memory, so many techniques have emerged to reduce IO and improve database performance.

For example, add memory so that the database can load more data into it. Memory is good, but it cannot be abused. Why? Suppose the database holds 100G of data. If all of it is loaded into memory, the database has to manage 100G of disk data plus 100G of memory data. Tiring, isn't it? (The database has to handle the mapping between disk and memory, the synchronization of data, and the cleaning of memory data. Once transactions are involved, this becomes a series of complex operations...) One thing worth pointing out: to speed up lookups in memory, the database generally organizes the in-memory data in a hash structure.
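A rough sketch of what "memory data organized by hash, with cleaning" could look like. The `BufferPool` class is hypothetical, and the least-recently-used eviction rule is my assumption for the example; the text above only says that a hash structure is used.

```python
from collections import OrderedDict


class BufferPool:
    """Toy buffer pool: a hash table of pages with least-recently-used eviction."""

    def __init__(self, capacity, load_page):
        self.capacity = capacity
        self.load_page = load_page      # function that reads a page from disk
        self.pages = OrderedDict()      # page number -> page contents
        self.disk_reads = 0

    def get(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)   # mark as recently used
            return self.pages[page_no]
        self.disk_reads += 1
        page = self.load_page(page_no)
        self.pages[page_no] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict the least recently used page
        return page


pool = BufferPool(capacity=2, load_page=lambda page_no: f"page-{page_no}")
for page_no in [1, 2, 1, 3, 1, 2]:
    pool.get(page_no)
print(pool.disk_reads)  # 4
```

Six page requests cost four disk reads here; the two repeated requests for page 1 are served from memory.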

For example, use indexes. Compared with memory, an index is very cost-effective. The next part introduces the principle of MySQL indexes in detail.

For example, using a disk with better performance... (it doesn’t matter to us)

Fourth, put forward some questions to think about:

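Why is deleting all the data in a table with delete slower than truncating the table?

[delete finds and removes the data row by row, which is laborious; truncate works at the Block level and releases the table's blocks directly.]

Why do we say that a small table should drive a large table?

[Is it really faster when the small table drives the large table? What the hell, aren't M*N and N*M the same? The catch is the index! With an index on the driven table, each row of the driving table costs one index search instead of a full scan, so the fewer driving rows, the less work.]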
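The small-table question can be checked with rough arithmetic. This is only a back-of-the-envelope sketch: `nested_loop_cost` is a made-up helper, and using log2 as the cost of one index search is a simplifying assumption (a real B+Tree has a much larger fan-out).

```python
import math


def nested_loop_cost(outer_rows, inner_rows, inner_indexed):
    """Rough count of row comparisons for a nested-loop join."""
    if inner_indexed:
        # One index search per outer row, roughly log2(inner_rows) steps each.
        return outer_rows * math.ceil(math.log2(inner_rows))
    # Without an index every outer row scans the whole inner table.
    return outer_rows * inner_rows


small, large = 100, 1_000_000
print(nested_loop_cost(small, large, inner_indexed=False))  # 100000000
print(nested_loop_cost(large, small, inner_indexed=False))  # 100000000
print(nested_loop_cost(small, large, inner_indexed=True))   # 2000
print(nested_loop_cost(large, small, inner_indexed=True))   # 7000000
```

Without an index the order makes no difference. With an index on the driven table, letting the small table drive is cheaper by several orders of magnitude.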

Explore the principles behind MySQL indexes

For most application systems the read-write ratio is 10:1 or even 100:1, and insert/update rarely cause performance problems. The hard part is select, so select optimization is the top priority, and an index is obviously indispensable!

Mention MySQL indexes and a whole pile of terms pops up: BTree index/B+Tree index/Hash index/clustered index/non-clustered index... so many that it makes you dizzy!

What exactly is an index, and what problem does it solve?

It's the old familiar line: the official documentation says that a MySQL index is a data structure whose purpose is to improve query efficiency.

To put it bluntly, without an index a query needs more disk IOs! So how do we reduce the number of disk IOs?

We want to filter out the final desired results by continuously narrowing the scope of the data we want to obtain, and control the number of disk IOs for each data search to a small order of magnitude, preferably a constant order of magnitude.

In order to deal with the above problems, the B+Tree index came out!

Hello, B+Tree

In MySQL, different storage engines implement indexes in different ways. Here we will focus on MyISAM and InnoDB.

B+Tree index structure of MyISAM engine

We know that for the MyISAM engine, the data file and the index file are separate. As the figure shows, a search through the index yields the physical address of the data, and the record is then located in the data file by that address. This approach is also called a "non-clustered index".

For the InnoDB engine, the data file itself is the index file! In layman's terms, MyISAM stores the physical address of the record in the leaf node, while InnoDB stores the data content itself in the leaf node. This approach is called a "clustered index".

Another point to note: for InnoDB, the leaf nodes of the primary key index store the data content, while the leaf nodes of an ordinary index store the primary key value! That is, a search on an ordinary index field in InnoDB first finds the primary key through the B+Tree of the ordinary index, and then searches the B+Tree of the primary key index. From this you can see that for InnoDB, the choice of primary key is very important!

For MyISAM, the only difference between the primary key index and an ordinary index is that primary key values are unique, so the search can stop at the first match, while an ordinary index allows duplicates, so the search has to continue after the first match. Structurally there is no difference, as shown in the figure above.
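The two lookup paths can be sketched side by side. This is only an illustration: plain dicts stand in for the B+Trees, and the table (primary key `id`, ordinary index on `name`) and its rows are made up.

```python
# Hypothetical table with primary key `id` and an ordinary index on `name`.

# MyISAM style: rows live in a data file, both indexes map a key to a row address.
myisam_data_file = ["row(id=1, name=zfz)", "row(id=2, name=tom)"]
myisam_primary = {1: 0, 2: 1}          # id   -> address in the data file
myisam_by_name = {"zfz": 0, "tom": 1}  # name -> address in the data file

# InnoDB style: the primary key index holds the rows, the ordinary index holds ids.
innodb_primary = {1: "row(id=1, name=zfz)", 2: "row(id=2, name=tom)"}
innodb_by_name = {"zfz": 1, "tom": 2}  # name -> primary key value


def myisam_find_by_name(name):
    address = myisam_by_name[name]      # one index search
    return myisam_data_file[address]    # then read the row at that address


def innodb_find_by_name(name):
    primary_key = innodb_by_name[name]  # search the ordinary index first
    return innodb_primary[primary_key]  # then search the primary key index


print(myisam_find_by_name("zfz"))  # row(id=1, name=zfz)
print(innodb_find_by_name("zfz"))  # row(id=1, name=zfz)
```

Both paths return the same row, but the InnoDB path walks two index structures, and every entry of the ordinary index carries a copy of the primary key.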

In-depth B+Tree

Ask a few questions:

Why does B+Tree put the real data in the leaf nodes instead of the inner nodes?

Why do we say that index fields should be as short as possible, preferably monotonically increasing?

Why does the composite index have the leftmost matching principle?

Range query (>,

I won't go into the mathematical theory of B+Tree, but at least one thing is certain: the amount of data in the table is N = F(h, m), where h is the height of the tree and m is the number of index entries stored in each block. For a given N, the smaller the index field, the larger m is, and therefore the smaller h is! The shorter the tree, the faster the search, of course!

If the inner nodes stored real data, m would obviously become smaller and the tree would become taller.
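To put numbers on the relation between m and h, here is a small sketch. It assumes the simplest possible form of F, namely that a tree of height h with m entries per block holds up to m ** h rows; real trees are not perfectly full, so treat the results as rough estimates.

```python
def tree_height(total_rows, keys_per_block):
    """Smallest h such that keys_per_block ** h >= total_rows."""
    height, capacity = 1, keys_per_block
    while capacity < total_rows:
        capacity *= keys_per_block
        height += 1
    return height


N = 10_000_000
print(tree_height(N, keys_per_block=1000))  # 3
print(tree_height(N, keys_per_block=100))   # 4
print(tree_height(N, keys_per_block=10))    # 7
```

Every extra level means one more block to read per search, which is why fitting more entries into each block pays off.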

In practice, we should use a short, monotonically increasing field as the primary key whenever possible. On the one hand, a short key keeps the index structure small and reduces the space the index occupies; on the other hand, monotonically increasing values are always appended at the end, so the B+Tree does not split frequently and efficiency does not drop.

For example, with a composite index (name, age, sex), the B+Tree compares name first to decide the search direction. If a query arrives with only (age, sex), it has nowhere to start. This matches common sense: of a book we say "find XXX in such-and-such chapter and section", never just "find XXX in such-and-such section"! This is an important feature of the composite index: leftmost matching.

Suppose there is a composite index (name, age, sex), and our select does not follow that order but uses sex = 'man' and name = 'zfz' and age = 27. Will the index be used? Yes. The database is smart and reorders the conditions for us during SQL optimization! But if the first column of the composite index is missing from the conditions, the database is powerless.
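Leftmost matching is easy to reproduce with a sorted list of tuples, which orders its entries the same way a composite index does. This is a sketch only: the sample rows are made up, and a sorted list with binary search stands in for the B+Tree.

```python
from bisect import bisect_left, bisect_right

# Index entries sorted by (name, age, sex), the way a composite index orders them.
index = sorted([
    ("amy", 30, "woman"),
    ("bob", 25, "man"),
    ("bob", 31, "man"),
    ("zfz", 27, "man"),
    ("zfz", 27, "woman"),
])


def find_by_name(name):
    """Leftmost column given: binary search jumps straight to the matches."""
    lo = bisect_left(index, (name,))
    hi = bisect_right(index, (name, float("inf")))
    return index[lo:hi]


def find_by_age(age):
    """Leftmost column missing: ages are scattered, so every entry is checked."""
    return [entry for entry in index if entry[1] == age]


print(find_by_name("zfz"))  # [('zfz', 27, 'man'), ('zfz', 27, 'woman')]
print(find_by_age(31))      # [('bob', 31, 'man')]
```

Both calls return correct results, but only `find_by_name` can use the ordering; `find_by_age` has to look at every entry.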

With leftmost matching, MySQL keeps matching columns to the right until it meets a range condition, and then stops. What does that mean? For example, with a composite index (name, age, sex) and the condition name = 'zhangfengzhe' and age > 26 and sex = 'man', only the name and age columns of the composite index are used; sex, which comes after the range condition, is not.
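Why matching stops at a range condition can be seen from how the entries are ordered. The rows below are made up; they are listed in the order a composite index (name, age, sex) would keep them.

```python
# Entries for name = 'zhangfengzhe', sorted by (name, age, sex) as in the index.
entries = [
    ("zhangfengzhe", 25, "man"),
    ("zhangfengzhe", 27, "man"),
    ("zhangfengzhe", 27, "woman"),
    ("zhangfengzhe", 30, "man"),
    ("zhangfengzhe", 30, "woman"),
    ("zhangfengzhe", 35, "man"),
]

# age > 26 selects one contiguous slice of the sorted entries.
in_age_range = [entry for entry in entries if entry[1] > 26]
sex_column = [entry[2] for entry in in_age_range]
print(sex_column)  # ['man', 'woman', 'man', 'woman', 'man']

# sex is only sorted within a single age value, not across the whole range,
# so sex = 'man' cannot narrow the search and has to be checked row by row.
print(sex_column == sorted(sex_column))  # False
```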

If you want to use the index, you have to be "clean"

What is "clean"? Just don't let the indexed column take part in a calculation! For example, applying a function to the indexed column may make the index unusable. Why?

It doesn't take much thought: the B+Tree stores the original column values. To compare against the result of a function, the function would have to be applied to every stored value, which is obviously too costly.
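A small sketch of the same point, with a sorted list standing in for the index. The column values and the choice of a lowercase function are assumptions for the example.

```python
from bisect import bisect_left

# Hypothetical indexed column: values are stored sorted by their raw value.
indexed_names = sorted(["Amy", "BOB", "Zfz", "bob", "tom"])


def find_exact(value):
    """Raw comparison: binary search works on the stored order."""
    pos = bisect_left(indexed_names, value)
    return pos < len(indexed_names) and indexed_names[pos] == value


def find_lowercased(value):
    """Like a condition on lower(name): the function runs on every stored entry."""
    return [name for name in indexed_names if name.lower() == value]


print(find_exact("tom"))       # True
print(find_lowercased("bob"))  # ['BOB', 'bob']
```

The two matches for "bob" are not even adjacent in the stored order, so the sorted structure gives no help once a function is wrapped around the column.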

Before building an index, look at the selectivity

Although indexes are good and cheap, don't overuse them. count(distinct col) / count(*) gives the selectivity of col. Obviously, it is 1 for the primary key. If the selectivity is too low, consider whether the index is worth building at all.
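The same ratio computed on two made-up columns, as a quick sketch (the `selectivity` helper and the sample values are only for illustration):

```python
def selectivity(values):
    """count(distinct col) / count(*) for one column."""
    return len(set(values)) / len(values)


ids = [1, 2, 3, 4, 5, 6, 7, 8]
sexes = ["man", "woman", "man", "man", "woman", "man", "woman", "man"]
print(selectivity(ids))    # 1.0
print(selectivity(sexes))  # 0.25
```

On a real table a two-valued column like sex gets close to 0 as the row count grows, which is why an index on it alone rarely pays off.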

Hash index

The point here is not to analyze the Hash index in depth, but to show that the idea of hashing really is everywhere! The MEMORY storage engine of MySQL supports hash indexes: given a key, a hash function computes its address. So under normal circumstances a hash index lookup is very fast, O(1). Hash collisions do occur, and they are resolved with a singly linked list, as in HashMap.

Think about it, does the hash index support range queries?

Obviously not; a hash index can only look up an exact key. Just like with HashMap, would finding the keys that contain "zhangfengzhe" be fast?
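A Python dict makes the point directly. It only plays the role of the hash index here; the keys and row addresses are made up.

```python
# A dict plays the role of the hash index: key -> row address (made-up values).
hash_index = {"zhangfengzhe": 101, "zhangsan": 102, "lisi": 103}

# Equality lookup: hash the key and go straight to the entry.
print(hash_index["zhangfengzhe"])  # 101

# Range or "contains" lookup: hashing destroys ordering, so scan every key.
matches = [key for key in hash_index if "zhang" in key]
print(matches)  # ['zhangfengzhe', 'zhangsan']
```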

A powerful tool for SQL optimization: explain

SQL optimization comes up in many scenarios, and the Internet is full of tips, far too many to remember!

To really solve this problem, I think the only way is to properly understand the data structures and principles behind indexes. Then, when writing SQL or chasing a slow query, we have a basis for analysis, and after verifying with the explain tool it should not be a big problem.

The output of explain tells you which indexes are used, how the tables are scanned, and so on. Here is a demo.

Data table student:

Note the composite index (age, address)

A query that satisfies leftmost prefix matching

Composite index failure

OK, that is all for now. Querying is easy, optimizing is not, so write your SQL with care!
