The cost of unnecessary surrogate keys in relationship tables – Java, SQL, and jOOQ.

What is a good natural key?

This is a very difficult question for most entities when you design your schema. In some rare cases, there seems to be an “obvious” candidate, such as the codes defined by various ISO standards (country codes or currency codes, for example).

But even then, there may be exceptions, and the worst thing that can happen is a key change. Most database designs play it safe and use surrogate keys instead. Nothing wrong with that. But…

Relationship tables

There is one exception where a surrogate key is never really required: relationship tables. For example, in the Sakila database, none of the relationship tables have a surrogate key. Instead, they use their respective foreign keys as a “natural” composite primary key.

The FILM_ACTOR table, for example, is defined as follows:


CREATE TABLE film_actor (
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_pkey PRIMARY KEY (actor_id, film_id)
);

There is really no point in adding another column FILM_ACTOR_ID or ID to identify an individual row in this table, even if a lot of ORMs, and schemas not defined by an ORM, do it anyway, simply for “consistency” reasons (and in a few cases, because they cannot handle compound keys).

Now, the presence or absence of such a surrogate key is usually not very relevant in everyday work with this table. If you are using an ORM, it will likely make no difference to client code. If you are using SQL, it definitely does not: you simply never use that additional column.

But in terms of performance, it could make a huge difference!

Clustered indexes

In many RDBMS, when creating a table, you get to choose whether to use a “clustered index” or a “nonclustered index” table layout. The main difference is:

Clustered index

… is a primary key index that “clusters” data that belongs together. In other words:

  • All index column values are contained in the index tree
  • All other column values are contained in the index leaf nodes

The advantage of this table layout is that primary key lookups can be much faster, because the entire row is located in the index, which requires less disk I/O for primary key lookups than a nonclustered index does. The price for this is slower secondary index searches (e.g. searching by last name), because each matching row requires another traversal of the clustered index. The algorithmic complexities are:

  • O(log N) for primary key lookups
  • O(log N) for secondary key lookups, plus O(M log N) for projections of non-secondary key columns (a fairly high price to pay)

… where

  • N is the size of the table
  • M is the number of rows matched by the secondary key lookup

OLTP usage often benefits from clustered indexes.

Nonclustered index

… is a primary key index that resides “outside” of the table structure, which is a heap table. In other words:

  • All index column values are contained in the index tree
  • All index column values and all other column values are contained in the heap table

The advantage of this table layout is that all lookups are equally fast, regardless of whether you use a primary key lookup or a secondary key lookup. There is always an additional, constant-time heap table lookup per row. The algorithmic complexities are:

  • O(log N) for primary key lookups, plus O(1) for projections of non-primary key columns (a moderate price to pay)
  • O(log N) for secondary key lookups, plus O(M) for projections of non-secondary key columns (a moderate price to pay)

OLAP usage definitely benefits from heap tables.
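
To get a feeling for how far apart these complexities are, here is a rough Java sketch that plugs concrete numbers into the four formulas listed for the two layouts. The class and method names are hypothetical, the base-2 logarithm is only an assumed stand-in for an index traversal (real B-trees have a much higher fan-out), and all constant factors are ignored, so only the relative sizes of the results mean anything.

```java
class LookupCostModel {

    // Base-2 logarithm as an assumed stand-in for one O(log N) index traversal.
    static double log2(long n) {
        return Math.log(n) / Math.log(2);
    }

    // Clustered index, primary key lookup: the row sits in the index leaf node.
    static double clusteredPrimary(long n) {
        return log2(n);
    }

    // Clustered index, secondary key lookup: one secondary index traversal,
    // plus one clustered index traversal per matching row.
    static double clusteredSecondary(long n, long m) {
        return log2(n) + m * log2(n);
    }

    // Heap table, primary key lookup: index traversal plus one heap access.
    static double heapPrimary(long n) {
        return log2(n) + 1;
    }

    // Heap table, secondary key lookup: index traversal plus one heap access per row.
    static double heapSecondary(long n, long m) {
        return log2(n) + m;
    }

    public static void main(String[] args) {
        long n = 1_000_000L; // assumed table size
        long m = 100L;       // assumed number of rows matched by the secondary key

        System.out.printf("clustered, primary key   : %.0f%n", clusteredPrimary(n));
        System.out.printf("clustered, secondary key : %.0f%n", clusteredSecondary(n, m));
        System.out.printf("heap, primary key        : %.0f%n", heapPrimary(n));
        System.out.printf("heap, secondary key      : %.0f%n", heapSecondary(n, m));
    }
}
```

In this simplified model, the secondary key lookup on the clustered layout is the only case where the cost grows with M times log N, which is the “fairly high price” mentioned above.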

Defaults

  • MySQL’s InnoDB only offers clustered indexes.
  • MySQL’s MyISAM only offers heap tables.
  • Oracle offers both and defaults to heap tables
  • PostgreSQL uses heap tables (its CLUSTER command reorders a table by an index once, but that order is not maintained on later writes)
  • SQL Server offers both and defaults to clustered indexes

Note that Oracle calls clustered indexes “index organized tables”.

Performance

In this article, I’m checking performance on MySQL, because MySQL’s InnoDB does not let you change the table layout. Interestingly, the problems shown below could not be reproduced on PostgreSQL, as reported by reddit user /u/ForeverAlot. Details here.

With the algorithmic complexities above, we can easily guess what I’m hinting at here. In the presence of a clustered index, we should avoid expensive secondary key lookups whenever possible. Of course, these lookups cannot always be avoided, but consider the alternative designs of these two tables:


CREATE TABLE film_actor_surrogate (
  id int NOT NULL,
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_surrogate_pkey PRIMARY KEY (id)
);

CREATE TABLE film_actor_natural (
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,

  CONSTRAINT film_actor_pkey PRIMARY KEY (actor_id, film_id)
);

… we can see that if we use a clustered index here, the clustering will be based on:

  • FILM_ACTOR_SURROGATE.ID, which is a rather useless clustering
  • (FILM_ACTOR_NATURAL.ACTOR_ID, FILM_ACTOR_NATURAL.FILM_ID), which is a very useful clustering

In the latter case, whenever we look up an actor’s films, we can use the clustering index as a covering index, regardless of whether we project anything additional from this table.

In the former case, we have to rely on an additional secondary key index on (ACTOR_ID, FILM_ID), and chances are that this secondary index is not covering if we have additional projections.

Clustering by the surrogate key is really useless, because we never query the table this way.

Does it matter?

We can easily design a benchmark for this case. You can find the complete benchmark code here on GitHub, to validate the results in your own environment. The benchmark uses this database design:


create table parent_1 (id int not null primary key);
create table parent_2 (id int not null primary key);

create table child_surrogate (
  id int auto_increment, 
  parent_1_id int not null references parent_1, 
  parent_2_id int not null references parent_2, 
  payload_1 int, 
  payload_2 int, 
  primary key (id), 
  unique (parent_1_id, parent_2_id)
) -- ENGINE = MyISAM /* uncomment to use MyISAM (heap tables) */
;

create table child_natural (
  parent_1_id int not null references parent_1, 
  parent_2_id int not null references parent_2, 
  payload_1 int, 
  payload_2 int, 
  primary key (parent_1_id, parent_2_id)
) -- ENGINE = MyISAM /* uncomment to use MyISAM (heap tables) */
;

Unlike the Sakila Database, we now add a “payload” to the relationship table, which is not unlikely. Recent versions of MySQL will default to InnoDB, which only supports a clustered index layout. You can uncomment the ENGINE storage clause to see how it would work with MyISAM, which only supports heap tables.

The benchmark inserts:

  • 10,000 rows into PARENT_1
  • 100 rows into PARENT_2
  • 1,000,000 rows into both CHILD tables (simply a cross join of the above)

It then runs 5 iterations of 10,000 repetitions of the following two queries, following our standard SQL benchmark technique:


-- Query 1
SELECT c.payload_1 + c.payload_2 AS a 
FROM parent_1 AS p1 
JOIN child_surrogate AS c ON p1.id = c.parent_1_id 
WHERE p1.id = 4;

-- Query 2
SELECT c.payload_1 + c.payload_2 AS a 
FROM parent_1 AS p1 
JOIN child_natural AS c ON p1.id = c.parent_1_id 
WHERE p1.id = 4;
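
The “5 iterations of 10,000 repetitions” structure could be sketched in Java roughly as follows. This is not the benchmark code from GitHub, which may be structured quite differently. The class name, the `run` method, the millisecond timing, and the placeholder workloads are all assumptions made for illustration; a real run would execute Query 1 and Query 2 against the database instead.

```java
import java.util.List;

class BenchmarkSketch {

    // Runs every statement `repetitions` times per run and records the
    // elapsed wall-clock time in milliseconds for each (run, statement) pair.
    static long[][] run(int runs, int repetitions, List<Runnable> statements) {
        long[][] elapsedMillis = new long[runs][statements.size()];

        for (int r = 0; r < runs; r++) {
            for (int s = 0; s < statements.size(); s++) {
                long start = System.nanoTime();

                for (int i = 0; i < repetitions; i++) {
                    statements.get(s).run();
                }

                elapsedMillis[r][s] = (System.nanoTime() - start) / 1_000_000L;
                System.out.println(
                    "Run " + r + ", Statement " + (s + 1) + " : " + elapsedMillis[r][s]);
            }
        }

        return elapsedMillis;
    }

    public static void main(String[] args) {
        // Placeholder workloads (hypothetical): stand-ins for executing
        // Query 1 and Query 2, e.g. through JDBC.
        Runnable query1 = () -> Math.sqrt(42.0);
        Runnable query2 = () -> Math.sqrt(43.0);

        run(5, 10_000, List.of(query1, query2));
    }
}
```

Repeating the whole measurement several times makes it easier to spot outliers and warm-up effects than a single long run would.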

Note that MySQL does not implement join elimination. Otherwise, the unnecessary join to PARENT_1 would be eliminated. The benchmark results are quite clear:

Using InnoDB (clustered indexes)


Run 0, Statement 1 : 3104
Run 0, Statement 2 : 1910
Run 1, Statement 1 : 3097
Run 1, Statement 2 : 1905
Run 2, Statement 1 : 3045
Run 2, Statement 2 : 2276
Run 3, Statement 1 : 3589
Run 3, Statement 2 : 1910
Run 4, Statement 1 : 2961
Run 4, Statement 2 : 1897

Using MyISAM (heap tables)


Run 0, Statement 1 : 3473
Run 0, Statement 2 : 3288
Run 1, Statement 1 : 3328
Run 1, Statement 2 : 3341
Run 2, Statement 1 : 3674
Run 2, Statement 2 : 3307
Run 3, Statement 1 : 3373
Run 3, Statement 2 : 3275
Run 4, Statement 1 : 3298
Run 4, Statement 2 : 3322

Don’t read this as a comparison between InnoDB and MyISAM in general, but as a comparison of the different table structures within the boundaries of the same engine. Clearly, the additional lookup complexity of the badly clustered index in CHILD_SURROGATE makes this type of query roughly 50% to 60% slower, without gaining anything.

In the case of the heap table, the additional surrogate key column had no significant effect.
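
To put a number on the difference, the following Java sketch averages the five runs listed above for each statement and reports how much slower Statement 1 (surrogate key) is than Statement 2 (natural key). The figures are copied from the result listings; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;

class SlowdownFromResults {

    static double average(int[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    // Percentage by which the surrogate key variant is slower than the natural key variant.
    static double slowdownPercent(int[] surrogate, int[] natural) {
        return (average(surrogate) / average(natural) - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Statement 1 / Statement 2 timings, runs 0 to 4, as listed in the results
        int[] innoDbSurrogate = {3104, 3097, 3045, 3589, 2961};
        int[] innoDbNatural   = {1910, 1905, 2276, 1910, 1897};
        int[] myIsamSurrogate = {3473, 3328, 3674, 3373, 3298};
        int[] myIsamNatural   = {3288, 3341, 3307, 3275, 3322};

        System.out.printf(Locale.ROOT, "InnoDB: %.1f%% slower%n",
            slowdownPercent(innoDbSurrogate, innoDbNatural));
        System.out.printf(Locale.ROOT, "MyISAM: %.1f%% slower%n",
            slowdownPercent(myIsamSurrogate, myIsamNatural));
    }
}
```

Averaged over the five runs, the surrogate key variant comes out about 60% slower on InnoDB and less than 4% slower on MyISAM, which is in line with the observations above.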

Again, the full benchmark can be found here on GitHub, if you want to repeat it.

Conclusion

Not everyone agrees on what is generally better: clustered or nonclustered indexes. Not everyone agrees on the usefulness of surrogate keys across the board. Both are fairly opinionated discussions.

But this article made it clear that on relationship tables, which have a very clear candidate key, namely the set of outgoing foreign keys that defines the many-to-many relationship, the surrogate key not only does not add value, but it actively harms your performance on a set of queries when your table uses a clustered index.

MySQL’s InnoDB and SQL Server use clustered indexes by default, so if you are using one of these RDBMS, check whether there is significant room for improvement in dropping the surrogate keys from your relationship tables.


