Reindexing all tables after upgrading to PostgreSQL 13

2021-04-13
I recently upgraded my client ev.energy to PostgreSQL 13. The first feature listed in this version’s release notes is “Space savings and performance gains from de-duplication of B-tree index entries”. I reindexed all tables after the upgrade to take advantage of this deduplication and saw index storage savings of up to 90%.
What is deduplication?
Deduplication reduces storage in low-cardinality indexes - that is, indexes where one value points to many rows.
For example, imagine a table of users, indexed by country. There are many more users than countries, so the same country values point to many rows. A B-tree index with duplication stores one entry per row, so it could look like this:

| Value | Row pointer |
|---|---|
| Wakanda | Sheserra |
| Wakanda | M’Junhe |
| Wakanda | Khukel |
| Wakanda | Ndazzuta |
| Wakanda | D’Vahe |
With de-duplication, PostgreSQL stores the repeated value only once in the index, using a list format for the pointers to the rows:

| Value | Row pointers |
|---|---|
| Wakanda | Sheserra, M’Junhe, Khukel, Ndazzuta, D’Vahe |
This smaller storage format saves space both on disk and in memory. The reduced memory cost means we can fit a larger working set in memory, improving performance.
Reindexing all tables
After upgrading the production database to PostgreSQL 13, I wanted to reindex all tables to convert them to the new deduplicated format.
PostgreSQL allows you to reindex without locking by using its `REINDEX ... CONCURRENTLY` syntax. After a little playing around, I created this SQL statement to generate `REINDEX TABLE CONCURRENTLY` statements for all tables, in ascending order of size:
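A query along these lines does the job; this is a sketch using the `pg_tables` catalog view and `pg_total_relation_size()`, and the exact statement may have differed:

```sql
-- Generate one REINDEX statement per table, smallest tables first.
SELECT format('REINDEX TABLE CONCURRENTLY %I.%I;', schemaname, tablename)
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(
    format('%I.%I', schemaname, tablename)::regclass
);
```

Ordering by `pg_total_relation_size()` (rather than `pg_relation_size()`) counts the table plus its indexes and TOAST data, so the statements come out in order of total on-disk footprint.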
Running this gave me results like:
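The table names below are illustrative rather than the real ones, but the output was a list of statements like:

```sql
REINDEX TABLE CONCURRENTLY public.django_session;
REINDEX TABLE CONCURRENTLY public.auth_user;
...
```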
I could then step through and run each statement in turn. Starting with the smallest tables gave me the confidence that I could spot any problems before committing to larger operations (the largest table was 327GB!).
I ran the `REINDEX` statements in TablePlus, where I could hit Command-Plus to run only the statement under the cursor.
I watched the reindexing in psql with the `\di+` command, which describes all indexes for a table. This was easy since Django prefixes index names with the table name.
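For example, listing all the session table’s indexes takes a single pattern, thanks to that naming convention (the pattern here is my own illustration):

```
\di+ django_session*
```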
I learned that when reindexing concurrently, PostgreSQL creates new indexes with the suffix `_ccnew` before swapping them in at the end. This let me roughly track progress, as I could see each new index on a table fill up in turn.
For example, part-way through the reindexing of the `django_session` table, I saw this:
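The listing looked something along these lines; the index names and sizes here are illustrative, not the real values:

```
 Name                                       | Table          | Size
--------------------------------------------+----------------+--------
 django_session_expire_date_a5c62663        | django_session | 49 MB
 django_session_expire_date_a5c62663_ccnew  | django_session | 49 MB
 django_session_pkey                        | django_session | 180 MB
 django_session_pkey_ccnew                  | django_session | 62 MB
```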
I could tell the first `expire_date` index had completed, since its `_ccnew` copy had stopped increasing in size. The `pkey` index was being filled at this point, since its `_ccnew` copy was still increasing in size.
After the reindex for the table completed, the `_ccnew` names disappeared, as PostgreSQL renamed the new copies on top of the old.
The database shrank from 414GB of disk usage to 348GB, a saving of 66GB, or 16%. This was despite little saving in the largest table of 327GB, which is log data with only a primary key index.
The savings on smaller, hotter tables were much more impressive. Some very low-cardinality indexes shrank by over 90%.
One table whose indexes shrank a lot was the Django session table, going from 232MB index storage to 3MB. The session table has a lot of churn, and no low-cardinality indexes, so the savings here were probably due to defragmentation (removing now-empty space between entries) rather than deduplication.
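To measure savings like these, a query over the `pg_stat_user_indexes` view can list per-index sizes. This is a sketch I’m adding for illustration, not a query from the original workflow:

```sql
-- List user indexes by on-disk size, largest first.
SELECT indexrelid::regclass AS index_name,
       relid::regclass AS table_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
```

Running it before and after a `REINDEX` shows the reclaimed space per index.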
It seems that not only is reindexing a good idea after upgrading to PostgreSQL 13, but it’s also useful to run occasionally to defragment tables.
May your database run ever faster,
Tags: postgresql, python
© 2021 All rights reserved.