Reindexing all tables after upgrading to PostgreSQL 13
2021-04-13
I recently upgraded my client ev.energy to PostgreSQL 13. The first feature listed in this version’s release notes is “Space savings and performance gains from de-duplication of B-tree index entries”. I reindexed all tables after the upgrade to take advantage of this deduplication and saw index storage savings of up to 90%.
What is deduplication?
Deduplication reduces storage in low-cardinality indexes - that is, indexes where one value points to many rows.
For example, imagine a table of users, indexed by country. There are many more users than countries, so the same country values point to many rows. Such a B-tree index with duplication could look like this:

|Wakanda|Sheserra|
|Wakanda|M’Junhe|
|Wakanda|Khukel|
|Wakanda|Ndazzuta|
|Wakanda|D’Vahe|
With de-duplication, PostgreSQL stores the repeated value only once in the index, using a list format for pointers to the rows:
|Wakanda|Sheserra, M’Junhe, Khukel, Ndazzuta, D’Vahe|
This smaller storage format saves space both on disk and in memory. The reduced memory cost means we can fit a larger working set in memory, improving performance.
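Deduplication applies to new B-tree indexes by default in PostgreSQL 13, and can be controlled explicitly with the deduplicate_items storage parameter. A minimal sketch, assuming a hypothetical users table with a country column:

```sql
-- Low-cardinality index: many users share each country value, so
-- PostgreSQL 13's B-tree deduplication stores each value only once.
-- deduplicate_items is on by default; shown here for illustration.
CREATE INDEX users_country_idx ON users (country)
    WITH (deduplicate_items = on);

-- Check how much space the index occupies on disk
SELECT pg_size_pretty(pg_relation_size('users_country_idx'));
```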
Reindexing all tables
After upgrading the production database to PostgreSQL 13, I wanted to reindex all tables to convert them to the new deduplicated format.
PostgreSQL allows you to reindex without locking by using its REINDEX ... CONCURRENTLY syntax. After a little playing around I created this SQL statement to generate REINDEX TABLE CONCURRENTLY statements for all tables, in ascending order of size:
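One way to write such a generator query is with the pg_tables catalog view (a sketch; the schema filter is an assumption, and the original statement may have differed):

```sql
-- Generate one REINDEX statement per table, smallest table first,
-- so problems surface on cheap operations before the big ones.
SELECT format('REINDEX TABLE CONCURRENTLY %I.%I;', schemaname, tablename)
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(
    format('%I.%I', schemaname, tablename)::regclass
);
```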
Running this gave me results like:
I could then step through and run each statement in turn. Starting with the smallest tables first gave me the confidence that I could spot any problems before committing to larger operations (the largest table was 327GB!).
I ran the REINDEX statements in TablePlus, where I could hit Command-Plus to run only the statement under the cursor.
I watched the reindexing in psql with the \di+ command, which describes all indexes matching a pattern.
This was easy since Django prefixes index names with the table name.
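For example, listing the indexes of Django’s session table by name prefix looks like this in psql:

```
\di+ django_session*
```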
I learned that when reindexing concurrently, PostgreSQL creates new indexes with the suffix _ccnew before swapping them in at the end. This let me roughly track progress, as I could see each new index on a table fill up in turn.
For example, part-way through the reindexing of the django_session table, I saw this:
I could tell the first expire_date index had completed, since its _ccnew copy had stopped increasing in size. The pkey index was being filled at that point, since its _ccnew copy was still increasing in size.
After the reindex of the table completed, the _ccnew names disappeared, as PostgreSQL had renamed the new copies on top of the old.
The database shrank from 414GB of disk usage to 348GB, a saving of 66GB, or 16%. This is despite only small savings on the largest table, 327GB of log data with just a primary key index.
The savings on smaller, hotter tables were much more impressive. Some very low-cardinality indexes shrank by over 90%.
One table whose indexes shrank a lot was the Django session table, going from 232MB index storage to 3MB. The session table has a lot of churn, and no low-cardinality indexes, so the savings here were probably due to defragmentation (removing now-empty space between entries) rather than deduplication.
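Measurements like these can be taken with pg_indexes_size, which reports the combined size of all indexes on a table. A sketch of such a before/after check:

```sql
-- Combined size of the table's indexes before the rebuild
SELECT pg_size_pretty(pg_indexes_size('django_session'));

REINDEX TABLE CONCURRENTLY django_session;

-- ...and after, to see the space reclaimed
SELECT pg_size_pretty(pg_indexes_size('django_session'));
```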
It seems that not only is reindexing a good idea after upgrading to PostgreSQL 13, but it’s also useful to run occasionally to defragment tables.
May your database run ever faster,
🎉 My book Speed Up Your Django Tests is now up to date for Django 3.2. 🎉
Tags: postgresql, python