Reindexing all tables after upgrading to PostgreSQL 13
I recently upgraded my client ev.energy to PostgreSQL 13. The first feature listed in this version’s release notes is “Space savings and performance gains from de-duplication of B-tree index entries”. I reindexed all tables after the upgrade to take advantage of this deduplication and saw index storage savings of up to 90%.
What is deduplication?
The deduplication reduces storage in low-cardinality indexes - that is, where one value points to many rows.
For example, imagine a table of users, indexed by country. There are many more users than countries, so the same country values point to many rows. Such a B-tree index with duplication could look like this:
With de-duplication, PostgreSQL stores the repeated value only once in the index, using a list format for pointers to the rows:
|Wakanda||Sheserra, M’Junhe, Khukel, Ndazzuta, D’Vahe|
This smaller storage format saves space when the index is both on disk and in memory. The reduced memory cost means we can fit more a larger working set in memory, improving performance.
Reindexing all tables
After upgrading the production database to PostgreSQL 13, I wanted to reindex all tables to convert them to the new deduplicated format. PostgreSQL allows you to reindex without locking by using its
REINDEX ... CONCURRENTLY syntax. After a little playing around I created this SQL statement to generate
REINDEX TABLE CONCURRENTLY statements for all tables, in ascending order of size:
SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;' FROM pg_class C LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) WHERE nspname = 'public' AND C.relkind = 'r' AND nspname !~ '^pg_toast' ORDER BY pg_total_relation_size(C.oid) ASC;
Running this gave me results like:
REINDEX TABLE CONCURRENTLY core_twofactordeviceprototype /*16 kB*/; REINDEX TABLE CONCURRENTLY core_tabledatapoint /*24 kB*/; REINDEX TABLE CONCURRENTLY core_dbcache /*24 kB*/; REINDEX TABLE CONCURRENTLY core_table /*24 kB*/; REINDEX TABLE CONCURRENTLY core_team /*24 kB*/; REINDEX TABLE CONCURRENTLY django_migrations /*32 kB*/; ...
I could then step through and each one statement in turn. Starting with the smallest tables first gave me the confidence that I could spot any problems before committing to larger operations (the largest table was 327GB!).
I ran the
REINDEX statements in TablePlus, where I could hit Command-Plus to run only the statement under the cursor.
I watched the reindexing on
psql by with the
\di+ command to describe all indexes for a table. This was easy since Django prefixes index names with the table name. I learned that when reindexing concurrently, PostgreSQL creates new indexes with the suffix
_ccnew before swapping them at the end. This let me somewhat track the progress, as I could see each new index on a table fill up in turn.
For example, part-way through the reindexing of the
django_session table, I saw this:
core=> \di+ django_session* List of relations Schema | Name | ... | Size | ... --------+------------------------------------------------+-----+-------+--- public | django_session_expire_date_a5c62663 | ... | 38 MB | public | django_session_expire_date_a5c62663_ccnew | ... | 568 kB | public | django_session_pkey | ... | 97 MB | public | django_session_pkey_ccnew | ... | 16 kB | public | django_session_session_key_c0390e0f_like | ... | 97 MB | public | django_session_session_key_c0390e0f_like_ccnew | ... | 0 MB | (6 rows)
I could tell the first
expire_date index had completed since its
_ccnew copy had stopped increasing in size. The
pkey index was being filled at this point on since its
_ccnew copy was increasing in size.
After completing the reindex for the table, the
_ccnew names would disappear as PostgreSQL had renamed the new copies on top of the old.
The database shrank from 414GB disk usage to 348GB, a saving of 66GB, or 16%. This is despite little savings in the largest table of 327GB, which is log data with only a primary key index.
The savings on smaller, hotter tables were much more impressive. Some very low cardinality indexes reached over a 90% saving.
One table whose indexes shrank a lot was the Django session table, going from 232MB index storage to 3MB. The session table has a lot of churn, and no low-cardinality indexes, so the savings here were probably due to defragmentation (removing now-empty space between entries) rather than deduplication.
It seems that not only is reindexing a good idea after upgrading to PostgreSQL 13, but it’s also useful to run occasionally to defragment tables.
Make your development more pleasant with Boost Your Django DX.
One summary email a week, no spam, I pinky promise.
Tags: postgresql, python