Reindexing all tables after upgrading to PostgreSQL 13

2021-04-13 Non-helpful diagram.

I recently upgraded my client ev.energy to PostgreSQL 13. The first feature listed in this version’s release notes is “Space savings and performance gains from de-duplication of B-tree index entries”. I reindexed all tables after the upgrade to take advantage of this deduplication and saw index storage savings of up to 90%.

What is deduplication?

The deduplication reduces storage in low-cardinality indexes - that is, where one value points to many rows.

For example, imagine a table of users, indexed by country. There are many more users than countries, so the same country values point to many rows. Such a B-tree index with duplication could look like this:

Country User
 
Wakanda Sheserra
Wakanda M’Junhe
Wakanda Khukel
Wakanda Ndazzuta
Wakanda D’Vahe
 

With de-duplication, PostgreSQL stores the repeated value only once in the index, using a list format for pointers to the rows:

Country User
 
Wakanda Sheserra, M’Junhe, Khukel, Ndazzuta, D’Vahe
 

This smaller storage format saves space when the index is both on disk and in memory. The reduced memory cost means we can fit more a larger working set in memory, improving performance.

Reindexing all tables

After upgrading the production database to PostgreSQL 13, I wanted to reindex all tables to convert them to the new deduplicated format. PostgreSQL allows you to reindex without locking by using its REINDEX ... CONCURRENTLY syntax. After a little playing around I created this SQL statement to generate REINDEX TABLE CONCURRENTLY statements for all tables, in ascending order of size:

SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname = 'public'
  AND C.relkind = 'r'
  AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) ASC;

Running this gave me results like:

REINDEX TABLE CONCURRENTLY core_twofactordeviceprototype /*16 kB*/;
REINDEX TABLE CONCURRENTLY core_tabledatapoint /*24 kB*/;
REINDEX TABLE CONCURRENTLY core_dbcache /*24 kB*/;
REINDEX TABLE CONCURRENTLY core_table /*24 kB*/;
REINDEX TABLE CONCURRENTLY core_team /*24 kB*/;
REINDEX TABLE CONCURRENTLY django_migrations /*32 kB*/;
...

I could then step through and each one statement in turn. Starting with the smallest tables first gave me the confidence that I could spot any problems before committing to larger operations (the largest table was 327GB!).

I ran the REINDEX statements in TablePlus, where I could hit Command-Plus to run only the statement under the cursor.

I watched the reindexing on psql by with the \di+ command to describe all indexes for a table. This was easy since Django prefixes index names with the table name. I learned that when reindexing concurrently, PostgreSQL creates new indexes with the suffix _ccnew before swapping them at the end. This let me somewhat track the progress, as I could see each new index on a table fill up in turn.

For example, part-way through the reindexing of the django_session table, I saw this:

core=> \di+ django_session*
                                                        List of relations
 Schema |                   Name                         | ... | Size  | ...
--------+------------------------------------------------+-----+-------+---
 public | django_session_expire_date_a5c62663            | ... | 38 MB  |
 public | django_session_expire_date_a5c62663_ccnew      | ... | 568 kB |
 public | django_session_pkey                            | ... | 97 MB  |
 public | django_session_pkey_ccnew                      | ... | 16 kB  |
 public | django_session_session_key_c0390e0f_like       | ... | 97 MB  |
 public | django_session_session_key_c0390e0f_like_ccnew | ... | 0 MB   |
(6 rows)

I could tell the first expire_date index had completed since its _ccnew copy had stopped increasing in size. The pkey index was being filled at this point on since its _ccnew copy was increasing in size.

After completing the reindex for the table, the _ccnew names would disappear as PostgreSQL had renamed the new copies on top of the old.

Results

The database shrank from 414GB disk usage to 348GB, a saving of 66GB, or 16%. This is despite little savings in the largest table of 327GB, which is log data with only a primary key index.

The savings on smaller, hotter tables were much more impressive. Some very low cardinality indexes reached over a 90% saving.

One table whose indexes shrank a lot was the Django session table, going from 232MB index storage to 3MB. The session table has a lot of churn, and no low-cardinality indexes, so the savings here were probably due to defragmentation (removing now-empty space between entries) rather than deduplication.

It seems that not only is reindexing a good idea after upgrading to PostgreSQL 13, but it’s also useful to run occasionally to defragment tables.

Fin

May your database run ever faster,

—Adam


🎉 My book Speed Up Your Django Tests is now up to date for Django 3.2. 🎉
Buy now on Gumroad


Subscribe via RSS, Twitter, or email:

One summary email a week, no spam, I pinky promise.

Related posts:

Tags: postgresql, python