Reindexing all tables after upgrading to PostgreSQL 13

2021-04-13 Non-helpful diagram.

I recently upgraded my client ev.energy to PostgreSQL 13. The first feature listed in this version’s release notes is “Space savings and performance gains from de-duplication of B-tree index entries”. I reindexed all tables after the upgrade to take advantage of this deduplication and saw index storage savings of up to 90%.

What is deduplication?

The deduplication reduces storage in low-cardinality indexes - that is, where one value points to many rows.

For example, imagine a table of users, indexed by country. There are many more users than countries, so the same country values point to many rows. Such a B-tree index with duplication could look like this:

Country	User
…
Wakanda	Sheserra
Wakanda	M’Junhe
Wakanda	Khukel
Wakanda	Ndazzuta
Wakanda	D’Vahe
…

With de-duplication, PostgreSQL stores the repeated value only once in the index, using a list format for pointers to the rows:

Country	User
…
Wakanda	Sheserra, M’Junhe, Khukel, Ndazzuta, D’Vahe
…

This smaller storage format saves space when the index is both on disk and in memory. The reduced memory cost means we can fit more a larger working set in memory, improving performance.

Reindexing all tables

After upgrading the production database to PostgreSQL 13, I wanted to reindex all tables to convert them to the new deduplicated format. PostgreSQL allows you to reindex without locking by using its REINDEX ... CONCURRENTLY syntax. After a little playing around I created this SQL statement to generate REINDEX TABLE CONCURRENTLY statements for all tables, in ascending order of size:

SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname = 'public'
  AND C.relkind = 'r'
  AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) ASC;

Running this gave me results like:

REINDEX TABLE CONCURRENTLY core_twofactordeviceprototype /*16 kB*/;
REINDEX TABLE CONCURRENTLY core_tabledatapoint /*24 kB*/;
REINDEX TABLE CONCURRENTLY core_dbcache /*24 kB*/;
REINDEX TABLE CONCURRENTLY core_table /*24 kB*/;
REINDEX TABLE CONCURRENTLY core_team /*24 kB*/;
REINDEX TABLE CONCURRENTLY django_migrations /*32 kB*/;
...

I could then step through and each one statement in turn. Starting with the smallest tables first gave me the confidence that I could spot any problems before committing to larger operations (the largest table was 327GB!).

I ran the REINDEX statements in TablePlus, where I could hit Command-Plus to run only the statement under the cursor.

I watched the reindexing on psql by with the \di+ command to describe all indexes for a table. This was easy since Django prefixes index names with the table name. I learned that when reindexing concurrently, PostgreSQL creates new indexes with the suffix _ccnew before swapping them at the end. This let me somewhat track the progress, as I could see each new index on a table fill up in turn.

For example, part-way through the reindexing of the django_session table, I saw this:

core=> \di+ django_session*
                                                        List of relations
 Schema |                   Name                         | ... | Size  | ...
--------+------------------------------------------------+-----+-------+---
 public | django_session_expire_date_a5c62663            | ... | 38 MB  |
 public | django_session_expire_date_a5c62663_ccnew      | ... | 568 kB |
 public | django_session_pkey                            | ... | 97 MB  |
 public | django_session_pkey_ccnew                      | ... | 16 kB  |
 public | django_session_session_key_c0390e0f_like       | ... | 97 MB  |
 public | django_session_session_key_c0390e0f_like_ccnew | ... | 0 MB   |
(6 rows)

I could tell the first expire_date index had completed since its _ccnew copy had stopped increasing in size. The pkey index was being filled at this point on since its _ccnew copy was increasing in size.

After completing the reindex for the table, the _ccnew names would disappear as PostgreSQL had renamed the new copies on top of the old.

Results

The database shrank from 414GB disk usage to 348GB, a saving of 66GB, or 16%. This is despite little savings in the largest table of 327GB, which is log data with only a primary key index.

The savings on smaller, hotter tables were much more impressive. Some very low cardinality indexes reached over a 90% saving.

One table whose indexes shrank a lot was the Django session table, going from 232MB index storage to 3MB. The session table has a lot of churn, and no low-cardinality indexes, so the savings here were probably due to defragmentation (removing now-empty space between entries) rather than deduplication.

It seems that not only is reindexing a good idea after upgrading to PostgreSQL 13, but it’s also useful to run occasionally to defragment tables.

Fin

May your database run ever faster,

—Adam

😸😸😸 Check out my new book on using GitHub effectively, Boost Your GitHub DX! 😸😸😸

One summary email a week, no spam, I pinky promise.

Related posts:

Tags: postgresql, python