Reindexing all tables after upgrading to PostgreSQL 13

I recently upgraded my client ev.energy to PostgreSQL 13. The first feature listed in this version’s release notes is “Space savings and performance gains from de-duplication of B-tree index entries”. I reindexed all tables after the upgrade to take advantage of this deduplication and saw index storage savings of up to 90%.
What is deduplication?
The deduplication reduces storage in low-cardinality indexes - that is, where one value points to many rows.
For example, imagine a table of users, indexed by country. There are many more users than countries, so the same country values point to many rows. Such a B-tree index with duplication could look like this:
Country | User |
---|---|
… | |
Wakanda | Sheserra |
Wakanda | M’Junhe |
Wakanda | Khukel |
Wakanda | Ndazzuta |
Wakanda | D’Vahe |
… |
With de-duplication, PostgreSQL stores the repeated value only once in the index, using a list format for pointers to the rows:
Country | User |
---|---|
… | |
Wakanda | Sheserra, M’Junhe, Khukel, Ndazzuta, D’Vahe |
… |
This smaller storage format saves space when the index is both on disk and in memory. The reduced memory cost means we can fit more a larger working set in memory, improving performance.
Reindexing all tables
After upgrading the production database to PostgreSQL 13, I wanted to reindex all tables to convert them to the new deduplicated format. PostgreSQL allows you to reindex without locking by using its REINDEX ... CONCURRENTLY
syntax. After a little playing around I created this SQL statement to generate REINDEX TABLE CONCURRENTLY
statements for all tables, in ascending order of size:
SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname = 'public'
AND C.relkind = 'r'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) ASC;
Running this gave me results like:
REINDEX TABLE CONCURRENTLY core_twofactordeviceprototype /*16 kB*/;
REINDEX TABLE CONCURRENTLY core_tabledatapoint /*24 kB*/;
REINDEX TABLE CONCURRENTLY core_dbcache /*24 kB*/;
REINDEX TABLE CONCURRENTLY core_table /*24 kB*/;
REINDEX TABLE CONCURRENTLY core_team /*24 kB*/;
REINDEX TABLE CONCURRENTLY django_migrations /*32 kB*/;
...
I could then step through and each one statement in turn. Starting with the smallest tables first gave me the confidence that I could spot any problems before committing to larger operations (the largest table was 327GB!).
I ran the REINDEX
statements in TablePlus, where I could hit Command-Plus to run only the statement under the cursor.
I watched the reindexing on psql
by with the \di+
command to describe all indexes for a table. This was easy since Django prefixes index names with the table name. I learned that when reindexing concurrently, PostgreSQL creates new indexes with the suffix _ccnew
before swapping them at the end. This let me somewhat track the progress, as I could see each new index on a table fill up in turn.
For example, part-way through the reindexing of the django_session
table, I saw this:
core=> \di+ django_session*
List of relations
Schema | Name | ... | Size | ...
--------+------------------------------------------------+-----+-------+---
public | django_session_expire_date_a5c62663 | ... | 38 MB |
public | django_session_expire_date_a5c62663_ccnew | ... | 568 kB |
public | django_session_pkey | ... | 97 MB |
public | django_session_pkey_ccnew | ... | 16 kB |
public | django_session_session_key_c0390e0f_like | ... | 97 MB |
public | django_session_session_key_c0390e0f_like_ccnew | ... | 0 MB |
(6 rows)
I could tell the first expire_date
index had completed since its _ccnew
copy had stopped increasing in size. The pkey
index was being filled at this point on since its _ccnew
copy was increasing in size.
After completing the reindex for the table, the _ccnew
names would disappear as PostgreSQL had renamed the new copies on top of the old.
Results
The database shrank from 414GB disk usage to 348GB, a saving of 66GB, or 16%. This is despite little savings in the largest table of 327GB, which is log data with only a primary key index.
The savings on smaller, hotter tables were much more impressive. Some very low cardinality indexes reached over a 90% saving.
One table whose indexes shrank a lot was the Django session table, going from 232MB index storage to 3MB. The session table has a lot of churn, and no low-cardinality indexes, so the savings here were probably due to defragmentation (removing now-empty space between entries) rather than deduplication.
It seems that not only is reindexing a good idea after upgrading to PostgreSQL 13, but it’s also useful to run occasionally to defragment tables.
Make your development more pleasant with Boost Your Django DX.
One summary email a week, no spam, I pinky promise.
Related posts:
Tags: postgresql, python