How to migrate from Django’s PostgreSQL CI Fields to use a case-insensitive collation

2023-02-23 This migration is a little bit fiddly...

If you upgrade to Django 4.2, you may see system check warnings like:

example.User.email: (fields.W906) django.contrib.postgres.fields.CIEmailField is deprecated. Support for it (except in historical migrations) will be removed in Django 5.1.
    HINT: Use EmailField(db_collation="…") with a case-insensitive non-deterministic collation instead.

In this post we’ll cover why this has changed, and how to perform the necessary migration. You can actually migrate from Django 3.2, and may want to, because the alternative uses more accurate and flexible Unicode rules.

What changed, why

As covered in the miscellaneous release notes, three field classes and a mixin are deprecated:

CICharField
CIEmailField
CITextField
CIText mixin

The CI* fields provide Case-Insensitive versions of their vanilla counterparts. These use the citext module built-in to PostgreSQL. But, since version 12, PostgreSQL discourages using citext, stating:

Consider using nondeterministic collations instead of this module. They can be used for case-insensitive comparisons, accent-insensitive comparisons, and other combinations, and they handle more Unicode special cases correctly.

Essentially, collations are more featureful and correct.

The deprecation in Django 4.2 encourages you to follow PostgreSQL’s advice. You can specify a collation for CharField, TextField, and subclasses like EmailField, with the db_collation argument (added in Django 3.2). So for case-insensitive fields in PostgreSQL, you now need to create a case-insensitive collation and use it.

Django Fellow Mariusz Felisiak suggested this change in January 2022. Then in July that year, he then worked on it, in Ticket #33872. Thank you Mariusz for this tidy-up!

What to do

To deal with this deprecation, you need to migrate each CI*Field to its vanilla variant, with an appropriate case-insensitive collation. All the pieces are in place for this migration if you are using Django 3.2+ with PostgreSQL 12+, so you can make the change well ahead of upgrading to Django 4.2.

1. Add a new collation

As the release notes say, you need a “case-insensitive non-deterministic collation”. Let’s deconstruct that:

A collation is a specification of how to compare and sort strings.
Case-insensitive: upper and lower case letters compare equal, and sort together.
Non-deterministic: a PostgreSQL term meaning that the collation needs to be used for sorting - PostgreSQL cannot rely on sorting by byte values.

Okay, so how do you create such a collation? Here’s an example from the PostgreSQL documentation on non-deterministic collations:

CREATE COLLATION case_insensitive (
    provider = icu, locale = 'und-u-ks-level2', deterministic = false
);

This might be the SQL you want, but let’s unpack it before looking at how to run it in Django migrations.

CREATE COLLATION - creates a new collation within the current schema.
case_insensitive - our chosen name for the collation.
provider = icu - tells PostgreSQL to use the ICU library (International Components for Unicode) for this collation. You will need a version of PostgreSQL that was compiled with ICU, as per the managing collation docs.

It’s normally available in OS packages, Docker images, and managed providers, etc. If you’re cautious, you can check with:
```
example=# SELECT EXISTS(SELECT * FROM pg_collation WHERE collname LIKE '%-x-icu');
 exists
--------
 t
(1 row)
```
und-u-ks-level2 is a language tag with extensions. It may look like gobbledygook at first, but we can systematically deconstruct it using the Unicode standards:
- und - undetermined language, which activates the Unicode “root collation”. This collation sorts symbols first, then alphabetically per script.
  
  You may wish to use language-specific sorting, in which case swap this for a language code, e.g. de for Germany.
- -u- - this specifies that what follows are extra “Extension U” Unicode attributes. These are all drawn from this table.
- ks-level2 - the ks attribute defines the collation strength, here set to “level 2”. Strength level 2 doesn’t include case in comparisons, only letters and accents. The best explanation of the strength levels I found was in the ICU documentation.
  
  You may wish to use level 1 instead, for accent-insensitive comparisons.
Another neat attribute you might want to add (after another -): kn-true - numeric ordering. This will sort e.g. “a-20” before “a-100”.
deterministic = false tells PostgreSQL the collation is non-deterministic, as above.

Okay so that’s the collation creation SQL analyzed. Time to put this statement into a migration. You could use RunSQL, but it’s more Djangoey and reversible to use the CreateCollation operation (added in Django 3.2).

Create an empty migration in your “main” or “core” Django app:

$ ./manage.py makemigrations example --empty --name case_insensitive_collation
Migrations for 'example':
  example/migrations/0005_case_insensitive_collation.py

Then, edit that migration file to include a CreateCollation operation:

from django.contrib.postgres.operations import CreateCollation
from django.db import migrations


class Migration(migrations.Migration):
    dependencies = [
        ("example", "0004_something_or_other"),
    ]

    operations = [
        CreateCollation(
            "case_insensitive",
            provider="icu",
            locale="und-u-ks-level2",
            deterministic=False,
        ),
    ]

Using the sqlmigrate command, you can see the SQL that Django will run for the migration, forwards and backwards:

$ python manage.py sqlmigrate example 0005
BEGIN;
--
-- Create collation case_insensitive
--
CREATE COLLATION "case_insensitive" (locale="und-u-ks-level2", provider="icu", deterministic=false);
COMMIT;

$ python manage.py sqlmigrate example 0005 --backwards
BEGIN;
--
-- Create collation case_insensitive
--
DROP COLLATION "case_insensitive";
COMMIT;

Alright, step one done 😅.

2. Alter fields

Replace all CI*Fields in your models with their vanilla variants, specifying db_collation with your collation name. For example, to update this User class:

 class User(AbstractBaseUser):
-    email = CIEmailField(_("email address"), unique=True)
+    email = models.EmailField(_("email address"), db_collation="case_insensitive", unique=True)
     ...

Then, generate migrations:

$ ./manage.py makemigrations
Migrations for 'users':
  users/migrations/0010_alter_user_email.py
    - Alter field email on user

(If you have models in your “main” / “core” app, you may wish to combine with the above CreateCollation migration.)

Check the new migration with sqlmigrate as well:

$ ./manage.py sqlmigrate users 0010
BEGIN;
--
-- Alter field email on user
--
DROP INDEX IF EXISTS "users_user_email_243f6e77_like";
ALTER TABLE "users_user" ALTER COLUMN "email" TYPE varchar(254) COLLATE "case_insensitive";
COMMIT;

$ ./manage.py sqlmigrate users 0010 --backwards
BEGIN;
--
-- Alter field email on user
--
DROP INDEX IF EXISTS "users_user_email_243f6e77_like";
ALTER TABLE "users_user" ALTER COLUMN "email" TYPE citext;
COMMIT;

All looks as expected, with ALTER COLUMN statements to change the column’s data type.

(The DROP INDEX IF EXISTS statement ensures no “like” index exists. This is an index that Django adds for CharField / TextField classes, to speed up SQL LIKE matches. Setting db_collation makes Django remove any such index, since non-deterministic collations do not support LIKE, or such an index. Since the column previously used the citext type, no such index will exist anyway.)

One caveat: if you’re moving from CICharField or CIEmailField to their vanilla variants, your data might be too long. The CI*Fields all use the citext type, which is unlimited in size, whilst CharField and EmailField use varchar. If your data doesn’t fit, PostgreSQL will error at migration time like:

silvr@a67bfe06cb5c:/opt/silvr$ ./manage.py migrate users 0010
Operations to perform:
  Target specific migration: 0010_alter_user_email, from users
Running migrations:
  Applying users.0010_alter_user_email...Traceback (most recent call last):
  ...
  File "/.../site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.DataError: value too long for type character varying(254)

You can check ahead of time by querying your database:

In [2]: from django.db.models.functions import Length

In [3]: from django.db.models import Max

In [4]: User.objects.aggregate(m=Max(Length("email")))
Out[4]: {'m': 312}

You can mitigate this by specifying a larger max_length, or migrating to TextField instead.

3. Test-ify

With more involved model definitions, it’s best to test to ensure all the pieces are working correctly. Here are a couple tests checking the above User.email field is case-insensitive and unique:

from django.db import IntegrityError
from django.test import TestCase

from users.models import User


class UserTests(TestCase):
    def test_email_case_insensitive_search(self):
        user = User.objects.create(email="Hacker@example.com")
        user2 = User.objects.get(email="hacker@example.com")
        assert user == user2

    def test_email_case_insensitive_unique(self):
        User.objects.create(email="Hacker@example.com")
        msg = 'duplicate key value violates unique constraint "users_user_email_key"'
        with self.assertRaisesMessage(IntegrityError, msg):
            User.objects.create(email="hacker@example.com")

Adapt as necessary.

4. Adjust queries

Some of Django’s lookups, like icontains, map to a SQL LIKE. PostgreSQL does not allow such queries with nondeterministic collations—you will see this error if such a query is attempted:

NotSupportedError: nondeterministic collations are not supported for LIKE

One solution is to use the Collate database function to create a differently-collated annotation for the field, and filter against that:

from django.db.models.functions import Collate
from users.models import User

User.objects.annotate(
    email_deterministic=Collate("email", "und-x-icu"),
).filter(email_deterministic__icontains="example")

und-x-icu is a general purpose, language-agnostic Unicode collation from ICU.

I encountered this within the admin, which uses icontains by default for search fields. So the solution was to add the annotation in get_queryset() and then declare it in search_fields:

...


class UserAdmin(BaseUserAdmin):
    search_fields = ("email_deterministic", ...)

    ...

    def get_queryset(self, request: HttpRequest) -> QuerySet[User]:
        return (
            super()
            .get_queryset(request)
            .annotate(
                email_deterministic=Collate("email", "und-x-icu"),
            )
        )

You may want to check for performance regressions, since such lookups probably need table scans. That said, it’s likely LIKE against citext columns also requires a table scan.

Alright, that is all it takes to migrate

Maybe drop `citext` too

If you’ve migrated all your CI*Fields, you may want to remove the citext extension from your database. This will save a sliver of memory in PostgreSQL, and it will make it a bit harder to add CI*Fields in the future.

Django only has a PostgreSQL CreateExtension operation, so you need to use RunSQL to drop the extension. Here’s an example that you can put into a migration:

migrations.RunSQL(
    sql='DROP EXTENSION IF EXISTS "citext"',
    reverse_sql='CREATE EXTENSION IF NOT EXISTS "citext"',
)

Adding new case-insensitive `db_collation` fields can fail

If you try to directly add a new field with db_collation set to a nondeterministic collation, the migration will fail on Django <4.2 with:

django.db.utils.NotSupportedError: nondeterministic collations are not supported for operator class "varchar_pattern_ops"

Check sqlmigrate for the migration:

$ ./manage.py sqlmigrate users 0010
BEGIN;
--
-- Add field email to user
--
ALTER TABLE "users_user" ADD COLUMN "email" varchar(254) COLLATE "case_insensitive" NULL UNIQUE;
CREATE INDEX "users_user_email_0c800cea_like" ON "users_user" ("email" varchar_pattern_ops);
COMMIT;

The statement that fails is the CREATE INDEX. Django creates such indexes by default for varchar and text fields to speed up LIKE queries. But as we’ve seen, they aren’t supported when using a nondeterministic collation.

Django 4.2 includes a fix to not create this index when db_collation is declared: Ticket #33901. On earlier versions, you can edit your migration to use RunSQL with the CREATE INDEX statement removed:

from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [
        ("users", "0009_something_something"),
    ]

    operations = [
        migrations.RunSQL(
            sql="""
            ALTER TABLE "users_user" ADD COLUMN "email" varchar(254) COLLATE "case_insensitive" NULL UNIQUE;
            """,
            reverse_sql="""
            ALTER TABLE "users_user" DROP COLUMN "email" CASCADE;
            """,
            state_operations=[
                migrations.AddField(
                    model_name="user",
                    name="email",
                    field=models.EmailField(
                        db_collation="case_insensitive",
                        max_length=254,
                        null=True,
                        unique=True,
                        verbose_name="email address",
                    ),
                ),
            ],
        ),
    ]

(Grab reverse_sql from sqlmigrate --backwards.)

Fin

I wrote this post whilst dealing with the migration for my client Silvr. Thank you to Anna Bierońska and Pascal Fouque there for reviewing the post and PR. And thanks to the original deprecator Mariusz Felisiak for reviewing the post too.

And in any case: Good luck, good luck, GOOD LUCK, …

—Adam

😸😸😸 Check out my new book on using GitHub effectively, Boost Your GitHub DX! 😸😸😸

One summary email a week, no spam, I pinky promise.

Related posts:

Tags: django, postgresql