Backporting a Django ORM Feature with Database Instrumentation

2020-07-29 Backwards like this ancient “lizard”

Last week I covered Django’s database instrumentation, and making a wrapper that’s always installed. Here’s a different use case that I encountered last year on a project.

I was helping optimize query patterns for a model - let’s call it Book. An import function used a loop to import new model instances based on a unique title field:

for title in titles:
    Book.objects.get_or_create(title=title, defaults=...)

This function took several seconds as it inserted several hundred records, required for the project to work. This was particularly noticeable at the start of tests.

(Surprise, this is also a test optimization story!)

We can speed up such an import with the QuerySet.bulk_create() method, passing ignore_conflicts=True. This method inserts a list of instances into the database in as few queries as possible (as little as one). The ignore_conflicts=True option skips any instance that conflicts with a unique or primary key constraint:

Book.objects.bulk_create(
    [Book(title=title, ...) for title in titles],
    ignore_conflicts=True,
)

This is a great solution, but unfortunately it wasn’t available. The client’s project was on Django 2.0 at the time and the ignore_conflicts argument to bulk_create() was added in 2.2.

Upgrading Django would be a rather large yak shave. Instead, I decided to backport the behaviour of ignore_conflicts=True.

Normal backporting takes a lot of work:

Instead of doing that, I opted to try backport the behaviour in a different way. I used database instrumentation to edit the SQL generated by the ORM to match what ignore_conflicts=True generates. I implemented this in a context manager:

from django.db import connection


with make_bulk_create_ignore_conflicts(connection):
    Book.objects.bulk_create(
        [Book(title=title, ...) for title in titles],
        ignore_conflicts=True,
    )

The context manager was a little more involved than solely setting up the database instrumentation wrapper:

from contextlib import contextmanager
from unittest import mock

import django


if django.VERSION >= (2, 2):
    raise AssertionError(
        "On Django 2.2+, replace make_bulk_create_ignore_conflicts with use"
        + " of bulk_create's new argument ignore_conflicts=True"
    )


@contextmanager
def make_bulk_create_ignore_conflicts(connection):
    # Temporarily disable the backend feature flag to pretend to Django that
    # PostgreSQL doesn't support "RETURNING". This prevents it fetching the
    # ID's back, which is a bit different to normal bulk_create, but it
    # generates SQL which we can easily append to
    patch_features = mock.patch.object(
        connection.features, "can_return_ids_from_bulk_insert", False
    )

    # Rewrite SQL to add the same SQL suffix on INSERT statements that Django
    # 2.2+ uses for ignore_conflict=True
    # https://github.com/django/django/blob/stable/2.2.x/django/db/models/sql/compiler.py#L1316
    def make_inserts_conflict_do_nothing(execute, sql, params, many, context):
        if sql.startswith("INSERT INTO"):
            sql += " ON CONFLICT DO NOTHING"
        return execute(sql, params, many, context)

    apply_wrapper = connection.execute_wrapper(make_inserts_conflict_do_nothing)

    with patch_features, apply_wrapper:
        yield

Let’s look through the code one step at a time.

The first step is the Django version check. This is important so that after upgrading Django the project switches to the official feature. Without such checks, projects tend to accumulate workarounds. It raises an exception at import time to ensure it’s spotted early on when upgrading, regardless of test coverage.

The second step is the definition of the context manager itself. It uses the contextlib.contextmanager decorator for simplicity. I made the name excessively long so it’s noticeable as doing something unusual.

The third step is the patching of connection.features using unittest’s mock.patch.object. connection.features is part of Django’s undocumented (but mostly stable) database backend API. The attributes on features control SQL generation so the output is suitable for the given database. The patch temporarily disables addition of the RETURNING clause on INSERT queries, to allow easy addition of the ON CONFLICT clause.

Generally, patching global objects is not thread-safe. But it’s okay here because Django creates separate database connection objects per thread.

The fourth step is the database instrumentation wrapper function, make_inserts_conflict_do_nothing(). This wrapper inspects each query on its way to the database. If the SQL is an INSERT query, it appends the correct " ON CONFLICT DO NOTHING" suffix, as per the PostgreSQL INSERT documentation.

The fifth step is the with statement that applies the patch and adds the database instrumentation temporarily. yield then pauses the @contextmanager function until the user of our context manager exits its with statement.

Fin

This was a fun exercise in combining several Django features. I’ve since upgraded the client project to Django 2.2, which triggered the assertion message and allowed me to remove of this backport. But it worked well whilst it lasted.

Thanks to Tom Forbes for the initial implementation of bulk_create(ignore_conflicts=True) back in Ticket #28668. I hope this helps you speed up any data imports, or write ORM backports,

—Adam


Working on a Django project? Check out my book Speed Up Your Django Tests which covers loads of best practices so you can write faster, more accurate tests.


Subscribe via RSS, Twitter, or email:

One summary email a week, no spam, I pinky promise.

Related posts:

Tags: django