Django and the N+1 Queries Problem

2020-09-01 Snakey snake

This is a cross-post from the Scout APM blog, where I occasionally write. I also maintain the Scout Python integration.

The N+1 Queries Problem is a perennial database performance issue. It affects many ORMs and custom SQL code, and Django’s ORM is not immune either.

In this post, we’ll examine what the N+1 Queries Problem looks like in Django, some tools for fixing it, and most importantly some tools for detecting it. Naturally Scout is one of those tools, with its built-in N+1 Insights tool.

We’ll cover:

What Is the N+1 Queries Problem?
Tools to Fix the N+1 Queries Problem
Tools for Finding N+1 Query Problems

What Is the N+1 Queries Problem?

In a nutshell: code that loops over a list of results from one query, and then performs another query per result.

Let’s look at a basic example and a couple of its natural extensions.

N+1 Queries

Say we had this code:

books = Book.objects.order_by("title")
for book in books:
    print(book.title, "by", book.author.name)

This code uses print() for simplicity, but the problem exists in all other ways data might be accessed, such as in templates or views returning JSON responses.

Let’s trace the queries our code executes:

One line 1, we create the Book QuerySet. Because QuerySets are lazy, this doesn’t execute any query.
On line 2, we iterate over the Book QuerySet. This causes the QuerySet to evaluate and fetch its results, which takes one query.
One line 3, we access a couple of attributes from each Book. title is a field on the Book itself, so Django fetched it in the first query. But the author attribute is a foreign key, so Django didn’t fetch it. Therefore Django executes another query on this line to fetch the given Author corresponding to book.author_id. The Author instance then includes its name field.

So we execute one query to fetch all the Books, and N queries to fetch the authors, where N is the number of books. This can become a performance problem because each query involves the overhead of communicating with the database server, so the total runtime can be high.

The problem is sometimes also called The 1+N Queries Problem to match the order of operations. N+1 seems to be a more popular name though, perhaps because it highlights problematic N queries.

For another perspective, we can look at the actual queries our code executes. The first query would look like:

SELECT
  id,
  title,
  author_id,
  ...
FROM book
ORDER BY title

The following N author queries would look like:

SELECT
  id,
  name,
  ...
FROM author
WHERE id = %s

(With %s replaced by the each author ID value.)

The repetition of queries like SELECT ... WHERE id = %s is a key signature of the problem. You can spot it with a tool that logs your application’s SQL queries (more these later).

2N+1 Queries

If we access multiple foreign key attributes in the loop, we’ll increase the number of queries. Say we changed the code to display our authors’ home countries as well:

books = Book.objects.order_by("title")
for book in books:
    print(
        book.title,
        "by",
        book.author.name,
        "from",
        book.author.country.name,
    )

The access to author.country foreign key executes another query per loop. So we now have two times the N queries we had before.

NM+N+1 Queries

If we nest loops, we can execute multiple queries per N query, compounding the problem.

Imagine our models changed to allow multiple authors per book. We would adjust our code like so:

books = Book.objects.order_by("title")
for book in books:
    print(book.title, "by: ", end="")
    names = []
    for author in book.authors.all():
        names.append(f"{author.name} from {author.country.name}")
    print(", ".join(names))

Line 5 executes each N query, fetching the list of authors per book. Line 6 then accesses the country foreign key per author. If there are an average of M authors per book, there will be N times M country queries.

Often either N or M is small, so the total number of queries will be fairly controlled. But it’s possible to see really large numbers of queries if both are bigger. If we added more nested loops, there would be further multiplication of queries.

Often such nested loops will not be so easy to spot. They can be obscured by intermediate layers of function calls or template includes.

Tools to Fix the N+1 Queries Problem

Django provides two QuerySet methods that can turn the N queries back into one query, solving the performance issue. These methods are select_related() (docs) and prefetch_related() (docs).

They work similarly - both fetch the related model alongside the original query. The difference is that select_related() fetches the related instances in the same query, whilst prefetch_related() uses a second query. Let’s look at them in turn, then a bonus tool that can automatically use prefetch_related().

`select_related()`

Taking our first N+1 example, we can use select_related() by adding it to our Book QuerySet:

books = Book.objects.order_by("title").select_related("author")
for book in books:
    print(book.title, "by", book.author.name)

Line 2 now executes a single query that selects both the Books and their related Authors. Line 3’s access of book.author no longer executes a query because the Author instance is already attached.

So we’ve optimized the code to execute only 1 query, instead of N+1.

The SQL with select_related() uses a JOIN to merge the two tables. The query will look something like this:

SELECT
  book.id,
  book.title,
  book.author_id,
  ...,
  author.id,
  author.name,
  ...
FROM
  book
  INNER JOIN author ON (book.author_id = author.id)
ORDER BY book.title

Each row of the results table includes fields from both the book and author tables:

book.id	book.title	book.author_id	author.id	author.name
123	The Hundred and One Dalmatians	678	678	Dodie Smith
234	The Hound of the Baskervilles	789	789	Arthur Conan Doyle
345	The Lost World	789	789	Arthur Conan Doyle

Django loads the data into the correct model classes.

One downside of this approach is that we might fetch duplicate author data. In the above example, the data includes two copies of Author ID 789, Arthur Conan Doyle. If an Author appeared a lot of times, or the model class had some large fields, the total data size would be large, and performance would suffer. In such cases, it’s better to use prefetch_related().

We can also fix our 2N+1 Queries example with select_related(), by declaring the chained relation:

Book.objects.order_by("title").select_related("author", "author__country")

Unfortunately select_related() cannot help us with the nested NM+N+1 queries example. select_related() can only work with one-to-one and one-to-many relationships, whilst that example has a many-to-many relationship between books and authors. Thankfully, prefetch_related() can work with many-to-many relationships.

django-auto-prefetch

Here I’d like to highlight a special project, django-auto-prefetch. Using this for your models changes Django’s behaviour when accessing unfetched foreign keys.

Recall our first example:

books = Book.objects.order_by("title")
for book in books:
    print(book.title, "by", book.author.name)

On line 3 Django fetches the single author relating to the current book. This happens on each of the N iterations of the loop.

django-auto-prefetch changes this behaviour to fetch all the authors relating to all the fetched books, in one query. It essentially assumes that since one book.author was required, they all will be.

This is equivalent to using prefetch_related() in the original Book QuerySet, hence the library’s name. It has the big advantage that we don’t need to know ahead of time which fields are required.

Installation of django-auto-prefetch is slightly involved, as it requires swapping the base class for all your models. But it’s worth it on larger projects.

The library was created by my friend Gordon Wrigley. We hope to merge it into Django at some point, so the behaviour would be more accessible. It might even eradicate the N+1 Queries Problem for Django applications.

Tools for Finding N+1 Query Problems

N+1 Query Problems are the most frequent Django ORM performance problem. It’s worth having some tooling that can help you find them. Let’s look at three tools that can help you.

nplusone

The nplusone package logs warnings for potential N+1 Queries. The package works but seems slightly unmaintained, with no commits for two years.

After installing it, we can see several log messages for our example, interlaced with our print() statements:

Potential n+1 query detected on `Book.author`
The Hound of the Baskervilles by Arthur Conan Doyle
Potential n+1 query detected on `Book.author`
The Hundred and One Dalmatians by Dodie Smith
Potential n+1 query detected on `Book.author`
The Lost World by Arthur Conan Doyle
[30/Jul/2020 06:23:46] "GET / HTTP/1.1" 200 9495

These “Potential n+1 query detected” log messages can help us track down the problem.

nplusone is more focussed than django-debug-toolbar, but is also usable with other ORMs, such as SQL Alchemy. It has many of the same limitations though.

We can use it for all views. But like django-debug-toolbar, it can’t help with management commands, background tasks, and other code paths.

It’s always active, but requires us to manually inspect the logs.

Unfortunately the log message provides only a small amount of information. The log message provides no information about which line of our code is executing the N+1 Queries. We could get stack traces by configuring logging to transform the messages into exceptions, perhaps in development only. Though this may trigger false positives.

We can use nplusone in production. To receive notifications of problematic pages, we could configure our logging provider to alert on the string “Potential n+1 query detected”.

Scout APM

Scout APM automatically detects N+1 Queries in your application, and is suitable for both development and production. It can trace any view, background task, or other code path, and its traces provide rich information for debugging where performance issues occur.

Other APM (Application Performance Monitoring) products can also help with debugging N+1 Queries, but most don’t automatically highlight them.

After following the three step Django installation guide for our example application and loading the problematic view, the N+1 Queries is highlighted on the dashboard:

Scout APM dashboard showing n+1 insights panel

The top graph provides some insights into what our application is currently doing, and is followed by some useful summary statistics.

The “N+1 Insights” tab is what we’re here to look at. This tab lists views in which Scout detected N+1 Query execution. It uses a 150 millisecond threshold so we can focus on problems that really affect users.

If we click on the view name, we’re taken to the relevant trace:

The trace shows a timeline of the view’s execution. The problematic repeated query has the “N+1” tag and shows the full SQL. We can click on its “backtrace” button to reveal the full stack trace, allowing us to pinpoint the problematic code.

Fin

I hope this guide has equipped you to find and solve your N+1 Query Problems,

—Adam

😸😸😸 Check out my new book on using GitHub effectively, Boost Your GitHub DX! 😸😸😸

One summary email a week, no spam, I pinky promise.

Related posts:

Tags: django