Building a better DatabaseCache for Django on MySQL

2015-05-17 Figure 1 - MySQLCache in the flesh

I recently released version 0.1.10 of my library django-mysql, for which the main new feature was a backend for Django’s cache framework called MySQLCache. This post covers some of the inspiration and improvements it has, as well as a basic benchmark against Django’s built-in DatabaseCache.

TL;DR - it’s better than DatabaseCache, and if you’re using MySQL, please try it out by following the instructions linked at the end.

Why bother?

Django’s cache framework provides a generic API for key-value storage, and gets used for a variety of caching tasks in applications. It ships with multiple backends for popular technologies, including Redis and Memcached, as well as a basic cross-RDBMS DatabaseCache. The DatabaseCache is recommended only for smaller environments, and due to its supporting every RDBMS that Django does, it is not optimized for speed. Redis and Memcached are the most popular cache technologies to use, being specifically designed to do key-value storage; you could even say Django’s cache framework is specifically designed to fit them.

If they work so well, why would anyone bother using DatabaseCache, and why would I care about improving on it? Well, I have a few reasons:

Fewer moving parts

If you can get away with just a database server, and not a database server and a memcached server/cluster, it’s easier to build and maintain your system. Adding more servers and libraries always increases the complexity of your system, and the number of potential problems.
Persistent data
Memcache makes no persistence promises, with new keys overwriting old ones whenever necessary. Redis can run in a persistent mode, but is designed to be in-memory, and thus most apps tend to rely on their RDBMS for actually persistent data. If you have key-value data which your site relies on, it’s much better to keep it in a data store designed for durability and backed by disk than in an ephemeral memcached node.

What kind of persistent data am I talking about? At YPlan we are using the rather excellent Whoosh to power our search feature, which uses a search index consisting of a few MB of files. We rebuild this regularly from the database in a Celery task, and have to store it somewhere for the webservers to pull it from. Memcached is too ephemeral and a filestore like S3 is just another moving part to worry about; by storing it in a MySQLCache table, we have the best of both.

Another example would be counters. While Memcached counters may be mostly reliable and useful for things like rate-limiting, when a node goes offline or a key gets bumped, you lose count. If you are in advertising this could be a particularly costly loss of clicks!
Large[r] data

Memcache has a hard upper limit of 1MB for a value, for which the recommended solution is to compress and split your values - but then you have to deal with lost keys more often. 1MB is just a bit too small for our search index, even after compression. By contrast, Redis has a limit of 512MB, and MySQL can reach 1GB.

MySQL Inspiration

I’ve also seen a few success stories for using MySQL as a key-value store:

Sunny Gleason’s Percona Live presentation “Practical Tips for Using MySQL as a Scalable Key-Value Store” walks through implementing every possible feature you could want, based on real-life experience of avoiding adding Memcached, Redis, Riak servers to systems.
“How FriendFeed uses MySQL to store schema-less data” covers a social network storing data no-relationally in MySQL.
And at YPlan, using raw SQL, we’ve been using a key-value table for recording several GBs of per-customer data that our recommendation system produces, with great success.

Development

When we first set up the Whoosh storage in cache, we used DatabaseCache since we knew we wanted the persistence. Unfortunately when I read the source I realized it wouldn’t be very efficient or fast, due to its design which caters to every database backend.

I created MySQLCache to correct as many problems as I could see by using MySQL-specific syntax and features. My full list of improvements includes:

Using BLOB for value storage

DatabaseCache uses a TEXT column with base64 encoding - I guess because not every RDBMS has a binary data type. Unfortunately this increases the size of every value by 33% owing to the unused bits, wasting network packets and disk space. MySQLCache uses MySQL’s binary data type BLOB to avoid wasted bits.
SELECT COUNT(*) on cull

On every set() operation, DatabaseCache does a cull check to see if needs to remove any keys to remain under its specified max size. Unfortunately the only way it can do this is to run SELECT COUNT(*), which will cause a table scan on most database backends, especially on MySQL+InnoDB, as I’ve written about being slow before.

With MySQLCache the cull-check behaviour becomes probabilistic, by default performing the cull() check on only 1% of set() operations - making 99% of your write operations faster at the sacrifice of going a bit over the specified maximum number of items. You can also disable the cull-on-set behaviour altogether by setting the probability to 0 and perform it yourself in a background task so it never affects your end users.
Making use of MySQL’s upsert syntax.

This is an operation missing from many RDBMSs, which included PostgreSQL up until a couple of weeks ago. An upsert allows a single statement to perform either an UPDATE of existing data or an INSERT to create new data, race-condition free - for a longer explanation, see the detailed PostgreSQL wiki page.

MySQL has had the INSERT .. ON DUPLICATE KEY UPDATE statement for upserts for some time, which MySQLCache uses to implement the add(), set(), and incr() operations each with single queries. This avoids the race conditions that DatabaseCache is open to - the kind of behaviour that only crops up when your start-up actually gets popular and has many concurrent visitors.
The *_many operations all use single queries

DatabaseCache avoids creating multi-row queries because the syntax varies more by RDBMS, and instead when you call the set_many() operation, it simply calls set() repeatedly. It didn’t take much to upgrade MySQLCache, to perform the set_many(), get_many(), etc. as single queries. Users only need to be aware of the protocol packet size limit in MySQL, max_allowed_packet, which can easily be hit when setting or getting multiple large values.
Automatic compression with zlib

I copied this feature from pylibmc, which is a fast memcached client library. Even relatively small values of a few KB benefit from the reduction in network and storage size. MySQLCache also provides capacity for tuning the zlib compression level, or adding custom compression schemes.

Benchmark

I’ve created and run a simple benchmark app, available on GitHub. It tests the configured backends with 1000 repeats of operations such as set(), set_many(), and get() with randomly sized values. It’s not a perfect load simulator, and it doesn’t use the network, but it gives an idea of the relative speeds achievable by the different cache backends.

It runs MySQLCache with MySQL’s two most popular storage engines - the non-transactional MyISAM and the default, transactional InnoDB, as well as DatabaseCache and Django’s two Memcached backends, MemcachedCache and PyLibMCCache.

I ran it locally with MariaDB (a MySQL fork) version 10.0.17 and memcached version 1.4.22 on my Macbook Pro. The results are, in seconds per 1000-operation benchmark:

Cache Alias	set benchmark	set_many benchmark	get benchmark	get_many benchmark	delete benchmark	incr benchmark
`DatabaseCache`	1.023	42.193	0.237	8.621	6.216	1.084
`MySQLCache_MyISAM`	0.329	6.080	0.311	1.949	0.639	0.195
`MySQLCache_Innodb`	0.629	6.392	0.213	1.168	0.619	0.344
`MemcachedCache`	0.102	1.460	0.089	1.308	1.068	0.082
`PyLibMCCache`	0.075	2.218	0.061	0.488	1.725	0.057

MySQLCache clearly isn’t as fast as MemcachedCache, but to be fair, it is writing to disk. Thankfully it has come out quite a bit faster than DatabaseCache, especially on the *_many operations. I suspect the performance gap gets much bigger as the table grows to tens of thousands, or hundreds of thousands of rows, owing to the SELECT COUNT(*) behaviour - I just didn’t have the patience to keep running the benchmark.

As for the storage engines, MyISAM is faster than InnoDB for set() but slower for get() - not a great trade-off for caching where reads are much more common than writes. Also, if most of your application uses InnoDB and you mix it with MyISAM’s non-transactionality, it can lead to developer confusion as to what to expect.

I would have liked to benchmark the MEMORY storage engine, which would avoid writing to disk and be a fairer comparison with Memcached, but it can’t store BLOB columns (large binary data). There is a patch in the MySQL fork Percona Server to enable this, but I didn’t have time to try it and I guess most projects wouldn’t switch MySQL version so easily to use a cache. Of course, as stated, the permanence may be the very reason you are using a database cache backend, so any potential speed improvements it would give may not be worth worrying about :)

Future Work

MySQL 5.6 ships with a memcached interface plugin for InnoDB, avoiding MySQL protocol overhead and SQL parsing while remaining persistent. I haven’t tried it out yet, but it can probably be used directly with Django’s PyLibMCCache or MemcachedCache backends, although with a little setup to create the table and configure the plugin. Maybe django-mysql will provide helpers for the setup process in the future, and I can benchmark it against MySQLCache.

In MySQLCache I’ve also considered adding operations beyond Django’s basic API that can be efficiently supported due to the rows being stored in a table. For example, since keys are stored in order in the primary key BTree index, it is quite efficient to search for keys with a given prefix. Additionally, with MariaDB’s dynamic columns, hashes, lists, or sets could potentially be supported, making it a bit like Redis!

If you think of something, open an issue on GitHub and maybe send a pull request!

To use it…

If MySQLCache sounds appetising and is suitable for your project, you can pip install django-mysql, follow the library installation instructions, and follow the cache setup guide and you’ll be on your way.

Thanks!

😸😸😸 Check out my new book on using GitHub effectively, Boost Your GitHub DX! 😸😸😸

One summary email a week, no spam, I pinky promise.

Tags: django, mysql