Sparing queries when displaying Haystack results

December 14, 2009 Tags: django, haystack, django-community, sql

Playing with Haystack and query optimizations...

A few days ago I've done some query optimization on a pet Django project. First, let's have a look at the database layout. I have an app called feeds, and here is what the models look like:

class Category(models.Model):
    name = models.CharField(_('Name'), max_length=50)
    color = models.CharField(_('Color'), max_length=50)
    # some other attributes

class Feed(models.Model):
    name = models.CharField(_('Name'), max_length=255)
    category = models.ForeignKey(Category, verbose_name=_('Category'),
                                 related_name='feeds')
    # some other attributes

class Entry(models.Model):
    feed = models.ForeignKey(Feed, verbose_name=_('Feed'),
                             related_name='entries')
    title = models.CharField(_('Title'), max_length=255)
    date = models.DateTimeField(_('Date'))
    read = models.BooleanField(_('Read'), default=False, db_index=True)
    content = models.TextField(_('Content'))
    # some other attributes

I have a list of Entry instances to display with some information about the related Feed and Category. Everything can be fetched in a single query using select_related() and only() QuerySet methods:

entries = Entry.objects.filter(...).only('id', 'feed', 'title',
                                         'date', 'read').select_related()

Now, let's say you have added search capabilities with Haystack. The Entry model is indexed, and can be searched this way:

sqs = SearchQuerySet().filter(some_args)
form = ModelSearchForm(data=request.GET,
                       searchqueryset=sqs,
                       load_all=True)
if form.is_valid():
    query = form.cleaned_data['q']

if query:
    results = form.search()
else:
    results = EmptySearchQuerySet()

At this point we have a list of results, and we can iterate over the list in our templates.

{% for result in results %}
  {{ result.object.title }} in {{ result.object.feed.category.name }}
{% endfor %}

However, this is going to hit the database a few times at each iteration because we're also displaying related objects. Here's what Haystack says:

Note that iterating over SearchResults and getting the object for each result will do O(N) database queries, which may not fit your needs for performance.

Wouldn't it be great to fetch the whole result set in a single query? We'll have to build the query by hand but that's the price to pay to save so many milliseconds. In that particular case, we have only one type of object and that simplifies the process a lot.

First, we need to get the IDs of the objects to fetch. Each SearchResult has an id that looks like <app_label>.<model>.<id>. In this case, for example, feeds.entry.12345. Here's the list of the primary keys to fetch:

ids = (int(result.id.replace('feeds.entry.', '')) for result in results)

We have to keep in mind that Haystack gives a list of results ordered by relevance. We have to keep this order, and therefore build a SQL query with a custom ordering statement. This can be achieved with the CASE statement:

SELECT * from feeds_entry
WHERE feeds_entry.id IN (10094, 10092, 10373)
ORDER BY CASE
    WHEN feeds_entry.id=10094 THEN 0
    WHEN feeds_entry.id=10092 THEN 1
    WHEN feeds_entry.id=10373 THEN 2
    ELSE 3
END

This is the syntax for PostgreSQL, the MySQL syntax is slightly different. With this statement, we can specify the ordering we want: we just have to add a WHEN statement for each result.

So, what we have to do is to write the main SQL template (with the needed JOINs) and the IN and CASE clauses:

cases_template = """
    WHEN feeds_entry.id=%(id)s THEN %(score)s"""
cases = ''
in_clause = []
for (score, pk) in enumerate(ids):
    in_clause.append(str(pk))
    cases = cases + cases_template % {'id': pk, 'score': score}
in_clause = ', '.join(in_clause)

# The final query
sql_query = """
SELECT feeds_entry.id, feeds_entry.feed_id, feeds_entry.title,
       feeds_entry.date, feeds_entry.read, feeds_feed.name,
       feeds_category.color
FROM feeds_entry
     LEFT JOIN feeds_feed ON feeds_entry.feed_id=feeds_feed.id
     LEFT JOIN feeds_category ON feeds_feed.category_id=feeds_category.id
WHERE feeds_entry.id IN (%(in)s)
ORDER BY CASE %(cases)s
    ELSE %(length)s
END
""" % {'cases': cases, 'in': in_clause, 'length': score}

An the final step is to execute the query and parse the results! That piece of code is boring but QuerySet.raw() isn't part of Django yet:

entries = [] # This is the QuerySet of the results (kinda)
if results: # Don't execute the query if there is no result
    cursor = connection.cursor()
    cursor.execute(sql_query)

    for row in cursor.fetchall():
        category = Category(color=row[6])
        feed = Feed(id = row[1], name=row[5], category=category)
        entry = Entry(id=row[0], title=row[2], date=row[3],
                      read=row[4], feed=feed)

        entries.append(entry)

Finally, we've fetched the entire result set —including related objects— ordered by relevance and in a single query. That may be very useful if the database is the bottleneck of your app.

Of course, this is just the SQL way of doing it. We could also do the ordering (and slicing/pagination) in our python code, making it database-agnostic. That was just a pretext to learn a not-so-common SQL statement!

Comments

Add a comment

Comments are closed for this entry.

Short URL

http://bruno.im/e4