2018-06-17

Data Filtering in a Django Website using Elasticsearch

In my Web Development with Django Cookbook, the section Forms and Views has a recipe called Filtering object lists. It shows how to filter a Django QuerySet dynamically by different filter parameters selected in a form. In practice, the approach works well, but with lots of data and complex nested filters, performance can suffer. Because of all those INNER JOINs in SQL, a page might take as long as 12 seconds to load, and that is not acceptable. I know that I could denormalize the database or tune the indices to optimize the SQL, but I found a better way to increase the loading speed. Recently we started using Elasticsearch for one of our projects, and its data filtering performance turned out to be much better: in our case, filtering became 2 to 16 times faster, depending on which query parameters were chosen.

What is Elasticsearch?

Elasticsearch is a Java-based search engine that stores data in JSON format and lets you query it using a special JSON-based query language. Using elasticsearch-dsl and django-elasticsearch-dsl, I can bind my Django models to Elasticsearch indexes and rewrite my object list views to use Elasticsearch queries instead of the Django ORM. The API of Elasticsearch DSL is chainable, like Django QuerySets or jQuery functions, and we'll have a look at it soon.
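
To give a feel for that JSON-based query language, here is a minimal sketch that builds a search request body as a plain Python dictionary and serializes it. The field names title and publishing_date are borrowed from the Book model shown in this article; the search term and the date are made up for illustration.

```python
import json

# A request body for Elasticsearch's search API, written as a plain dict.
# "bool" combines clauses: "must" clauses contribute to relevance scoring,
# "filter" clauses only include or exclude documents.
query_body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": "django"}},
            ],
            "filter": [
                {"range": {"publishing_date": {"lte": "2018-06-17"}}},
            ],
        }
    }
}

# This JSON string is what would be sent to the server.
payload = json.dumps(query_body, sort_keys=True)
print(payload)
```

The libraries introduced above generate structures like this for you, so you rarely have to write them by hand.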

The Setup

First, let's install the Elasticsearch server. Elasticsearch is quite a complex system, but it comes with convenient configuration defaults.

On macOS you can install and start the server with Homebrew:

$ brew install elasticsearch
$ brew services start elasticsearch

For other platforms, the installation instructions are also quite clear.

Then install django-elasticsearch-dsl in your Django project's virtual environment. "DSL" here stands for "domain-specific language".

With pipenv it would be the following from the project's directory:

$ pipenv install django-elasticsearch-dsl

If you are using just pip and a virtual environment, run this with your project's environment activated:

(venv)$ pip install django-elasticsearch-dsl

This, in turn, will install the related lower-level client libraries: elasticsearch-dsl and elasticsearch-py.

In the Django project settings, add 'django_elasticsearch_dsl' to INSTALLED_APPS.

Finally, add the following lines there to define the default connection:

ELASTICSEARCH_DSL={
    'default': {
        'hosts': 'localhost:9200'
    },
}

Elasticsearch Documents for Django Models

To illustrate how to use Elasticsearch with Django, I'll create Author and Book models, and then an Elasticsearch index document for the books.

models.py

# -*- coding: UTF-8 -*-
from __future__ import unicode_literals

from django.db import models
from django.utils.translation import ugettext_lazy as _
from django.utils.encoding import python_2_unicode_compatible


@python_2_unicode_compatible
class Author(models.Model):
    first_name = models.CharField(_("First name"), max_length=200)
    last_name = models.CharField(_("Last name"), max_length=200)
    author_name = models.CharField(_("Author name"), max_length=200)

    class Meta:
        verbose_name = _("Author")
        verbose_name_plural = _("Authors")
        ordering = ("author_name",)

    def __str__(self):
        return self.author_name


@python_2_unicode_compatible
class Book(models.Model):
    title = models.CharField(_("Title"), max_length=200)
    authors = models.ManyToManyField(Author, verbose_name=_("Authors"))
    publishing_date = models.DateField(_("Publishing date"), blank=True, null=True)
    isbn = models.CharField(_("ISBN"), blank=True, max_length=20)

    class Meta:
        verbose_name = _("Book")
        verbose_name_plural = _("Books")
        ordering = ("title",)

    def __str__(self):
        return self.title

Nothing fancy here. Just an Author model with fields id, first_name, last_name, author_name, and a Book model with fields id, title, authors, publishing_date, and isbn. Let's go to the documents.

documents.py

In the same directory of your app, create documents.py with the following content:

# -*- coding: UTF-8 -*-
from __future__ import unicode_literals

from django_elasticsearch_dsl import DocType, Index, fields
from .models import Author, Book

# Name of the Elasticsearch index
search_index = Index('library')
# See Elasticsearch Indices API reference for available settings
search_index.settings(
    number_of_shards=1,
    number_of_replicas=0
)


@search_index.doc_type
class BookDocument(DocType):
    authors = fields.NestedField(properties={
        'first_name': fields.TextField(),
        'last_name': fields.TextField(),
        'author_name': fields.TextField(),
        'pk': fields.IntegerField(),
    }, include_in_root=True)

    # Keyword fields are indexed as a single term, without analysis
    isbn = fields.KeywordField()

    class Meta:
        model = Book # The model associated with this DocType

        # The fields of the model you want to be indexed in Elasticsearch
        fields = [
            'title',
            'publishing_date',
        ]
        related_models = [Author]

    def get_instances_from_related(self, related_instance):
        """If related_models is set, define how to retrieve the Book instance(s) from the related model."""
        if isinstance(related_instance, Author):
            return related_instance.book_set.all()

Here we defined a BookDocument which will have fields: title, publishing_date, authors, and isbn.

The authors will be a list of nested dictionaries in the BookDocument. The isbn will be a KeywordField, which means that it will not be tokenized, lowercased, or otherwise processed, but stored and matched as a whole, as is.
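
The difference between an analyzed text field and a keyword field can be sketched in plain Python. This is only a rough, ASCII-only approximation of what an analyzer does, not Elasticsearch's actual implementation; the function names and the ISBN value are hypothetical.

```python
import re


def analyze_as_text(value):
    """Roughly mimic an analyzed text field: lowercase, split into word tokens."""
    return re.findall(r"[a-z0-9]+", value.lower())


def analyze_as_keyword(value):
    """A keyword field keeps the whole value as a single, unmodified term."""
    return [value]


isbn = "978-0-00-000000-2"  # hypothetical example value
print(analyze_as_text(isbn))     # several short tokens
print(analyze_as_keyword(isbn))  # one term, exactly as entered
```

With the text-like treatment, a search for one fragment of the number could match many unrelated books, which is why an identifier such as an ISBN is better kept as a keyword.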

The values for those document fields will be read from the Book model.

Using signals, the document will be updated automatically whenever a Book or Author instance is added, changed, or deleted. In the method get_instances_from_related(), we tell the search engine which books to update when an author is updated.

Building the Index

When the index document is ready, let's build the index on the server:

(venv)$ python manage.py search_index --rebuild

Django QuerySets vs. Elasticsearch Queries

The concepts behind SQL and Elasticsearch queries are quite different. One works with relational tables, the other with dictionaries. One uses queries that read like logical sentences, the other uses nested JSON structures. One compares the content verbatim, the other processes strings in the background and assigns a search relevance score to each result.

Despite these differences, I will try to draw analogies between the Django ORM and the elasticsearch-dsl API that are as close as possible.

1. Query definition

Django QuerySet:

queryset = MyModel.objects.all()

Elasticsearch query:

search = MyModelDocument.search()

2. Count

Django QuerySet:

count = queryset.count()

Elasticsearch query:

count = search.count()

3. Iteration

Django QuerySet:

for item in queryset:
    print(item.title)

Elasticsearch query:

for item in search:
    print(item.title)

4. Viewing the generated query

Django QuerySet:

>>> queryset.query

Elasticsearch query:

>>> search.to_dict()

5. Filter by single field containing a value

Django QuerySet:

queryset = queryset.filter(my_field__icontains=value)

Elasticsearch query:

search = search.filter('match_phrase', my_field=value)

6. Filter by single field equal to a value

Django QuerySet:

queryset = queryset.filter(my_field__exact=value)

Elasticsearch query:

search = search.filter('match', my_field=value)

If the field is a string rather than a number, it has to be defined as a KeywordField in the index document:

my_field = fields.KeywordField()

7. Filter with either of the conditions (OR)

Django QuerySet:

from django.db import models
queryset = queryset.filter(
    models.Q(my_field=value) |
    models.Q(my_field2=value2)
)

Elasticsearch query:

from elasticsearch_dsl.query import Q
search = search.query(
    Q('match', my_field=value) |
    Q('match', my_field2=value2)
)

8. Filter with all of the conditions (AND)

Django QuerySet:

from django.db import models
queryset = queryset.filter(
    models.Q(my_field=value) &
    models.Q(my_field2=value2)
)

Elasticsearch query:

from elasticsearch_dsl.query import Q
search = search.query(
    Q('match', my_field=value) & 
    Q('match', my_field2=value2)
)
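
When conditions are combined with | or &, they end up wrapped in a bool query. The following sketch reproduces the resulting JSON shapes with plain dictionaries; the helper names match, any_of, and all_of are hypothetical and not part of elasticsearch-dsl.

```python
def match(field, value):
    """Build a single match clause."""
    return {"match": {field: value}}


def any_of(*clauses):
    """OR: at least one of the clauses has to match."""
    return {"bool": {"should": list(clauses)}}


def all_of(*clauses):
    """AND: every clause has to match."""
    return {"bool": {"must": list(clauses)}}


or_query = any_of(match("my_field", "a"), match("my_field2", "b"))
and_query = all_of(match("my_field", "a"), match("my_field2", "b"))
print(or_query)
print(and_query)
```

You can compare these shapes with the output of search.to_dict() to check what a combined query actually sends to the server.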

9. Filter by values less than or equal to certain value

Django QuerySet:

from datetime import datetime

queryset = queryset.filter(
    published_at__lte=datetime.now(),
)

Elasticsearch query:

from datetime import datetime

search = search.filter(
    'range',
    published_at={'lte': datetime.now()}
)
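
Under the hood, a range filter is a small JSON clause, and the date has to be sent as a string. Here is a minimal sketch of that clause, assuming the datetime is serialized in ISO 8601 format; the helper name range_lte and the example date are made up for illustration.

```python
from datetime import datetime


def range_lte(field, moment):
    """Build a range clause matching values less than or equal to `moment`."""
    return {"range": {field: {"lte": moment.isoformat()}}}


clause = range_lte("published_at", datetime(2018, 6, 17, 12, 0, 0))
print(clause)
```

The other comparison operators work the same way: gt, gte, and lt take the place of lte in the inner dictionary.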

10. Filter by a value in a nested field

Django QuerySet:

queryset = queryset.filter(
    category__pk=category_id,
)

Elasticsearch query:

from elasticsearch_dsl.query import Q

search = search.filter(
    'nested', 
    path='category', 
    query=Q('match', category__pk=category_id)
)
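
Note that the double underscore in category__pk is translated into a dot, because Elasticsearch addresses fields of nested objects as category.pk. The sketch below shows the JSON shape of such a nested clause using plain dictionaries; the helper names to_es_field and nested_match are hypothetical.

```python
def to_es_field(name):
    """Translate Django-style double underscores into Elasticsearch dots."""
    return name.replace("__", ".")


def nested_match(path, field, value):
    """Build a nested clause that matches `field` inside the objects at `path`."""
    return {
        "nested": {
            "path": path,
            "query": {"match": {to_es_field(field): value}},
        }
    }


clause = nested_match("category", "category__pk", 5)
print(clause)
```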

11. Filter by one of many values in a related model

Django QuerySet:

queryset = queryset.filter(
    category__pk__in=category_ids,
)

Elasticsearch query:

import operator
from functools import reduce
from elasticsearch_dsl.query import Q

search = search.query(
    reduce(operator.or_, [
        Q(
            'nested',
            path='category',
            query=Q('match', category__pk=category_id),
        )
        for category_id in category_ids
    ])
)

Here the reduce() function combines a list of Q() conditions using the bitwise OR operator (|).
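
To see how reduce() folds such a list, here is a small self-contained sketch. The Condition class is a hypothetical stand-in for Q objects that only records how it was combined; operator.or_ is the function form of the | operator.

```python
import operator
from functools import reduce


class Condition(object):
    """Hypothetical stand-in for a Q object that records how it was combined."""

    def __init__(self, text):
        self.text = text

    def __or__(self, other):
        return Condition("({} OR {})".format(self.text, other.text))


conditions = [Condition("pk=%d" % pk) for pk in [1, 2, 3]]

# reduce() folds the list pairwise from the left: ((a | b) | c)
combined = reduce(operator.or_, conditions)
print(combined.text)
```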

12. Ordering

Django QuerySet:

queryset = queryset.order_by('-my_field', 'my_field2')

Elasticsearch query:

search = search.sort('-my_field', 'my_field2')
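
The leading minus sign is turned into a descending sort clause in the request body. A minimal sketch of that translation with plain dictionaries follows; the helper name to_sort_clauses is hypothetical.

```python
def to_sort_clauses(*ordering):
    """Convert Django-style ordering strings into Elasticsearch sort clauses."""
    clauses = []
    for name in ordering:
        if name.startswith("-"):
            # A leading minus means descending order
            clauses.append({name[1:]: {"order": "desc"}})
        else:
            # Plain field names sort in ascending order
            clauses.append(name)
    return clauses


body = {"sort": to_sort_clauses("-my_field", "my_field2")}
print(body)
```

Keep in mind that analyzed text fields are not sortable by default, so the fields you sort by should be keyword, numeric, or date fields.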

13. Creating a query dynamically

Django QuerySet:

import operator
from functools import reduce
from django.db import models

filters = []
if value1:
    filters.append(models.Q(
        my_field1=value1,
    ))
if value2:
    filters.append(models.Q(
        my_field2=value2,
    ))
# reduce() fails on an empty list, so filter only if there are conditions
if filters:
    queryset = queryset.filter(
        reduce(operator.and_, filters)
    )

Elasticsearch query:

import operator
from functools import reduce
from elasticsearch_dsl.query import Q

queries = []
if value1:
    queries.append(Q(
        'match',
        my_field1=value1,
    ))
if value2:
    queries.append(Q(
        'match',
        my_field2=value2,
    ))
# reduce() fails on an empty list, so query only if there are conditions
if queries:
    search = search.query(
        reduce(operator.and_, queries)
    )
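
One thing to watch out for: reduce() raises a TypeError when it is called on an empty list without an initial value, which is exactly what happens when no filter value was submitted. A small helper can take care of that case. The name combine is hypothetical, and Python sets stand in for Q objects here, because they support & and | as well.

```python
import operator
from functools import reduce


def combine(conditions, op=operator.and_):
    """Fold conditions into one, or return None when there is nothing to combine."""
    if not conditions:
        return None
    return reduce(op, conditions)


# Sets stand in for Q objects here, because they also support & and |.
print(combine([]))                              # None
print(combine([{1, 2, 3}, {2, 3, 4}]))          # intersection of both sets
print(combine([{1, 2}, {3}], op=operator.or_))  # union of both sets
```

The calling code can then apply the filter only when combine() returns something other than None.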

14. Pagination

Django QuerySet:

from django.core.paginator import (
    Paginator, Page, EmptyPage, PageNotAnInteger
)

paginator = Paginator(queryset, paginate_by)
page_number = request.GET.get('page')
try:
    page = paginator.page(page_number)
except PageNotAnInteger:
    page = paginator.page(1)
except EmptyPage:
    page = paginator.page(paginator.num_pages)

Elasticsearch query:

from django.core.paginator import (
    Paginator, Page, EmptyPage, PageNotAnInteger
)
from django.utils.functional import LazyObject

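class SearchResults(LazyObject):
    """Wraps a search object so that Django's Paginator can count and slice it."""

    def __init__(self, search_object):
        self._wrapped = search_object

    def __len__(self):
        # Ask Elasticsearch for the total number of hits
        return self._wrapped.count()

    def __getitem__(self, index):
        # Slicing limits the query to one page of results
        search_results = self._wrapped[index]
        if isinstance(index, slice):
            # Execute the query and return the hits as a list
            search_results = list(search_results)
        return search_results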

search_results = SearchResults(search)

paginator = Paginator(search_results, paginate_by)
page_number = request.GET.get('page')
try:
    page = paginator.page(page_number)
except PageNotAnInteger:
    page = paginator.page(1)
except EmptyPage:
    page = paginator.page(paginator.num_pages)

Elasticsearch doesn't work with Django's pagination by default. Therefore, we have to wrap the search query in a lazy SearchResults class that provides the necessary functionality.
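
What the paginator needs from the wrapped object is a total count and the ability to return a slice for the requested page. The following sketch shows that contract without Django or Elasticsearch; FakeSearch and page_slice are hypothetical names, and the 25 integers stand in for search hits.

```python
class FakeSearch(object):
    """Hypothetical stand-in for a search object over a fixed number of hits."""

    def __init__(self, total):
        self._hits = list(range(total))

    def count(self):
        return len(self._hits)

    def __getitem__(self, index):
        return self._hits[index]


def page_slice(page_number, per_page):
    """Return the slice of results that a 1-based page number covers."""
    start = (page_number - 1) * per_page
    return slice(start, start + per_page)


search = FakeSearch(total=25)
window = page_slice(3, 10)
print(window.start, window.stop)  # offset and end of the third page
print(search[window])             # only the last five hits remain
```

With a real search object, such a slice limits the request to that window of hits, so only one page of results is fetched at a time.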

Example

I built an example with books written about Django. You can download it from GitHub and test it.

Takeaways

  • Filtering with Elasticsearch can be much faster than with SQL databases: 2 to 16 times faster in our case.
  • But it comes at the cost of additional deployment and support time.
  • If you have multiple websites using Elasticsearch on the same server, configure a new cluster and node for each of those websites.
  • Many Django ORM operations have a close counterpart in Elasticsearch DSL.
  • I summarized the comparison of Django ORM and Elasticsearch DSL from this article in a cheat sheet. Print it on a single sheet of paper and use it as a reference in your development work.

Get Django ORM vs. Elasticsearch DSL Cheat Sheet


Cover photo by Karl Fredrickson.
