In my Web Development with Django Cookbook, the section Forms and Views has a recipe called Filtering object lists. It shows how to filter a Django QuerySet dynamically by different filter parameters selected in a form. In practice the approach works well, but with lots of data and complex nested filters, performance can suffer. Because of all those INNER JOINs in SQL, a page might take as long as 12 seconds to load, which is not acceptable. I know that I could denormalize the database or tune indexes to optimize the SQL, but I found a better way to speed up loading. Recently we started using Elasticsearch for one of our projects, and its data filtering turned out to be much faster: in our case, filtering became 2 to 16 times faster, depending on which query parameters were chosen.
What is Elasticsearch?
Elasticsearch is a Java-based search engine that stores data in JSON format and lets you query it using a special JSON-based query language. Using elasticsearch-dsl and django-elasticsearch-dsl, I can bind my Django models to Elasticsearch indexes and rewrite my object list views to use Elasticsearch queries instead of the Django ORM. The API of Elasticsearch DSL is chainable, like Django QuerySets or jQuery functions, and we'll have a look at it soon.
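To give a feel for that JSON-based query language, here is a minimal sketch that builds a query body as a plain Python dictionary and serializes it with the standard library. The field name title and the search term are assumptions made up for illustration; the Elasticsearch DSL libraries generate bodies of this general shape for you.

```python
import json

# Hypothetical query body: books whose title contains the phrase "django".
# The field name "title" and the value are illustrative assumptions.
query_body = {
    "query": {
        "bool": {
            "filter": [
                {"match_phrase": {"title": "django"}},
            ],
        },
    },
}

# Serialize the dictionary to the JSON text a search request would carry
query_json = json.dumps(query_body, sort_keys=True)
print(query_json)
```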
The Setup
First, let's install the Elasticsearch server. Elasticsearch is quite a complex system, but it comes with convenient configuration defaults.
On macOS you can install and start the server with Homebrew:
$ brew install elasticsearch
$ brew services start elasticsearch
For other platforms, the installation instructions are also quite clear.
Then install django-elasticsearch-dsl in your Django project's virtual environment. ("DSL" stands for "domain-specific language".)
With pipenv it would be the following from the project's directory:
$ pipenv install django-elasticsearch-dsl
If you are using plain pip and a virtual environment, run this with your project's environment activated:
(venv)$ pip install django-elasticsearch-dsl
This, in turn, will install the related lower-level client libraries: elasticsearch-dsl and elasticsearch-py.
In the Django project settings, add 'django_elasticsearch_dsl' to INSTALLED_APPS.
Finally, add the lines defining default connection configuration there:
ELASTICSEARCH_DSL = {
    'default': {
        'hosts': 'localhost:9200'
    },
}
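If the server does not always run on localhost, the host can be read from the environment instead of being hard-coded. This is only a sketch: the variable name ELASTICSEARCH_HOST is an assumption chosen for this example, not something the library looks up on its own.

```python
import os

# ELASTICSEARCH_HOST is a hypothetical environment variable name chosen
# for this sketch; fall back to the local default when it is not set.
ELASTICSEARCH_DSL = {
    'default': {
        'hosts': os.environ.get('ELASTICSEARCH_HOST', 'localhost:9200'),
    },
}
```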
Elasticsearch Documents for Django Models
To illustrate how to use Elasticsearch with Django, I'll create Author and Book models, and then an Elasticsearch index document for the books.
models.py
# -*- coding: UTF-8 -*-
from __future__ import unicode_literals

from django.db import models
from django.utils.translation import ugettext_lazy as _
from django.utils.encoding import python_2_unicode_compatible


@python_2_unicode_compatible
class Author(models.Model):
    first_name = models.CharField(_("First name"), max_length=200)
    last_name = models.CharField(_("Last name"), max_length=200)
    author_name = models.CharField(_("Author name"), max_length=200)

    class Meta:
        verbose_name = _("Author")
        verbose_name_plural = _("Authors")
        ordering = ("author_name",)

    def __str__(self):
        return self.author_name


@python_2_unicode_compatible
class Book(models.Model):
    title = models.CharField(_("Title"), max_length=200)
    authors = models.ManyToManyField(Author, verbose_name=_("Authors"))
    publishing_date = models.DateField(_("Publishing date"), blank=True, null=True)
    isbn = models.CharField(_("ISBN"), blank=True, max_length=20)

    class Meta:
        verbose_name = _("Book")
        verbose_name_plural = _("Books")
        ordering = ("title",)

    def __str__(self):
        return self.title
Nothing fancy here. Just an Author model with fields id, first_name, last_name, author_name, and a Book model with fields id, title, authors, publishing_date, and isbn. Let's go to the documents.
documents.py
In the same directory of your app, create documents.py with the following content:
# -*- coding: UTF-8 -*-
from __future__ import unicode_literals

from django_elasticsearch_dsl import DocType, Index, fields
from .models import Author, Book

# Name of the Elasticsearch index
search_index = Index('library')
# See Elasticsearch Indices API reference for available settings
search_index.settings(
    number_of_shards=1,
    number_of_replicas=0
)


@search_index.doc_type
class BookDocument(DocType):
    authors = fields.NestedField(properties={
        'first_name': fields.TextField(),
        'last_name': fields.TextField(),
        'author_name': fields.TextField(),
        'pk': fields.IntegerField(),
    }, include_in_root=True)

    isbn = fields.KeywordField(
        index='not_analyzed',
    )

    class Meta:
        model = Book  # The model associated with this DocType
        # The fields of the model you want to be indexed in Elasticsearch
        fields = [
            'title',
            'publishing_date',
        ]
        related_models = [Author]

    def get_instances_from_related(self, related_instance):
        """If related_models is set, define how to retrieve the Book instance(s) from the related model."""
        if isinstance(related_instance, Author):
            return related_instance.book_set.all()
Here we defined a BookDocument, which will have the fields title, publishing_date, authors, and isbn.
The authors field will be a list of nested dictionaries in the BookDocument. The isbn field will be a KeywordField, which means that its value will not be tokenized, lowercased, or otherwise processed; it is stored and matched as a whole, exactly as is.
The values for those document fields will be read from the Book model.
Using signals, the document will be automatically updated whenever a Book instance or an Author instance is added, changed, or deleted. In the method get_instances_from_related(), we tell the search engine which books to update when an author is updated.
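To picture what ends up in the index, the following sketch builds by hand the kind of dictionary that one indexed book could correspond to. It uses plain Python data instead of the real models, the sample book and author are invented, and the exact serialization done by django-elasticsearch-dsl may differ in details.

```python
from datetime import date


def book_to_document(title, publishing_date, isbn, authors):
    """Build a dictionary shaped like one indexed book (illustrative only)."""
    return {
        'title': title,
        'publishing_date': publishing_date.isoformat(),
        'isbn': isbn,  # keyword: kept as one unprocessed value
        'authors': [  # nested: one dictionary per related author
            {
                'pk': author['pk'],
                'first_name': author['first_name'],
                'last_name': author['last_name'],
                'author_name': author['author_name'],
            }
            for author in authors
        ],
    }


# Invented sample data, not taken from the example project
document = book_to_document(
    title='Sample Book',
    publishing_date=date(2018, 1, 31),
    isbn='000-0-00-000000-0',
    authors=[{
        'pk': 1,
        'first_name': 'Jane',
        'last_name': 'Doe',
        'author_name': 'Jane Doe',
    }],
)
print(document)
```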
Building the Index
When the index document is ready, let's build the index on the server:
(venv)$ python manage.py search_index --rebuild
Django QuerySets vs. Elasticsearch Queries
The concepts of SQL and Elasticsearch queries are quite different. One works with relational tables and the other works with dictionaries. One uses queries that read like human-readable logical sentences and the other uses nested JSON structures. One matches the content verbatim and the other does string processing in the background and gives a search relevance score for each result.
Even though there are lots of differences, I will try to draw analogies between the Django ORM and the elasticsearch-dsl API as closely as possible.
1. Query definition
Django QuerySet:
queryset = MyModel.objects.all()
Elasticsearch query:
search = MyModelDocument.search()
2. Count
Django QuerySet:
count = queryset.count()
Elasticsearch query:
count = search.count()
3. Iteration
Django QuerySet:
for item in queryset:
    print(item.title)
Elasticsearch query:
for item in search:
    print(item.title)
4. Viewing the generated query
Django QuerySet:
>>> queryset.query
Elasticsearch query:
>>> search.to_dict()
5. Filter by single field containing a value
Django QuerySet:
queryset = queryset.filter(my_field__icontains=value)
Elasticsearch query:
search = search.filter('match_phrase', my_field=value)
6. Filter by single field equal to a value
Django QuerySet:
queryset = queryset.filter(my_field__exact=value)
Elasticsearch query:
search = search.filter('match', my_field=value)
If the field is a string rather than a number, it has to be defined as a KeywordField in the index document:
my_field = fields.KeywordField()
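The difference between an analyzed text field and a keyword field can be imitated in a few lines of plain Python. This is a deliberately simplified stand-in for what Elasticsearch's analyzers do (real analyzers handle punctuation, Unicode, and much more), and the function names are made up for this sketch.

```python
import re


def analyze_as_text(value):
    """Rough imitation of a text field: lowercase and split into tokens."""
    return re.findall(r'\w+', value.lower())


def analyze_as_keyword(value):
    """A keyword field keeps the whole value as a single, unchanged term."""
    return [value]


value = 'ABC-123'
text_tokens = analyze_as_text(value)        # ['abc', '123']
keyword_tokens = analyze_as_keyword(value)  # ['ABC-123']

# The full original value only survives intact in the keyword version
print(value in text_tokens, value in keyword_tokens)
```

That is why exact matching on a string value is more predictable when the field is indexed as a keyword.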
7. Filter with either of the conditions (OR)
Django QuerySet:
from django.db import models

queryset = queryset.filter(
    models.Q(my_field=value) |
    models.Q(my_field2=value2)
)
Elasticsearch query:
from elasticsearch_dsl.query import Q

search = search.query(
    Q('match', my_field=value) |
    Q('match', my_field2=value2)
)
8. Filter with all of the conditions (AND)
Django QuerySet:
from django.db import models

queryset = queryset.filter(
    models.Q(my_field=value) &
    models.Q(my_field2=value2)
)
Elasticsearch query:
from elasticsearch_dsl.query import Q

search = search.query(
    Q('match', my_field=value) &
    Q('match', my_field2=value2)
)
9. Filter by values less than or equal to certain value
Django QuerySet:
from datetime import datetime

queryset = queryset.filter(
    published_at__lte=datetime.now(),
)
Elasticsearch query:
from datetime import datetime

search = search.filter(
    'range',
    published_at={'lte': datetime.now()}
)
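Under the hood, a range filter is a small JSON structure. The sketch below builds one by hand; the helper name is hypothetical, and passing the date as an ISO 8601 string is an assumption about how the field is mapped.

```python
from datetime import datetime


def range_lte(field, moment):
    """Return a range clause matching values <= the given datetime."""
    return {'range': {field: {'lte': moment.isoformat()}}}


clause = range_lte('published_at', datetime(2018, 6, 1, 12, 0, 0))
print(clause)
```

Other comparisons follow the same pattern with the keys lt, gt, and gte.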
10. Filter by a value in a nested field
Django QuerySet:
queryset = queryset.filter(
    category__pk=category_id,
)
Elasticsearch query:
from elasticsearch_dsl.query import Q

search = search.filter(
    'nested',
    path='category',
    query=Q('match', category__pk=category_id)
)
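In the keyword arguments above, the double underscore in category__pk stands for a dot, so the clause targets the field category.pk inside the nested category objects. A hand-written equivalent of that clause, as a sketch with a hypothetical helper name, looks like this:

```python
def nested_match(path, field, value):
    """Build a nested clause matching `field` inside objects under `path`."""
    return {
        'nested': {
            'path': path,
            'query': {
                # The inner field is addressed by its full dotted name
                'match': {'{}.{}'.format(path, field): value},
            },
        },
    }


clause = nested_match('category', 'pk', 3)
print(clause)
```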
11. Filter by one of many values in a related model
Django QuerySet:
queryset = queryset.filter(
    category__pk__in=category_ids,
)
Elasticsearch query:
import operator
from functools import reduce
from elasticsearch_dsl.query import Q

search = search.query(
    reduce(operator.ior, [
        Q(
            'nested',
            path='category',
            query=Q('match', category__pk=category_id),
        )
        for category_id in category_ids
    ])
)
Here the reduce() function combines a list of Q() conditions using the bitwise OR operator (|).
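How reduce() folds a list with an operator is easy to check without a search server. The Cond class below is a stand-in invented for this sketch, not part of elasticsearch-dsl; it only records how the conditions get combined with |.

```python
import operator
from functools import reduce


class Cond(object):
    """Toy condition that supports the | operator, like Q() objects do."""

    def __init__(self, text):
        self.text = text

    def __or__(self, other):
        return Cond('({} OR {})'.format(self.text, other.text))


conditions = [Cond('pk={}'.format(pk)) for pk in [1, 2, 3]]

# reduce() computes ((conditions[0] | conditions[1]) | conditions[2]).
# operator.ior falls back to __or__ because Cond defines no __ior__.
combined = reduce(operator.ior, conditions)
print(combined.text)
```

Note that reduce() raises a TypeError when the list is empty, so it is worth checking that at least one value was selected before combining.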
12. Ordering
Django QuerySet:
queryset = queryset.order_by('-my_field', 'my_field2')
Elasticsearch query:
search = search.sort('-my_field', 'my_field2')
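The leading minus sign means descending order in both APIs. As a sketch of what that corresponds to in a request body (the helper name is made up), Django-style ordering strings can be turned into Elasticsearch sort entries like this:

```python
def to_sort_entries(*ordering):
    """Convert Django-style ordering strings into Elasticsearch sort entries."""
    entries = []
    for name in ordering:
        if name.startswith('-'):
            # Strip the minus sign and request descending order
            entries.append({name[1:]: {'order': 'desc'}})
        else:
            entries.append({name: {'order': 'asc'}})
    return entries


sort_entries = to_sort_entries('-my_field', 'my_field2')
print(sort_entries)
```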
13. Creating query dynamically
Django QuerySet:
import operator
from functools import reduce
from django.db import models

filters = []
if value1:
    filters.append(models.Q(
        my_field1=value1,
    ))
if value2:
    filters.append(models.Q(
        my_field2=value2,
    ))
if filters:  # reduce() raises TypeError on an empty list
    queryset = queryset.filter(
        reduce(operator.iand, filters)
    )
Elasticsearch query:
import operator
from functools import reduce
from elasticsearch_dsl.query import Q

queries = []
if value1:
    queries.append(Q(
        'match',
        my_field1=value1,
    ))
if value2:
    queries.append(Q(
        'match',
        my_field2=value2,
    ))
if queries:  # reduce() raises TypeError on an empty list
    search = search.query(
        reduce(operator.iand, queries)
    )
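The same idea can be shown without any library: collect a clause for every filter value that was actually submitted, and combine them only if there is at least one. The function below is a hypothetical sketch that produces a raw query body; the field names and values are placeholders.

```python
def build_query_body(**field_values):
    """Build a query body that requires a match on every non-empty value."""
    clauses = [
        {'match': {field: value}}
        for field, value in sorted(field_values.items())
        if value  # skip filters the user left empty
    ]
    if not clauses:
        # No filters selected: match every document
        return {'query': {'match_all': {}}}
    return {'query': {'bool': {'must': clauses}}}


body = build_query_body(my_field1='django', my_field2='')
empty_body = build_query_body(my_field1=None)
print(body)
print(empty_body)
```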
14. Pagination
Django QuerySet:
from django.core.paginator import (
    Paginator, Page, EmptyPage, PageNotAnInteger
)

paginator = Paginator(queryset, paginate_by)
page_number = request.GET.get('page')
try:
    page = paginator.page(page_number)
except PageNotAnInteger:
    page = paginator.page(1)
except EmptyPage:
    page = paginator.page(paginator.num_pages)
Elasticsearch query:
from django.core.paginator import (
    Paginator, Page, EmptyPage, PageNotAnInteger
)
from django.utils.functional import LazyObject


class SearchResults(LazyObject):
    def __init__(self, search_object):
        self._wrapped = search_object

    def __len__(self):
        return self._wrapped.count()

    def __getitem__(self, index):
        search_results = self._wrapped[index]
        if isinstance(index, slice):
            search_results = list(search_results)
        return search_results


search_results = SearchResults(search)
paginator = Paginator(search_results, paginate_by)
page_number = request.GET.get('page')
try:
    page = paginator.page(page_number)
except PageNotAnInteger:
    page = paginator.page(1)
except EmptyPage:
    page = paginator.page(paginator.num_pages)
Elasticsearch doesn't work with Django's pagination by default. Therefore, we have to wrap the search query in a lazy SearchResults class to provide the necessary functionality.
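Behind the wrapper, each page is just a slice of the search, and slicing an elasticsearch-dsl search sets the from and size parameters of the request. A small sketch of that arithmetic, with a made-up helper name:

```python
def page_window(page_number, paginate_by):
    """Return the slice and the from/size parameters for a 1-based page."""
    start = (page_number - 1) * paginate_by
    stop = start + paginate_by
    return slice(start, stop), {'from': start, 'size': paginate_by}


window, params = page_window(page_number=3, paginate_by=24)
print(window, params)
```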
Example
I built an example with books written about Django. You can download it from GitHub and test it.
Takeaways
- Filtering with Elasticsearch can be much faster than with SQL databases: in our case, it was 2 to 16 times faster.
- But it comes at the cost of additional deployment and support time.
- If you have multiple websites using Elasticsearch on the same server, configure a new cluster and node for each of those websites.
- The Django ORM can, to a degree, be mapped to the Elasticsearch DSL.
- I summarized the comparison of Django ORM and Elasticsearch DSL from this article in a cheat sheet. Print it on a single sheet of paper and use it as a reference during development.
Get Django ORM vs. Elasticsearch DSL Cheat Sheet
Cover photo by Karl Fredrickson.