DjangoTricks: PostgreSQL

Showing posts with label PostgreSQL. Show all posts

2022-04-09

Generic Functionality without Generic Relations

When you have some generic functionality like anything commentable, likable, or upvotable, it’s common to use Generic Relations in Django. The problem with Generic Relations is that they create the relationships at the application level instead of the database level, and that requires a lot of database queries if you want to aggregate content that shares the generic functionality. There is another way that I will show you in this article.

I learned this technique at my first job in 2002 and then rediscovered it again with Django a few years ago. The trick is to have a generic Item model where every other autonomous model has a one-to-one relation to the Item. Moreover, the Item model has an item_type field, allowing you to recognize the backward one-to-one relationship.

Then whenever you need to have some generic categories, you link them to the Item. Whenever you create generic functionality like media gallery, comments, likes, or upvotes, you attach them to the Item. Whenever you need to work with permissions, publishing status, or workflows, you deal with the Item. Whenever you need to create a global search or trash bin, you work with the Item instances.

Let’s have a look at some code.

Items

First, I'll create the items app with two models: the previously mentioned Item and the abstract model ItemBase with the one-to-one relation for various models to inherit:

# items/models.py
import sys

from django.db import models
from django.apps import apps

if "makemigrations" in sys.argv:
    from django.utils.translation import gettext_noop as _
else:
    from django.utils.translation import gettext_lazy as _


class Item(models.Model):
    """
    A generic model for all autonomous models to link to.
    
    Currently these autonomous models are available:
    - content.Post
    - companies.Company
    - accounts.User
    """
    ITEM_TYPE_CHOICES = (
        ("content.Post", _("Post")),
        ("companies.Company", _("Company")),
        ("accounts.User", _("User")),
    )
    item_type = models.CharField(
        max_length=200, choices=ITEM_TYPE_CHOICES, editable=False, db_index=True
    )

    class Meta:
        verbose_name = _("Item")
        verbose_name_plural = _("Items")

    def __str__(self):
        content_object_title = (
            str(self.content_object) if self.content_object else "BROKEN REFERENCE"
        )
        return (
            f"{content_object_title} ({self.get_item_type_display()})"
        )

    @property
    def content_object(self):
        app_label, model_name = self.item_type.split(".")
        model = apps.get_model(app_label, model_name)
        return model.objects.filter(item=self).first()


class ItemBase(models.Model):
    """
    An abstract model for the autonomous models that will link to the Item.
    """
    item = models.OneToOneField(
        Item,
        verbose_name=_("Item"),
        editable=False,
        blank=True,
        null=True,
        on_delete=models.CASCADE,
        related_name="%(app_label)s_%(class)s",
    )

    class Meta:
        abstract = True

    def save(self, *args, **kwargs):
        if not self.item:
            model = type(self)
            item = Item.objects.create(
                item_type=f"{model._meta.app_label}.{model.__name__}"
            )
            self.item = item
        super().save()

    def delete(self, *args, **kwargs):
        if self.item:
            self.item.delete()
        super().delete(*args, **kwargs)

Then let's create some autonomous models that will have one-to-one relations with the Item. By "autonomous models," I mean those which are enough by themselves, such as posts, companies, or accounts. Models like types, categories, tags, or likes, wouldn't be autonomous.

Posts

Second, I create the content app with the Post model. This model extends ItemBase which will create the one-to-one relation on save, and will define the item_type as content.Post:

# content/models.py
import sys

from django.contrib.auth.base_user import BaseUserManager
from django.db import models
from django.contrib.auth.models import AbstractUser

if "makemigrations" in sys.argv:
    from django.utils.translation import gettext_noop as _
else:
    from django.utils.translation import gettext_lazy as _

from items.models import ItemBase


class Post(ItemBase):
    title = models.CharField(_("Title"), max_length=255)
    slug = models.SlugField(_("Slug"), max_length=255)
    content = models.TextField(_("Content"))

    class Meta:
        verbose_name = _("Post")
        verbose_name_plural = _("Posts")

Companies

Third, I create the companies app with the Company model. This model also extends ItemBase which will create the one-to-one relation on save, and will define the item_type as companies.Company:

# companies/models.py
import sys

from django.contrib.auth.base_user import BaseUserManager
from django.db import models
from django.contrib.auth.models import AbstractUser

if "makemigrations" in sys.argv:
    from django.utils.translation import gettext_noop as _
else:
    from django.utils.translation import gettext_lazy as _

from items.models import ItemBase


class Company(ItemBase):
    name = models.CharField(_("Name"), max_length=255)
    slug = models.SlugField(_("Slug"), max_length=255)
    description = models.TextField(_("Description"))

    class Meta:
        verbose_name = _("Company")
        verbose_name_plural = _("Companies")

Accounts

Fourth, I'll have a more extensive example with the accounts app containing the User model. This model extends AbstractUser from django.contrib.auth as well as ItemBase for the one-to-one relation. The item_type set at the Item model will be accounts.User:

# accounts/models.py
import sys

from django.db import models
from django.contrib.auth.base_user import BaseUserManager
from django.contrib.auth.models import AbstractUser

if "makemigrations" in sys.argv:
    from django.utils.translation import gettext_noop as _
else:
    from django.utils.translation import gettext_lazy as _

from items.models import ItemBase


class UserManager(BaseUserManager):
    def create_user(self, username="", email="", password="", **extra_fields):
        if not email:
            raise ValueError("Enter an email address")
        email = self.normalize_email(email)
        user = self.model(username=username, email=email, **extra_fields)
        user.set_password(password)
        user.save(using=self._db)
        return user

    def create_superuser(self, username="", email="", password=""):
        user = self.create_user(email=email, password=password, username=username)
        user.is_superuser = True
        user.is_staff = True
        user.save(using=self._db)
        return user


class User(AbstractUser, ItemBase):
    # change username to non-editable non-required field
    username = models.CharField(
        _("Username"), max_length=150, editable=False, blank=True
    )
    # change email to unique and required field
    email = models.EmailField(_("Email address"), unique=True)
    bio = models.TextField(_("Bio"))

    USERNAME_FIELD = "email"
    REQUIRED_FIELDS = []

    objects = UserManager()

Creating new items

I will use the Django shell to create several autonomous model instances and the related Items too:

>>> from content.models import Post
>>> from companies.models import Company
>>> from accounts.models import User
>>> from items.models import Item
>>> post = Post.objects.create(
...     title="Hello, World!",
...     slug="hello-world",
...     content="Lorem ipsum…",
... )
>>> company = Company.objects.create(
...     name="Aidas & Co",
...     slug="aidas-co",
...     description="Lorem ipsum…",
... )
>>> user = User.objects.create_user(
...     username="aidas",
...     email="aidas@example.com",
...     password="jdf234oha&6sfhasdfh",
... )
>>> Item.objects.count()
3

Aggregating content from all those relations

Lastly, here is an example of having posts, companies, and users in a single view. For that, we will use the Item queryset with annotations:

from django import forms
from django.db import models
from django.shortcuts import render
from django.utils.translation import gettext, gettext_lazy as _

from .models import Item


class SearchForm(forms.Form):
    q = forms.CharField(label=_("Search"), required=False)
    

def all_items(request):
    qs = Item.objects.annotate(
        title=models.Case(
            models.When(
                item_type="content.Post", 
                then="content_post__title",
            ),
            models.When(
                item_type="companies.Company", 
                then="companies_company__name",
            ),
            models.When(
                item_type="accounts.User",
                then="accounts_user__email",
            ),
            default=models.Value(gettext("<Untitled>")),
        ),
        description=models.Case(
            models.When(
                item_type="content.Post",
                then="content_post__content",
            ),
            models.When(
                item_type="companies.Company",
                then="companies_company__description",
            ),
            models.When(
                item_type="accounts.User", 
                then="accounts_user__bio",
                ),
            default=models.Value(""),
        ),
    )
    
    form = SearchForm(data=request.GET, prefix="search")
    if form.is_valid():
        query = form.cleaned_data["q"]
        if query:
            qs = qs.annotate(
                search=SearchVector(
                    "title",
                    "description",
                )
            ).filter(search=query)

    context = {
        "queryset": qs,
        "search_form": form,
    }
    return render(request, "items/all_items.html", context)

Final words

You can have generic functionality and still avoid multiple hits to the database by using the Item one-to-one approach instead of generic relations.

The name of the Item model can be different, and you can even have multiple such models for various purposes, for example, TaggedItem for tags only.

Do you use anything similar in your projects?

Do you see how this approach could be improved?

Let me know in the comments!

Cover picture by Pixabay

2018-05-27

QuerySet Filters on Many-to-many Relations

Django ORM (Object-relational mapping) makes querying the database so intuitive, that at some point you might forget that SQL is being used in the background.

This year at the DjangoCon Europe Katie McLaughlin was giving a talk and mentioned one thing that affects the SQL query generated by Django ORM, depending on how you call the QuerySet or manager methods. This particularity is especially relevant when you are creating your QuerySets dynamically. Here it is. When you have a many-to-many relationship, and you try to filter objects by the fields of the related model, every new filter() method of a QuerySet creates a new INNER JOIN clause. I won't discuss whether that's a Django bug or a feature, but these are my observations about it.

The Books and Authors Example

Let's create an app with books and authors, where each book can be written by multiple authors.

# -*- coding: UTF-8 -*-
from __future__ import unicode_literals

from django.db import models
from django.utils.translation import ugettext_lazy as _
from django.utils.encoding import python_2_unicode_compatible


@python_2_unicode_compatible
class Author(models.Model):
    first_name = models.CharField(_("First name"), max_length=200)
    last_name = models.CharField(_("Last name"), max_length=200)
    author_name = models.CharField(_("Author name"), max_length=200)

    class Meta:
        verbose_name = _("Author")
        verbose_name_plural = _("Authors")
        ordering = ("author_name",)

    def __str__(self):
        return self.author_name


@python_2_unicode_compatible
class Book(models.Model):
    title = models.CharField(_("Title"), max_length=200)
    authors = models.ManyToManyField(Author, verbose_name=_("Authors"))
    publishing_date = models.DateField(_("Publishing date"), blank=True, null=True)

    class Meta:
        verbose_name = _("Book")
        verbose_name_plural = _("Books")
        ordering = ("title",)

    def __str__(self):
        return self.title

The similar app with sample data can be found in this repository.

Inefficient Filter

With the above models, you could define the following QuerySet to select books which author is me, Aidas Bendoraitis.

queryset = Book.objects.filter(
    authors__first_name='Aidas',
).filter(
    authors__last_name='Bendoraitis',
)

We can check what SQL query it would generate with str(queryset.query) (or queryset.query.__str__()).

The output would be something like this:

SELECT `libraryapp_book`.`id`, `libraryapp_book`.`title`, `libraryapp_book`.`publishing_date`
FROM `libraryapp_book`
INNER JOIN `libraryapp_book_authors` ON ( `libraryapp_book`.`id` = `libraryapp_book_authors`.`book_id` )
INNER JOIN `libraryapp_author` ON ( `libraryapp_book_authors`.`author_id` = `libraryapp_author`.`id` )
INNER JOIN `libraryapp_book_authors` T4 ON ( `libraryapp_book`.`id` = T4.`book_id` )
INNER JOIN `libraryapp_author` T5 ON ( T4.`author_id` = T5.`id` )
WHERE (`libraryapp_author`.`first_name` = 'Aidas' AND T5.`last_name` = 'Bendoraitis')
ORDER BY `libraryapp_book`.`title` ASC;

Did you notice, that the database table libraryapp_author was attached through the libraryapp_book_authors table to the libraryapp_book table TWICE where just ONCE would be enough?

Efficient Filter

On the other hand, if you are defining query expressions in the same filter() method like this:

queryset = Book.objects.filter(
    authors__first_name='Aidas',
    authors__last_name='Bendoraitis',
)

The generated SQL query will be much shorter and (theoretically) would perform faster:

SELECT `libraryapp_book`.`id`, `libraryapp_book`.`title`, `libraryapp_book`.`publishing_date`
FROM `libraryapp_book`
INNER JOIN `libraryapp_book_authors` ON ( `libraryapp_book`.`id` = `libraryapp_book_authors`.`book_id` )
INNER JOIN `libraryapp_author` ON ( `libraryapp_book_authors`.`author_id` = `libraryapp_author`.`id` )
WHERE (`libraryapp_author`.`first_name` = 'Aidas' AND `libraryapp_author`.`last_name` = 'Bendoraitis')
ORDER BY `libraryapp_book`.`title` ASC;

The same SQL query can be achieved using the Q() objects:

queryset = Book.objects.filter(
    models.Q(authors__first_name='Aidas') &
    models.Q(authors__last_name='Bendoraitis')
)

The Q() objects add a lot of flexibility to filters allowing to OR, AND, and negate query expressions.

Dynamic Filtering

So to have faster performance, when creating QuerySets dynamically, DON'T use filter() multiple times:

queryset = Book.objects.all()
if first_name:
    queryset = queryset.filter(
        authors__first_name=first_name,
    )
if last_name:
    queryset = queryset.filter(
        authors__last_name=last_name,
    )

DO this instead:

filters = models.Q()
if first_name:
    filters &= models.Q(
        authors__first_name=first_name,
    )
if last_name:
    filters &= models.Q(
        authors__last_name=last_name,
    )
queryset = Book.objects.filter(filters)

Here the empty Q() doesn't have any impact for the generated SQL query, so you don't need the complexity of creating a list of filters and then joining all of them with the bitwise AND operator, like this:

import operator
from django.utils.six.moves import reduce

filters = []
if first_name:
    filters.append(models.Q(
        authors__first_name=first_name,
    ))
if last_name:
    filters.append(models.Q(
        authors__last_name=last_name,
    ))
queryset = Book.objects.filter(reduce(operator.iand, filters))

Profiling

In DEBUG mode, you can check how long the previously executed SQL queries took by checking django.db.connection.queries:

>>> from django.db import connection
>>> connection.queries
[{'sql': 'SELECT …', 'time': '0.001'}, {'sql': 'SELECT …', 'time': '0.004'}]

The Takeaways

When querying many-to-many relationships, avoid using multiple filter() methods, make use of Q() objects instead.
You can check the SQL query of a QuerySet with str(queryset.query).
Check the performance of recently executed SQL queries with django.db.connection.queries.
With small datasets, the performance difference is not so obvious. For your specific cases you should do the benchmarks yourself.

Cover photo by Tobias Fischer.

2017-12-08

From MySQL to PostgreSQL

In this article I will guide you through the steps I had to take to migrate Django projects from MySQL to PostgreSQL.

MySQL database has proven to be a good start for small and medium scale projects. It is known and used widely and has good documentation. Also there are great clients for easy management, like phpMyAdmin (web), HeidiSQL (windows), or Sequel Pro (macOS). However, in my professional life there were unfortunate moments, when databases from different projects crashed because of large queries or file system errors. Also I had database integrity errors which appeared in the MySQL databases throughout years because of different bugs at the application level.

When one thinks about scaling a project, they have to choose something more suitable. It should be something that is fast, reliable, and well supports ANSI standards of relational databases. Something that most top Django developers use. And the database of choice for most professionals happens to be PostgreSQL. PostgreSQL enables using several vendor-specific features that were not possible with MySQL, e.g. multidimensional arrays, JSON fields, key-value pair fields, special case-insensitive text fields, range fields, special indexes, full-text search, etc. For a newcomer, the best client that I know - pgAdmin (macOS, linux, windows) - might seem too complex at first, compared with MySQL clients, but as you have Django administration and handy ORM, you probably won't need to inspect the database in raw format too often.

So what does it take to migrate from MySQL to PostgreSQL? We will do that in a few steps and we will be using pgloader to help us with data migration. I learned about this tool from Louise Grandjonc, who was giving a presentation about PostgreSQL query optimization at DjangoCon Europe 2017.

One prerequisite for the migration are passing tests. You need to have functional tests to check if all pages are functioning correctly and unit tests to check at least the most critical or complex classes, methods, and functions.

1. Prepare your MySQL database

Make sure that your production MySQL database migration state is up to date:

(env)$ python manage.py migrate --settings=settings.production

Then create a local copy of your production MySQL database. We are going to use it for the migration.

2. Install pgloader

As I mentioned before, for the database migration we will use a tool called pgloader (version 3.4.1 or higher). This tool was programmed by Dimitri Fontaine and is available as an open source project on GitHub. You can compile the required version from the source. Or if you are using macOS, you can install it with Homebrew:

$ brew update
$ brew install pgloader

Note that PostgreSQL will also be installed as a dependency.

3. Create a new PostgreSQL user and database

Unlike with MySQL, creating new database users and databases with PostgreSQL usually happen in the shell rather than in the database client.

Let's create a user and database with the same name myproject.

$ createuser --createdb --password myproject
$ createdb --username=myproject myproject

The --createdb parameter will enable privilege to create databases. The --password parameter will offer to enter a password. The --username parameter will set the owner of the created database.

4. Create the schema

Link the project to this new PostgreSQL database in the settings, for example:

DATABASES = {
    'postgresql': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': get_secret("DATABASE_NAME"),
        'USER': get_secret("DATABASE_USER"),
        'PASSWORD': get_secret("DATABASE_PASSWORD"),
    },
}
DATABASES['default'] = DATABASES['postgresql']

Here the custom get_secret() function returns sensitive information from environment variables or a text file that is not tracked under version control. Its implementation is up to you.

Run the migrations to create tables and foreign key constraints in the new PostgreSQL database:

(env)$ python manage.py migrate --settings=settings.local

5. Create the data migration script

The pgloader uses configuration files with the settings defining how to deal with migrations. Let's create the configuration file myproject.load with the following content:

LOAD DATABASE
    FROM mysql://mysql_username:mysql_password@localhost/mysql_dbname
    INTO postgresql:///myproject
    WITH truncate, data only, disable triggers, preserve index names, include no drop, reset sequences
    ALTER SCHEMA 'mysql_dbname' RENAME TO 'public'
;

6. Run data migration

Now it's time to copy the data:

$ pgloader myproject.load

Typically you will get a bunch of warnings about type conversions. These can probably be ignored, because the script will take its best guess how to convert data when importing. If in addition you get errors about duplicated data or tables with foreign keys to missing entries, you will need to fix the issues at MySQL database and then repeat the process. In that case, clean up the MySQL database, update your local copy, recreate PostgreSQL database with dropdb and createdb commands, run Django migrations to create the database schema, and copy the data again.

7. Adapt the code

When the database is successfully migrated, we should run Django project tests and fix all PostgreSQL-specific problems in the project's code. The code running Django ORM will run smoothly, but very likely there will be issues with raw SQL, QuerySet's extra() method, and type conversions.

Typically, these are the differences that you might have to keep in mind:

String values in PostgreSQL queries are always quoted with 'single quotes'.
PostgreSQL doesn't convert types when comparing values automatically as MySQL does. If you use any raw SQL, you will need to do some casting before comparison like CAST(blog_post.id AS text) = likes_like.object_id or blog_post.id::text = likes_like.object_id. The latter double-colon syntax is not understood by MySQL, so if you want to support both databases, you will need to write vendor-specific cases for each database management system.
PostgreSQL is case-sensitive for string comparisons, so in the QuerySet filters you will need to use *__iexact lookup instead *__exact and *__icontains lookup instead of *__contains.

When ordering, convert the column to lowercase with the Lower() function:

from django.db import models
posts = Post.objects.order_by(models.Lower('title'))

When using *__in lookup, make sure that the type of the listed elements matches the type of the model field. For example, you may have a Like model with generic relation, i.e. content_type and object_id fields that together combine a generic foreign key to any model instance. The object_id field is usually of a string type, but the IDs of related models might be integers, strings, or UUIDs. If you then want to get the liked Posts which primary keys are integers, you would need to convert the object_id values to integers before assigning them to the pk__in lookup in the filter:
```
liked_ids = map(int, Like.objects.filter(
    user=request.user,
    content_type=ContentType.objects.get_for_model(Post)
).values("object_id", flat=True))

liked_posts = Post.objects.filter(pk__in=liked_ids)
```

8. Repeat the process for production

When you are sure that the migration process is fluent and all Django tests pass, you can take your production website down, repeat the migration process locally with the latest production data, copy the migrated local database to production server, update the production code, install new dependencies, and take the website back online.

To create a database dump you can use command:

$ pg_dump --format=c --compress=9 --file=myproject.backup myproject

To restore or create the database from dump use commands:

$ dropdb --username=pgsql myproject
$ createdb --username=myproject myproject
$ pg_restore --dbname=myproject --role=myproject --schema=public myproject.backup

I might probably miss some points and there are some ways to automate the upgrade process for production, but you got the idea.

Conclusion

PostgreSQL is more restrictive than MySQL, but it provides greater performance, more stability, and better compliance with standards. In addition, in PostgreSQL there is a bunch of features that were not available in MySQL. If you are lucky, you can switch your project from MySQL to PostgreSQL in one day.

Cover photo by Casey Allen