Sept. 17, 2018

Django development guidelines by Initech

As one of the leading Israel-based software development companies, specializing in the Django framework for over 8 years, we would like to share with the community some of the guidelines we established for our internal development process. We believe that most server-side projects developed with Django can benefit from them.

We hope they will help you make the most of the beautiful and powerful tool that is the Django framework.

Don’t hesitate to contact us if you need an expert review of your project or want to develop a new product.

Let’s talk!

contact@initech.co.il

Initech Software Services Ltd

www.initech.co.il

 

pip requirements

We use an approach, developed by Kenneth Reitz, described in this blog post: https://www.kennethreitz.org/essays/a-better-pip-workflow

The idea is that you have a virtualenv, created with the correct Python interpreter, and inside it you use pip to install and update packages.

The developer maintains two files (in git): one, called requirements_to_freeze.txt, lists the direct (not transitive) dependencies of the code the developer wrote, and the other, called requirements.txt, lists all installed packages with their exact versions for reproducibility (especially important for deploying to prod the exact versions that QA tested).

So if the code says “import foo”, foo is listed in the first file. That file does not specify the version of foo, unless the developer knows that the code as written will not work with the latest version of foo. Only in that case, the version is specified, usually as <X and not ==X, to say that version X breaks our code.

The first file is updated every time the developer adds or removes a direct dependency. The file can and should contain comments explaining what a non-obvious dependency is used for and which non-obvious OS/native packages must be installed before it can be installed.

After the developer has installed some versions of the packages and tested the code (at least locally), the developer should “lock” the exact versions with which the code worked.

This is done by running “pip freeze -r requirements_to_freeze.txt > requirements.txt”.

requirements.txt is often manually edited to add a header reminding devs not to edit it manually.
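The relationship between the two files can be checked mechanically. The following is a minimal sketch and not part of the workflow described in the blog post: the function name unpinned_direct_dependencies is made up for illustration, and the parsing assumes simple lines such as "foo", "foo<X" or "foo==X", plus comments and option lines starting with "-".

```python
import re


def _normalize(name):
    # Compare names case-insensitively, treating "-", "_" and "." as equivalent.
    return re.sub(r"[-_.]+", "-", name).lower()


def _package_names(text):
    """Yield the normalized package name from each requirement line."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and options like "-r file"
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            yield _normalize(match.group(0))


def unpinned_direct_dependencies(to_freeze_text, frozen_text):
    """Return direct dependencies that have no exact '==' pin in the frozen file."""
    pinned = {
        name
        for line in frozen_text.splitlines()
        if "==" in line
        for name in _package_names(line)
    }
    return sorted(set(_package_names(to_freeze_text)) - pinned)
```

An empty result means every direct dependency has an exact version in requirements.txt; anything else suggests the freeze step was skipped after a dependency was added.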

Different environments often need different packages. For example, zeev likes to install pyinstrument or django-debug-toolbar, neither of which should be installed on servers.

To do that, create a requirements file named after the environment, and write in it, for example:

-r requirements.txt

django-debug-toolbar

Be very careful to always run pip freeze in a clean, newly created venv that does not contain any development-only packages such as profilers or source code linters.

 

.env secrets file

We don’t store secrets (account passwords, tokens, cryptographic keys) in git.

When deploying to a server, the server needs to have access to passwords, access keys, tokens and keys.

So we create a file called “.env” (with a dot at the beginning) which contains key-value pairs of secrets and configuration for the environment where it runs.

It is never committed to git, and it is important that .gitignore lists .env so that it stays unversioned.

A .env.template file can be created that lists the keys but not the values that are expected to be filled in .env.
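One use for the template is a quick check that a freshly provisioned .env is complete. The sketch below is an illustration only: parse_env and missing_env_keys are hypothetical helper names, and the parser assumes plain KEY=value lines with # comments rather than the full syntax that django-environ accepts.

```python
def parse_env(text):
    """Parse simple KEY=value lines into a dict, ignoring blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)  # split on the first "=" only
        values[key.strip()] = value.strip()
    return values


def missing_env_keys(template_text, env_text):
    """Return keys listed in the template that .env lacks or leaves empty."""
    env_values = parse_env(env_text)
    return [key for key in parse_env(template_text) if not env_values.get(key)]
```

Such a check could run at deploy time and abort with the list of missing keys before the application starts.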

We use django-environ to read the .env file.

The file is looked up in the same directory as manage.py (not necessarily the current directory; for example, management commands can be called from other directories). After it is read and parsed, its values are assigned to Django settings.

At a minimum, this file contains the database connection string, the Django secret key (used for the hmac of forgot-my-password tokens), and the debug flag.

In addition, the file determines which settings file will be used, as described below.

 

Settings files

Some differences between environments (developer laptop, staging server, prod server) are just values, but sometimes you need to have different source code depending on the environment, and putting it all in a single settings.py with if statements that depend on config values is not pleasant. For this reason, we create a settings file per environment. Most values are shared, so we put the common values in a single base settings file and import * from that file as the first line of the per-environment settings file. This way the per-environment settings file can override anything written in the shared settings file.
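The override mechanism is plain Python: names assigned after the star import replace the shared ones. The following self-contained sketch demonstrates it by writing two tiny settings modules to a temporary directory. The module names demo_base_settings and demo_dev_settings and all the values in them are made up for this demonstration; in a real project the files would live in the settings package and the first line would be a relative import of the base module.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Shared values, playing the role of the base settings file.
BASE = '''
DEBUG = False
TIME_ZONE = "UTC"
INSTALLED_APPS = ["django.contrib.auth"]
'''

# Per-environment file: import everything shared, then override or extend.
DEV = '''
from demo_base_settings import *  # shared values come first

DEBUG = True  # override a shared value
INSTALLED_APPS = INSTALLED_APPS + ["debug_toolbar"]  # extend a shared list
'''


def load_dev_settings():
    """Write both modules to a temp dir and import the per-environment one."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "demo_base_settings.py").write_text(BASE)
        Path(tmp, "demo_dev_settings.py").write_text(DEV)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            return importlib.import_module("demo_dev_settings")
        finally:
            sys.path.remove(tmp)
```

The loaded module keeps TIME_ZONE from the base file untouched, while DEBUG and INSTALLED_APPS reflect the per-environment changes.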

Developers are encouraged to create files like zeev.py in the settings directory and write down their special stuff in there and keep it in git.

The settings files should not contain passwords; we have the .env file for that, as described above.

To make manage.py and wsgi.py load the right per-environment settings file, we use the ENVIRONMENT config value in .env, which holds the name of the settings module. So manage.py loads .env, decides which settings file to load, and then continues. Then base settings reads .env again. The same goes for wsgi.py.

So in manage.py:

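import os

import environ

# read .env first time to decide which settings file to use

env = environ.Env()

# look for .env next to manage.py, not in the current directory

environ.Env.read_env(os.path.join(os.path.dirname(os.path.abspath(__file__)), '.env'))

os.environ.setdefault('DJANGO_SETTINGS_MODULE', env('ENVIRONMENT'))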

 

And in base settings.py:

import os

import environ

root = environ.Path(__file__) - 2  # two folders back (/a/b/c/ - 2 = /a/)

env = environ.Env()

# read .env file second and final time

environ.Env.read_env(os.path.join(str(root), '.env'))

# SECURITY WARNING: keep the secret key used in production secret!

SECRET_KEY = env('SECRET_KEY')

# SECURITY WARNING: don't run with debug turned on in production!

DEBUG = env('DEBUG', cast=bool, default=False)

DATABASES = {'default': env.db()}

 

Request concurrency

We use gevent to process many requests in a single process at the same time with synchronous-looking code.

The magic of gevent is that you get libuv/libev event loop but your code is 100% unmodified, as if you’re in a synchronous python with no concurrency.

We use multiple gunicorn workers to use all cores in a server (one worker per core; for example, the “--workers 4” command-line argument to gunicorn on a 4-core server).

 

To use gevent, we need to make sure all our dependencies are compatible with gevent.

The only issue is the database client: psycopg2 for postgresql and mysqlclient for mysql. Both are C extensions, which gevent's monkey patching cannot make cooperative.

So we use psycogreen in addition to psycopg2, and PyMySQL (which is pure Python) instead of mysqlclient.

 

One thing that does not work is forking using “import multiprocessing”, but we can either avoid monkey patching multiprocessing or have a config variable that disables gevent monkey patching for some management commands. Running a subprocess works fine.
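A config variable of that kind could look like the sketch below. This is an assumption-laden illustration: the variable name GEVENT_MONKEY_PATCH and the helper monkey_patching_enabled are hypothetical, not something the guidelines prescribe.

```python
import os


def monkey_patching_enabled(environ=None):
    """Return False only when GEVENT_MONKEY_PATCH is set to a 'falsy' value."""
    environ = os.environ if environ is None else environ
    value = environ.get("GEVENT_MONKEY_PATCH", "1")  # hypothetical variable name
    return value.strip().lower() not in ("0", "false", "no", "off")


# At the very top of manage.py or wsgi.py one would then write:
#
#     if monkey_patching_enabled():
#         from gevent import monkey
#         monkey.patch_all()
```

A management command that needs multiprocessing could then be started with the variable set to 0 in its environment, while the web workers keep the default.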

 

To get the monkey patching:

1. We pass “--worker-class gevent” argument to gunicorn

2. We monkey patch in manage.py and in wsgi.py, as early as possible. For postgresql:

from gevent import monkey

monkey.patch_all()

from psycogreen.gevent import patch_psycopg

patch_psycopg()

 

or, for mysql:

 

from gevent import monkey

monkey.patch_all()

import pymysql

pymysql.install_as_MySQLdb()
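The two gunicorn command-line arguments mentioned above can also be kept in a gunicorn config file, which is itself a Python module. This is a minimal sketch; the file name gunicorn.conf.py and the choice to derive the worker count from the machine are assumptions, not part of the original setup.

```python
# gunicorn.conf.py
import os

# One worker process per CPU core (the same as passing "--workers N").
workers = os.cpu_count() or 1

# Use gevent workers (the same as passing "--worker-class gevent").
worker_class = "gevent"
```

gunicorn would then be started with the -c option pointing at this file, followed by the project's wsgi module.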

So, why do we bother?

With gevent, we can have one single-threaded process per core on the server, with each process handling thousands of http requests simultaneously, provided almost all of them are waiting for IO (either the database or the network). No manual locking/synchronization code is required. I prefer this system because our apps are rarely CPU-bound and the code is very uncomplicated.

If we ever do something CPU-bound, I expect we can extract it to an external service and maybe even put it on the other side of a queue. If an app is so popular the customer can save significant amounts of money on monthly server costs by running fewer servers, we'll rewrite in go, or run the python on top of go, or whatever (they'll bring in top engineers from google/facebook to scale the application, it won't be our problem).

 

 

The code below is an endpoint that runs 10 http requests to the same server in parallel (with arguments 1..10) and prints their output. Each http request executes a single sql statement that waits 5 seconds and then calculates the square of the input. If the http connections can be reused, waiting for 10 sql statements that each take 5 seconds takes only ~5 seconds in total. For those five seconds, 10 sql statements, 11 http requests (the outer one plus the 10 inner ones) and 10 http clients all wait for IO in parallel in a single process, with very simple linear code.

 

import gevent

import requests

from requests.auth import HTTPBasicAuth

from django.db import connection

from django.http import HttpResponse

 

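def slow_sql(request):

   # x is forced to int here, so formatting it into the sql below is safe

   x = int(request.GET.get('x', 1))

   with connection.cursor() as cursor:

       # sleep 5 seconds on the database server, then return x squared

       cursor.execute("SELECT pg_sleep(5); SELECT %d ^ 2;" % x)

       row = cursor.fetchone()

   return HttpResponse("%d" % int(row[0]))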

 

def ten_slow_requests(request):

   # one shared session, so the http connections can be reused

   requests_session = requests.Session()

   url = 'http://' + request.META['HTTP_HOST'] + '/slow-sql/'

   result = []

 

   def green_worker(x):

       response = requests_session.get(

           url, params={'x': x}, auth=HTTPBasicAuth('4care', '4care'))

       result.append(response.text)

 

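   # start ten greenlets; each one blocks only itself while waiting for IO

   greenlets = []

   for i in range(1, 11):

       greenlets.append(gevent.spawn(green_worker, i))

   # wait until all ten requests have finished

   gevent.joinall(greenlets)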

 

   return HttpResponse('\n'.join(result))

 

How do database connections work?

 

When we use the psycogreen package to add gevent-specific code to psycopg2's async hooks, we get a database connection per greenlet instead of a database connection per thread. Every greenlet (every django http request) has its own independent database connection, so all greenlets can do independent database operations. This allows a single-threaded server to wait on 10 slow sql requests in parallel.

This is what happens on the postgresql server before, during and after the ten requests:

initechuser@ubuntu:~/project/forcare$ pstree -p | grep postgres

         |-postgres(1407)-+-postgres(1484)

         |       |-postgres(1485)

         |       |-postgres(1486)

         |       |-postgres(1487)

         |       `-postgres(1488)

initechuser@ubuntu:~/project/forcare$ pstree -p | grep postgres

         |-postgres(1407)-+-postgres(1484)

         |       |-postgres(1485)

         |       |-postgres(1486)

         |       |-postgres(1487)

         |       |-postgres(1488)

         |       |-postgres(45962)

         |       |-postgres(45963)

         |       |-postgres(45964)

         |       |-postgres(45965)

         |       |-postgres(45966)

         |       |-postgres(45967)

         |       |-postgres(45968)

         |       |-postgres(45969)

         |       |-postgres(45970)

         |       `-postgres(45971)

initechuser@ubuntu:~/project/forcare$ pstree -p | grep postgres

         |-postgres(1407)-+-postgres(1484)

         |       |-postgres(1485)

         |       |-postgres(1486)

         |       |-postgres(1487)

         |       `-postgres(1488)

 

10 sql queries mean 10 postgres worker processes. When they are done, they die.

(Perhaps surprisingly, postgres is multi-process.)

 

#Initech's knowledge corner #django #development #initech #software