geodjango: django tests break when creating new postgis test databases

The new postgis 2.0 library (easily installed via `sudo port install postgis2 +postgresql92` if you are using macports as mentioned in my previous post) helps us convert a standard postgresql database into a spatially-aware database via the simple `CREATE EXTENSION postgis;` command:

$  psql92 -p 5433 -U postgres -d your_db

psql92 (9.2.1)
Type "help" for help.
your_db=#  CREATE EXTENSION postgis;

Unfortunately, when we run our django tests, `test_your_db`, which django’s test runner creates afresh at the start of every test run, will not be postgis-enabled. The tests fail immediately because the test database does not recognize the Geometry fields in our project’s GeoDjango models.

To get around this problem, we need to set up a `template_postgis` template database explicitly, much like what we did before postgis 2.0 (see the django docs: https://docs.djangoproject.com/en/dev/ref/contrib/gis/install/postgis/#creating-a-spatial-database-template-for-earlier-versions).  But because I am running a postgresql 9.1 server and a 9.2 server concurrently, I need to modify my commands a little.

Like this:-

calvin$ psql92 -p 5433 -U postgres
psql92 (9.2.1)
Type "help" for help.

postgres=# CREATE DATABASE template_postgis ENCODING='utf-8';
CREATE DATABASE
postgres=# UPDATE pg_database SET datistemplate='true' WHERE datname='template_postgis';
UPDATE 1
postgres=# \q

calvin$ POSTGIS_SQL_PATH=/opt/local/share/postgresql92/contrib/postgis-2.0

calvin$ psql92 -p 5433 -U postgres -d template_postgis -f $POSTGIS_SQL_PATH/postgis.sql

calvin$ psql92 -p 5433 -U postgres -d template_postgis -f $POSTGIS_SQL_PATH/spatial_ref_sys.sql

calvin$ psql92 -p 5433 -U postgres -d template_postgis -c "GRANT ALL ON geometry_columns TO PUBLIC;"
GRANT

calvin$ psql92 -p 5433 -U postgres -d template_postgis -c "GRANT ALL ON geography_columns TO PUBLIC;"
GRANT

calvin$ psql92 -p 5433 -U postgres -d template_postgis -c "GRANT ALL ON spatial_ref_sys TO PUBLIC;"
GRANT

With this, running `./manage.py test` with your geodjango-based application will work perfectly fine, because django’s test runner creates the test database from the `template_postgis` template database, so our test database is now spatially aware.
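For reference, the project's database settings do not need to change for any of this. Here is a minimal sketch of the relevant `settings.py` fragment, assuming the database name, user and port used in the commands above. The host is my own assumption, so adjust all of these values to your setup:

```python
# settings.py (fragment)
# Values mirror the psql92 commands above; HOST is an assumption.
DATABASES = {
    "default": {
        "ENGINE": "django.contrib.gis.db.backends.postgis",
        "NAME": "your_db",      # the test database is named test_your_db
        "USER": "postgres",
        "HOST": "localhost",    # assumption: the server runs locally
        "PORT": "5433",         # the port used with psql92 above
    }
}
```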

Python lists

Python lists are not really lists based on computer science’s definition of the word.  Classically trained programmers who are new to Python may be confused why a python list’s `append` method is so much more efficient than its `insert` method.

The classical list (not the python list) – what computer scientists call a linked list – is implemented as a series of nodes, each node keeping a reference to the next node.  We can imagine such a linked list in Python like this:-

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

# Usage
>>> L = Node("a", Node("b", Node("c", Node("d"))))
>>> L.next.next.value
'c'

Computer scientists call this a “singly linked list”, as opposed to a “doubly linked list”.  In a doubly linked list, each node also keeps a reference to the previous node, so it is “bi-directional”, whereas our singly linked list example here only points to the next node and does not “remember” the previous node.
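For comparison, a doubly linked list could be sketched like this. The names `DNode` and `link` are made up for this illustration and are not from any library:

```python
class DNode:
    """A node of a doubly linked list (hypothetical name, for illustration)."""

    def __init__(self, value, prev=None, next=None):
        self.value = value
        self.prev = prev
        self.next = next


def link(values):
    """Build a doubly linked list from an iterable and return its head (or None)."""
    head = tail = None
    for value in values:
        node = DNode(value, prev=tail)
        if tail is None:
            head = node          # the first node becomes the head
        else:
            tail.next = node     # hook the new node behind the current tail
        tail = node
    return head


head = link("abcd")
third = head.next.next
print(third.value)        # c
print(third.prev.value)   # b (we can step backwards as well)
```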

But Python’s list type is implemented in a different way.  Instead of several separate nodes referencing each other, a Python list is a single contiguous slab of memory holding references to its elements.  Computer scientists call this an “array”.

Understanding this fundamental difference explains the differences in behaviour (and performance) between the two.

1.  Iterating over Python List and Linked List

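When iterating over the contents of a list, both are equally efficient: visiting every element takes time proportional to the number of elements.  But there’s some (resource) overhead in the linked list, since every node stores a reference to the next node alongside its value.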
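A minimal sketch of both traversals, reusing the `Node` class from above. The helper `iterate` is a name I made up for this example:

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def iterate(node):
    """Yield each value of a singly linked list by following the next references."""
    while node is not None:
        yield node.value
        node = node.next


linked = Node("a", Node("b", Node("c", Node("d"))))
array = ["a", "b", "c", "d"]

# Both loops visit each element exactly once.
for value in iterate(linked):
    print(value)
for value in array:
    print(value)
```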

2.  Accessing an element in a Python List  vs an element in a Linked List

When directly accessing an element at a given index, our Python list (an “array”) is a lot more efficient because the position of the element can be calculated and the right memory location accessed directly (since it is in a contiguous slab of memory)!  To access an element in the linked list, we need to traverse the list from the beginning, one node at a time (much like traversing a DOM tree in HTML).
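To make the traversal cost concrete, here is a rough sketch of index access on our linked list. The function `get` is a hypothetical helper, not part of Python:

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def get(node, index):
    """Return the value at position index by walking index steps from the head."""
    if index < 0:
        raise IndexError("negative indexes are not supported in this sketch")
    for _ in range(index):
        if node is None:
            break
        node = node.next         # one hop per position: the cost grows with index
    if node is None:
        raise IndexError("linked list index out of range")
    return node.value


L = Node("a", Node("b", Node("c", Node("d"))))
print(get(L, 2))                     # c, reached after two hops
print(["a", "b", "c", "d"][2])       # c, located directly from the index
```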

3.  Inserting vs Appending into a Python List compared to a Linked List

The biggest puzzle, as mentioned initially, is the difference between `insert` and `append`.  `insert` in a linked list is very cheap: once we hold a reference to the node at the insertion point, inserting takes roughly the same amount of time no matter how many nodes the list has.  This is precisely because the nodes live at separate memory locations, so inserting only means re-pointing a couple of references and nothing has to move.

On the other hand, the advantage a Python list gains from being an array in a contiguous slab of memory is lost when we attempt insertion, because we must move every element to the right of the insertion point, and possibly even move all the elements into a larger array (a completely new memory slab).  This also explains why `append` is efficient for a Python list: `append` inserts at the end of the memory slab, where there are no elements on its right to move.
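The difference is easy to see for yourself. Below is a small sketch, with helper names I made up (`insert_after`, `time_front_inserts`, `time_appends`): the first part inserts into our linked list by re-pointing references, the second part times front-insertion against appending on a Python list. The actual timings depend on your machine, so I have not listed any:

```python
import timeit


class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def insert_after(node, value):
    """Insert value right after node; no other node is touched."""
    node.next = Node(value, node.next)


L = Node("a", Node("c"))
insert_after(L, "b")             # the list is now a -> b -> c


def time_front_inserts(n):
    """Seconds taken to insert n items at index 0 of a Python list."""
    def run():
        items = []
        for i in range(n):
            items.insert(0, i)   # shifts every existing element one slot right
    return timeit.timeit(run, number=1)


def time_appends(n):
    """Seconds taken to append n items to a Python list."""
    def run():
        items = []
        for i in range(n):
            items.append(i)      # nothing on the right needs to move
    return timeit.timeit(run, number=1)


n = 20000
print("insert(0, x):", time_front_inserts(n))
print("append(x):   ", time_appends(n))
```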

python threading bug: ‘_DummyThread’ object has no attribute ‘_Thread__block’

This bug, filed here – http://bugs.python.org/issue14308 - occurs because of a bad interaction between two things: the dummy thread objects that the threading API creates when we call threading.currentThread() on a foreign thread, and the _after_fork function, which is called to clean up resources after `os.fork()`.

Stephen White also provided a code snippet that demonstrates this problem:-


import os
import thread
import threading
import time

def t():
    threading.currentThread() # Populate threading._active with a DummyThread
    time.sleep(3)

thread.start_new_thread(t, ())

time.sleep(1)

pid = os.fork()
if pid == 0:
    os._exit(0)  # child: exit immediately

os.waitpid(pid, 0)  # parent: wait for the child to exit

Running this script under Python 2 will give you the “no attribute ‘_Thread__block’” error, as explained.  For a detailed explanation and a monkey-patch solution that does not modify the python source code, this is a good resource - http://stackoverflow.com/questions/13193278/understand-python-threading-bug
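To see where the dummy thread objects come from, here is a small sketch written for Python 3, where the old thread module is called `_thread`. It only shows how a foreign thread ends up with a dummy thread object; it does not reproduce the error itself. The names `seen` and `done` are just helpers for this example:

```python
import _thread      # Python 3 name of the old thread module
import threading

seen = {}
done = threading.Event()


def t():
    # This thread was not started through threading.Thread, so the threading
    # module has no record of it and hands back a placeholder object.
    seen["type"] = type(threading.current_thread()).__name__
    done.set()


_thread.start_new_thread(t, ())
done.wait(timeout=5)            # wait until the foreign thread has reported
print(seen["type"])             # _DummyThread
```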

It so happens that django-debug-toolbar’s middleware causes exactly this problem.  And it’s extremely annoying to have my django dev server printing out ‘_DummyThread’ object has no attribute ‘_Thread__block’ in my terminal stdout repeatedly whenever my DebugToolbarMiddleware is enabled.

MIDDLEWARE_CLASSES += (
    'debug_toolbar.middleware.DebugToolbarMiddleware',
)

So here’s my pull request to resolve this issue on django-debug-toolbar - https://github.com/django-debug-toolbar/django-debug-toolbar/pull/333.  I have also taken the liberty of “upgrading” the original use of the thread module to the threading module in this pull request.  The thread module is no longer available under that name in Python 3 (it was renamed to _thread), but the threading module is, so in my opinion it’s better to simply use the threading module!

Further criticisms and suggestions for improvement are welcome.
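As a rough illustration of what moving from thread to threading looks like in general, here is a generic sketch. It is not the diff of the pull request, and the `worker` function is hypothetical:

```python
import threading

results = []


def worker(name):
    # Hypothetical job; stands in for whatever the thread should do.
    results.append("hello from " + name)


# Old style (Python 2 only): thread.start_new_thread(worker, ("job",))
# threading equivalent:
t = threading.Thread(target=worker, args=("job",))
t.daemon = True     # do not keep the interpreter alive for this thread
t.start()
t.join()            # threading also lets us wait for the thread to finish
print(results)      # ['hello from job']
```

A thread started this way is registered with the threading module, so calling threading.current_thread() inside it returns a real Thread object rather than a dummy one.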