Python 2.7, JSON, and unicode |
2011-04-23
|
I have been hacking on Paisley more recently. I actually got to hack during work time on Paisley, because the guys needed a feature I had developed - change notification. But more on that later.
I eked out a four day hacking session over this easter weekend, and my primary goal is to make some good advances on my music database system using CouchDB. The code is still in prototype stage, and I wanted to start removing hacks, tightening things down, and adding tests. But suddenly I found myself having to add a bunch of unicode() calls on data coming back from paisley just because I was being stricter on the input to paisley functions.
I didn't want to have to deal with unicode, again, as it detracted me of the core of my application. But I didn't like paying the technical debt either of not understanding what was going wrong under the hood.
From my limited understanding, JSON is an object notation format used for exchange of information between processes and applications. A JSON string is in unicode, which is great. It would be pretty useless otherwise in today's world. So I should be able to send in unicode to JSON libraries and get unicode back out.
A recent change in Paisley by one of the maintainers prefers simplejson over the stdlib-json. I first thought this change was to blame for my problems.
And yes, when decoding a JSON object, text was returned as str instead of unicode objects. Now, this is only when the text is in fact ASCII and hence works fine both as str and as unicode. And I'm sure opinions will differ here - but I think that a JSON library should *always* deserialize text to the same type of object by default - ie, unicode.
Clearly, simplejson disagrees with me. But I didn't have this problem a few weeks ago, so something changed! What gives? And changing back to json over simplejson didn't fix it either!
After some googling, I stumbled upon this bug report. Apparently, in 2.7, the C-based implementation deserializes ASCII text as str instead of unicode. The Python-based one always returns unicode for text. And in previous Pythons, both always returned unicode for text.
In essence, my problem boiled down to this:
[thomas@ana ~]$ ipython
Python 2.7 (r27:82500, Sep 16 2010, 18:02:00)
Type "copyright", "credits" or "license" for more information.
IPython 0.10.2 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object'. ?object also works, ?? prints more.
In [1]: from json import decoder
In [2]: decoder.py_scanstring('"str"', 1)
Out[2]: (u'str', 5)
In [3]: decoder.c_scanstring('"str"', 1)
Out[3]: ('str', 5)
versus
[py-2.6] [thomas@ana ~]$ python
Python 2.6.2 (r262:71600, Sep 29 2009, 21:49:07)
[GCC 4.4.1 20090725 (Red Hat 4.4.1-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from json import decoder
>>> decoder.py_scanstring('"str"', 1)
(u'str', 5)
>>> decoder.c_scanstring('"str"', 1)
(u'str', 5)
Note how the last command returned a normal str object.
And yes, in the past few weeks since I last tested this, I did indeed upgrade my machine to Fedora 14, pulling in Python 2.7
simplejson seems to always deserialize to str when it can. I would consider that a bug - ie, 'be strict in what you produce'.
As for Paisley, I made a feature-unicode branch on github, and this commit introduces a compatibility pjson module. By default it is STRICT, ie it wants unicode back always, and tests for the buggy behaviour, and does an alternative loads implementation that falls back to the python one. I'm sure some Paisley devs will prefer simplejson still, so you can change STRICTness and prefer simplejson.
Now, back to the hack.
Why not look at Python 3, where the unicode/str distinction has been abolished, in favour of using unicode strings throughout?
Comment by Simon — 2011-04-25 @ 22:50
i haven’t tested it, but it seems to be fixed in python 2.7.1
(commit: http://hg.python.org/cpython/rev/a778b963eae3),
also changelog says (http://svn.python.org/projects/python/tags/r271/Misc/NEWS)
“””
json.loads() on str should always return unicode (regression
from Python 2.6).
“””
Comment by gabor — 2011-05-04 @ 13:35