Present Perfect

Python 2.7, JSON, and unicode

Filed under: couchdb,DAD,General,Python,Twisted — Thomas @ 17:28

2011-04-23
17:28

I have been hacking on Paisley more recently. I actually got to hack during work time on Paisley, because the guys needed a feature I had developed - change notification. But more on that later.

I eked out a four-day hacking session over this Easter weekend, and my primary goal was to make some good advances on my music database system using CouchDB. The code is still in the prototype stage, and I wanted to start removing hacks, tightening things down, and adding tests. But suddenly I found myself having to add a bunch of unicode() calls on data coming back from Paisley, just because I was being stricter on the input to Paisley functions.
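To make the kind of workaround described above concrete, here is a minimal sketch of coercing everything that comes back from a JSON decoder to unicode text. The helper name ensure_text is hypothetical and is not part of Paisley; it assumes any byte strings it meets are plain ASCII, which is the only case where the decoder hands them back.

```python
text_type = type(u"")   # unicode on Python 2, str on Python 3


def ensure_text(obj):
    """Recursively turn byte strings in decoded JSON data into unicode.

    Hypothetical helper, not part of Paisley. Assumes byte strings are ASCII.
    """
    if isinstance(obj, bytes):          # bytes is str on Python 2
        return obj.decode("ascii")
    if isinstance(obj, dict):
        return dict((ensure_text(k), ensure_text(v)) for k, v in obj.items())
    if isinstance(obj, (list, tuple)):
        return [ensure_text(item) for item in obj]
    return obj                          # numbers, None, booleans, unicode
```

Calling something like this on every result works, but it is exactly the sort of scattered patching that hides the real cause.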

I didn't want to have to deal with unicode, again, as it distracted me from the core of my application. But I didn't like taking on the technical debt of not understanding what was going wrong under the hood, either.

From my limited understanding, JSON is an object notation format used for exchange of information between processes and applications. A JSON string is in unicode, which is great. It would be pretty useless otherwise in today's world. So I should be able to send in unicode to JSON libraries and get unicode back out.

A recent change in Paisley by one of the maintainers prefers simplejson over the stdlib-json. I first thought this change was to blame for my problems.

And yes, when decoding a JSON object, text was returned as str instead of unicode objects. Now, this is only when the text is in fact ASCII and hence works fine both as str and as unicode. And I'm sure opinions will differ here - but I think that a JSON library should *always* deserialize text to the same type of object by default - ie, unicode.

Clearly, simplejson disagrees with me. But I didn't have this problem a few weeks ago, so something changed! What gives? And changing back to json over simplejson didn't fix it either!

After some googling, I stumbled upon this bug report. Apparently, in 2.7, the C-based implementation deserializes ASCII text as str instead of unicode. The Python-based one always returns unicode for text. And in previous Pythons, both always returned unicode for text.

In essence, my problem boiled down to this:


[thomas@ana ~]$ ipython
Python 2.7 (r27:82500, Sep 16 2010, 18:02:00)
Type "copyright", "credits" or "license" for more information.
IPython 0.10.2 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object'. ?object also works, ?? prints more.
In [1]: from json import decoder
In [2]: decoder.py_scanstring('"str"', 1)
Out[2]: (u'str', 5)
In [3]: decoder.c_scanstring('"str"', 1)
Out[3]: ('str', 5)

versus


[py-2.6] [thomas@ana ~]$ python
Python 2.6.2 (r262:71600, Sep 29 2009, 21:49:07)
[GCC 4.4.1 20090725 (Red Hat 4.4.1-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from json import decoder
>>> decoder.py_scanstring('"str"', 1)
(u'str', 5)
>>> decoder.c_scanstring('"str"', 1)
(u'str', 5)

Note how, under Python 2.7, the last command (c_scanstring) returned a normal str object, while under 2.6 both scanners returned unicode.

And yes, in the past few weeks since I last tested this, I did indeed upgrade my machine to Fedora 14, pulling in Python 2.7.

simplejson seems to always deserialize to str when it can. I would consider that a bug - ie, 'be strict in what you produce'.

As for Paisley, I made a feature-unicode branch on github, and this commit introduces a compatibility pjson module. By default it is STRICT, i.e. it always wants unicode back: it tests for the buggy behaviour and, if it finds it, provides an alternative loads implementation that falls back to the Python one. I'm sure some Paisley devs will still prefer simplejson, so you can change the STRICTness and prefer simplejson.
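The following is a rough sketch of how such a compatibility layer could look. It is not the actual pjson code from the Paisley branch. The names STRICT, c_scanner_is_lenient and PurePythonDecoder are assumptions made for illustration; only json.decoder.py_scanstring, json.decoder.c_scanstring and json.scanner.py_make_scanner come from the standard library.

```python
import json
from json import decoder, scanner

STRICT = True            # hypothetical flag: insist on unicode text
text_type = type(u"")    # unicode on Python 2, str on Python 3


def c_scanner_is_lenient():
    """Return True if the C string scanner hands back non-unicode text."""
    c_scanstring = getattr(decoder, "c_scanstring", None)
    if c_scanstring is None:
        return False     # no C speedups available, nothing to work around
    value, _end = c_scanstring('"str"', 1)
    return not isinstance(value, text_type)


class PurePythonDecoder(json.JSONDecoder):
    """JSONDecoder wired to the pure-Python scanner functions."""

    def __init__(self, *args, **kwargs):
        json.JSONDecoder.__init__(self, *args, **kwargs)
        self.parse_string = decoder.py_scanstring
        self.scan_once = scanner.py_make_scanner(self)


def loads(text, **kwargs):
    """Like json.loads, but fall back to the Python scanner when STRICT."""
    if STRICT and c_scanner_is_lenient():
        # Object keys are scanned through the module-level name, so it is
        # rebound too. Note that this is a process-wide side effect.
        decoder.scanstring = decoder.py_scanstring
        kwargs.setdefault("cls", PurePythonDecoder)
    return json.loads(text, **kwargs)
```

The check runs the same probe as the interpreter session above, so the slower Python path is only taken on interpreters that actually show the behaviour.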

Now, back to the hack.

CouchDB python unittest setUp/tearDown

Filed under: couchdb,Hacking,Python,Twisted — Thomas @ 23:44

2011-04-03
23:44

I've been hacking on Paisley again recently since I found out I am not the only current maintainer. There is a branch on github from which a 0.3 release was recently made.

That's good news, because I didn't really need a new project to maintain. But I still have code I want to see land there, so I'm working on merging branches between launchpad, github, and some of my experimental svn branches here and there.

I had just implemented a cache for the object view mapping using couchdb-python's mapping.py and it turns out someone else was interested in adding memcache support to cache document lookups.

Some discussion started on a possible API, and I took a stab at a first draft over the past week.

Separately from that, I also took a CouchDB training course for work (together with Marek, one of our developers), run by the Couchbase people (Couchbase being the company merger of CouchOne, formerly CouchIO (?), and Membase). That was a good training - but I digress.

At night Marek told me that they have some 300 lines of code that sadly reuse some classes from the current work codebase to set up and tear down test cases that work against an actual couchdb instance. He didn't feel like rewriting all that code to drop the dependency on work's code just so that it could be contributed to Paisley, for example. I felt I could do it in less than 100 lines, but he didn't seem to believe me.

So here I am after a magnificent Jose Gonzalez concert at the Palau de la Musica which is right around the corner from me, trying to write the caching code, and realizing I can't properly test it together with the change notification listener I wrote.

So while I was watching an episode of Breaking Bad, I wrote the setUp and tearDown code to do just that - start a couchdb instance on a random port, get the port, and connect to it.

It's probably not perfect yet (I do a busy loop waiting for the log file to be created and filled so I can read the port from it), but it worked for my simple test case. And it's 74 lines of code, including docstrings (which Marek for some reason does not believe in) and comments (which Marek also does not believe in).
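The busy loop on the log file could look roughly like the sketch below. This is not the code from the branch: the function name wait_for_port is hypothetical, and the regular expression assumes the log contains a startup line ending in a URL such as http://127.0.0.1:5984/, which may differ between CouchDB versions.

```python
import re
import time

# Assumed shape of the startup line: "... started on http://127.0.0.1:5984/"
_URL_RE = re.compile(r"https?://[^\s:/]+:(\d+)/")


def wait_for_port(log_path, timeout=10.0, interval=0.05):
    """Poll log_path until a URL with a port shows up, then return the port.

    Hypothetical helper; raises RuntimeError if nothing appears in time.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(log_path) as handle:
                match = _URL_RE.search(handle.read())
        except IOError:
            match = None            # log file not created yet
        if match:
            return int(match.group(1))
        time.sleep(interval)        # busy loop, but a polite one
    raise RuntimeError("no port found in %s" % log_path)
```

A setUp method would start couchdb with a configuration asking for a random port, call this helper, and connect to the returned port; tearDown would stop the process and remove the temporary files.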

It's being worked on in this branch and I hope to land that in the paisley tree soon.

Filed under: couchdb,Python,Releases,Twisted — Thomas @ 23:27

2010-11-24
23:27

I've been working on Paisley some more recently, finishing a first stab at a document mapping API.

As discussed with Christopher Lenz a long time ago, I basically took his mapping code and applied it to Paisley.

In my personal project I also added a caching version of the CouchDB object, but I'm not yet convinced it is the right approach, so it's not in Paisley yet. One of the things I think I will need to do to make that useful is to have it listen to change notifications, so it can change cached objects when they change in couchdb, and implement notifications for these changes so that a program can be informed of them too and react accordingly.
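As a rough illustration of that idea, here is a synchronous sketch of a cache that refreshes documents when told about a change and passes the change on to listeners. It is not Paisley code: the class NotifyingCache and its methods are hypothetical, and a real version would return Twisted deferreds and be driven by the CouchDB change notification feed.

```python
class NotifyingCache(object):
    """Cache documents by id and refresh them on change notifications.

    Hypothetical sketch; fetch is any callable mapping doc_id -> document.
    """

    def __init__(self, fetch):
        self._fetch = fetch
        self._docs = {}
        self._listeners = []

    def add_listener(self, callback):
        """Register callback(doc_id, document) for changed cached docs."""
        self._listeners.append(callback)

    def get(self, doc_id):
        """Return the cached document, fetching it on first use."""
        if doc_id not in self._docs:
            self._docs[doc_id] = self._fetch(doc_id)
        return self._docs[doc_id]

    def changed(self, doc_id):
        """To be called by the change listener when doc_id changes."""
        if doc_id not in self._docs:
            return                  # never cached, nothing to refresh
        self._docs[doc_id] = self._fetch(doc_id)
        for callback in self._listeners:
            callback(doc_id, self._docs[doc_id])
```

Only documents that were already cached are refetched, so the cache does not grow just because something else in the database changed.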

In any case, I'd like to work towards a release, so feel free to take a look at the branch I've made to implement this on, give any feedback or do any code review, and let me know.

I’m back

Filed under: Life,Spain,Twisted — Thomas @ 18:32

2010-06-09
18:32

So obviously, blog-wise I fell off the face of the earth for close to two months.

The immediate reason is some personal stuff happening to me that I needed to bounce back from (well, ok, I lied - it's not stuff, it's just one tiny little thing.)

As a result I haven't done much hacking at all, besides a few fruitful morituri hack sessions.

As a consequence, I don't have much useful to report, but I am going to slowly get back to some hacking. My Lego Mindstorms are already with me here in Barcelona so I am going to get started on that CD ripping robot Any Day Now.

I'll get more specific about what non-hacking stuff I've been up to recently after the fallout of the personal stuff, but for now I'll just mention I've been hugely enjoying getting back to playing basketball over the last year. A while ago Farid taught me a nice layup trick, and yesterday I had Pepe film it:

I haven't pulled that one off correctly during a game though!

Oh wait, I lied. Yesterday I got a proof of achievement of something hacker-related: my Spanish diploma in Twisted!

I need to buy me a wall to hang that on, it's just too cool! And the back lists all skills achieved, in Spanish. Check this out:

"El manejo de errores robusto con diferidos". I'm sure that official had a field day translating deferred into Spanish.

Life! I'm back to eating you, one bite at a time. Make sure you're ready for me.

Twisted PB to JSON-RPC bridge

Filed under: Hacking,Python,Twisted — Thomas @ 11:58

2010-04-16
11:58

Day two of the internal platform training sessions. Today is hacking and bugfixing day.

I wanted to take a stab at the task of creating an RPC interface to expose our Perspective Broker interface. I have very little experience with RPC systems (apart from PB, and moap's use of Trac's RPC) so this was a good opportunity to get my feet wet.

Smart hacking is lazy hacking, so I started by Googling. After some false positives, I found a Twisted JSON-RPC project, and since it was maintained by Duncan McGreggor it gave me hope that it would work.

And so it did. I wrote a simple adapter object that takes an instance of a PB root and proxies jsonrpc_* calls to remote_* calls.
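The proxying itself can be sketched in a few lines. This is not the code from the repository: PBToJSONRPCAdapter is a hypothetical plain class, whereas the real adapter would presumably be plugged into txjsonrpc's resource class and wrap an actual PB root.

```python
class PBToJSONRPCAdapter(object):
    """Expose remote_* methods of a PB root under jsonrpc_* names.

    Hypothetical sketch; root is any object with remote_* methods.
    """

    def __init__(self, root):
        self._root = root

    def __getattr__(self, name):
        # Only called when normal lookup fails, so _root is not affected.
        prefix = "jsonrpc_"
        if name.startswith(prefix):
            return getattr(self._root, "remote_" + name[len(prefix):])
        raise AttributeError(name)
```

With a root that defines remote_echo, looking up jsonrpc_echo on the adapter returns the bound remote_echo method, so arguments and return values pass straight through.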

This is just a proof of concept; obviously there are many caveats. The main one is that currently you can only use it for remote_ calls with simple objects that txjsonrpc supports (although it looks like it supports a deferred, for example, so sweet).

Obviously, one of the attractions of PB is that you can transfer objects. At the least this bridge could be extended to support getting references to objects and then invoking methods on them or passing them as arguments.

In any case, good enough for a one hour hack.

The code is in my tests repository and you can check out with

svn co https://thomas.apestaart.org/thomas/svn/tests/twisted/pb2jr

As for actually using it - after writing it Jan and I discussed how it should be used, and he told me he doesn't actually want to run this in the same process where we run our PB interface. Instead he wants it to act as a proxy, which would at best mean doing code inspection or importing the PB Root in the proxy to be able to provide the JSON-RPC interface dynamically, and you wouldn't be sure whether the running code's PB Root is the same as whatever you have on disk in your proxy.

I'm not very convinced about the coupling argument, because this is just a thin layer on top of the PB server, and everything will end up going through the PB server anyway, so I don't see any gain from decoupling here.

I lost interest while discussing, so I'm going to leave it in its current state for now. Although I would have preferred to just plow ahead, do some of the cooler introspection bits, provide a web UI to look at methods and invoke them, and so on.
