Over the last few months, summer, holidays, and catching up on a work backlog have made it impossible to spend any useful time on my music programming. I have the various pieces in a halfway-finished state. I should do a first release of the CD ripper I made; the mixer sort-of-works-most-of-the-time but is still command-line only and has no play controls; a Django-based rewrite of DAD is pretty much functional up to the point where I can rate tracks and migrate the old DAD database; and I bought the LEGO Mindstorms set to make the CD ripper robot.
However, I still had one nagging problem I wasn’t sure yet how I would end up dealing with. The problem is conceptually simple.
I want to rate tracks on different computers and have this be synchronized to all of them.
There are two parts to this problem. One is: what is ‘the same track’? Linking properly to a database of unique IDs would be the best way to do this, and it seems MusicBrainz over the years has evolved to more or less solve that problem. I should still run more tests over my collection, but it seems I can resolve most of the duplicate tracks I own to the same ‘song’.
The second problem is a technology problem: how do you synchronize database information between the various machines, meaning my work and home desktops, my laptop, the Elisa/Moovida box I have, Kristien’s iPod, my N800, and my future car player?
Well, it seems I haven’t been paying attention to databases recently, and I also didn’t really look into Stuart’s musings about desktop CouchDB. Recently, when Marc-André Lureau arrived in Barcelona and we went out for dinner, he and Zaheer heard my tale of woe and brought up CouchDB again.
All I remembered from a brief look at CouchDB was that it is a schemaless document store written in Erlang. For some reason I completely missed the replication mechanism and conflict resolution it has built in, which to date is the closest match to what I think I need for my application.
So I’ve been reading the CouchDB manual recently, which has given me once again that same warm fuzzy feeling as when I originally got turned on to GStreamer (by reading the excellent manual at the time) and Twisted (same thing, the docs did it for me).
And now I’m torn. Switching to CouchDB might be the right move, but I don’t know if it is. I have no idea how I should restructure the applications around it, how best to store the data in CouchDB, and so on. The main reason for wanting to switch would definitely be the replication model. Not having SQL or a reasonable way of querying the data (at least as far as I can tell currently, unless all data for a track were stored in a single document) is a big turnoff.
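To make the "all data for a track in a single document" idea a bit more concrete, here is a rough sketch of what such a document could look like, written as plain Python dictionaries with no CouchDB involved. Apart from `_id`, which is CouchDB's document key, every field name here is something I made up for illustration, as are the helper functions and the placeholder MusicBrainz ID. The "query" is just a list comprehension standing in for whatever querying mechanism would really be used.

```python
import json


def make_track_doc(mbid, title, artist):
    """Build one self-contained document for a track.

    Hypothetical layout: the document is keyed by a MusicBrainz ID and
    carries its own ratings, so nothing needs to be joined in later.
    """
    return {
        "_id": "track:" + mbid,
        "type": "track",
        "title": title,
        "artist": artist,
        "ratings": {},  # machine name -> score given on that machine
    }


def rate(doc, machine, score):
    """Record the score a given machine assigned to this track."""
    doc["ratings"][machine] = score
    return doc


def rated_at_least(docs, minimum):
    """Stand-in for a query: titles whose best rating is >= minimum."""
    return [
        d["title"]
        for d in docs
        if d["ratings"] and max(d["ratings"].values()) >= minimum
    ]


if __name__ == "__main__":
    # Placeholder ID, not a real MusicBrainz identifier.
    doc = make_track_doc(
        "00000000-0000-0000-0000-000000000000", "Example Song", "Example Artist"
    )
    rate(doc, "laptop", 4)
    print(json.dumps(doc, indent=2))
```

Whether keeping per-machine ratings inside the track document is a good idea is exactly the kind of modelling question I cannot answer yet; the sketch only shows that a track can be described without any relational schema.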
So I’m wondering what other options are out there, or how people have solved similar problems. Is CouchDB’s replication system a side effect of its design and RESTful API? Or is that kind of replication possible in a more structured database? Is it possible to put lipstick on the pig that is traditional SQL and make replication between sometimes-online computers a reality? Are there other database servers out there that may be a better fit?
I’ve been Googling a little, finding things like Cassandra (interestingly also an Apache incubator project like CouchDB), Project Voldemort, Persevere, Hypertable, Redis, Tokyo Cabinet, and probably some more.
I’m not even sure what the right search terms are for the features I want. It seems eventually consistent is one of them (as in – eventually the separate databases on each machine will end up having the same data), although in my case the eventually would come from the machines not always being able to directly communicate by design, not just because of an operating failure.
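As a toy illustration of what I mean by "eventually the same data", here is a minimal sketch of two machines exchanging ratings whenever they happen to see each other. It assumes each rating is stored with a timestamp and that the newest one wins. That is one simple application-level strategy I picked for the example, not a description of how CouchDB resolves conflicts. The function name and data layout are hypothetical.

```python
def merge_ratings(local, remote):
    """Merge two {track_id: (timestamp, score)} mappings.

    The newer timestamp wins. Comparing the whole (timestamp, score)
    tuple breaks timestamp ties on the score, so the result does not
    depend on which machine initiates the merge.
    """
    merged = dict(local)
    for track_id, entry in remote.items():
        if track_id not in merged or entry > merged[track_id]:
            merged[track_id] = entry
    return merged


if __name__ == "__main__":
    laptop = {"track-a": (10, 3), "track-b": (5, 4)}
    desktop = {"track-a": (12, 5), "track-c": (1, 2)}

    # Whichever side starts the sync, both end up with the same data.
    print(merge_ratings(laptop, desktop) == merge_ratings(desktop, laptop))
```

A real setup would also have to cope with clocks that disagree between machines, which this sketch ignores completely.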
In the meantime, I am paralyzed by the choice between staying with Django a while longer as I develop some more features (and manually syncing the database between computers) and porting to something like CouchDB right away.
I am seriously considering learning CouchDB a bit better, possibly starting with a rewrite of yagtd against a CouchDB database, since it seems such an obvious match. I currently manage my yagtd todo.txt file in Subversion, and having to commit and update is pure mental friction, not to mention resolving conflicts when I forgot to do the commit/update in the first place.
I would think that a GTD task list would be a perfect match for CouchDB’s document model and replication strategy, and that this approach would definitely solve the generic task list problem.
Enrich my vocabulary and choice of options! Suggest stuff in the comments.