
Present Perfect


music system blocked

Filed under: Hacking — Thomas @ 6:41 pm

2009-8-31

Over the last few months, summer, holidays, and catching up with my work backlog have made it impossible to spend any useful time on my music programming. I have the various pieces in a halfway-finished state. I should do a first release of the CD ripper I made; the mixer sort-of-works-most-of-the-time but is still command line and does not have play controls; a Django-based rewrite of DAD is pretty much functional up to the point where I can rate tracks and migrate the old DAD database; and I bought the LEGO Mindstorms set to make the CD ripper robot.

However, I still had one nagging problem I wasn’t sure yet how I would end up dealing with. The problem is conceptually simple.

I want to rate tracks on different computers and have this be synchronized to all of them.

There are two parts to this problem. One is ‘what is the same track?’ Linking properly to a database of unique IDs would be the best way to do this, and it seems MusicBrainz has evolved over the years to more or less solve that problem. I should still run more tests over my collection, but it seems I can resolve most of the duplicate tracks I own to the same ‘song’.
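As a rough illustration of the ‘same track’ idea, here is a minimal Python sketch that groups files by a MusicBrainz recording identifier. The `mbid` and `path` field names and the sample values are made up for the example; how the identifier gets looked up in the first place is not covered.

```python
from collections import defaultdict


def group_by_recording(tracks):
    """Group file paths by recording ID; files without an ID stay unresolved.

    Each track is a dict with hypothetical 'path' and 'mbid' keys.
    """
    groups = defaultdict(list)
    unresolved = []
    for track in tracks:
        mbid = track.get("mbid")
        if mbid:
            groups[mbid].append(track["path"])
        else:
            unresolved.append(track["path"])
    return dict(groups), unresolved


# Made-up sample data: two files resolve to the same recording.
library = [
    {"path": "album_a/01.flac", "mbid": "rec-1"},
    {"path": "best_of/07.flac", "mbid": "rec-1"},
    {"path": "live/03.flac", "mbid": None},
]
songs, unresolved = group_by_recording(library)
```

A rating would then be attached to the recording ID rather than to any single file, so every copy of the song shares it.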

The second problem is a technology problem: how do you synchronize database information between the various machines (my work and home desktops, my laptop, the Elisa/Moovida box I have, Kristien’s iPod, my N800, and my future car player)?

Well, it seems I haven’t been paying attention to databases recently, and I also didn’t really look into Stuart’s musings about desktop CouchDB. Recently, when Marc-André Lureau arrived in Barcelona and we went out for dinner, he and Zaheer, upon hearing my sad tale of woe, brought up CouchDB again.

All I remembered from a brief look at CouchDB is that it was a schemaless document store written in Erlang. For some reason I completely missed the replication mechanism and conflict resolution it has built-in, which to date has been the closest match to what I think I need for my application.

So I’ve been reading the CouchDB manual recently, which has given me once again that same warm fuzzy feeling as when I originally got turned on to GStreamer (by reading the excellent manual at the time) and Twisted (same thing, the docs did it for me).

And now I’m torn. Switching to CouchDB might be the right move, but I don’t know if it is. I have no idea how I should restructure the applications around it, how best to store the data in CouchDB, and so on. The main reason for wanting to switch would definitely be the replication model. Not having SQL or a reasonable way of querying the data (at least as far as I can tell currently, unless all data for a track would be stored in a single document) is a big turnoff.

So I’m wondering what other options are out there, or how people have solved similar problems. Is CouchDB’s replication system a side effect of their design and RESTful API? Or is that kind of replication possible in a more structured database? Is it possible to put lipstick on the pig that is traditional SQL to make replication between sometimes-online computers a reality? Are there other database servers out there that may be a better fit?

I’ve been Googling a little, finding things like Cassandra (interestingly also an Apache incubator project like CouchDB), Project Voldemort, Persevere, Hypertable, Redis, Tokyo Cabinet, and probably some more.

I’m not even sure what the right search terms are for the features I want. It seems eventually consistent is one of them (as in – eventually the separate databases on each machine will end up having the same data), although in my case the eventually would come from the machines not always being able to directly communicate by design, not just because of an operating failure.
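To make ‘eventually consistent’ concrete for ratings, here is a small Python sketch of one possible merge rule, last-writer-wins. It is only an assumption about how ratings could be reconciled, not something CouchDB does for you; the `(rating, timestamp)` layout and the machine names are made up.

```python
def merge_ratings(local, remote):
    """Merge two {track_id: (rating, timestamp)} dicts, newest timestamp wins.

    Ties on the timestamp are broken by the higher rating, so both sides
    reach the same answer no matter which one performs the merge.
    """
    merged = dict(local)
    for track_id, (rating, stamp) in remote.items():
        if track_id not in merged:
            merged[track_id] = (rating, stamp)
            continue
        old_rating, old_stamp = merged[track_id]
        if (stamp, rating) > (old_stamp, old_rating):
            merged[track_id] = (rating, stamp)
    return merged


# Two machines that rated tracks while they could not talk to each other.
laptop = {"song-a": (4, 100), "song-b": (2, 150)}
desktop = {"song-a": (5, 120), "song-c": (3, 90)}

laptop_after = merge_ratings(laptop, desktop)
desktop_after = merge_ratings(desktop, laptop)
```

Whenever the two machines do meet and exchange data, both end up with the same ratings, which is all ‘eventually’ needs to mean here.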

In the meantime, I am paralyzed with the choice between staying with Django for a while more as I develop some more features (and manually syncing the database between computers) or already porting to something like CouchDB.

I am seriously considering learning CouchDB a bit better, possibly starting with a rewrite of yagtd against a CouchDB database – it seems such an obvious match. I currently manage my yagtd todo.txt file in subversion, and having to commit and update is pure mental friction, not to mention resolving conflicts when I forgot to do the commit/update in the first place.

I would think that a GTD task list would be a perfect match for CouchDB’s document model and replication strategy, and would definitely correctly solve the generic task list problem.

Enrich my vocabulary and choice of options! Suggest stuff in the comments.

11 Comments »

  1. I’m still looking for a music system that looks over all of my files, works out which tracks are present on more than one album, then removes all the duplicates and changes them to symlinks to the original.

    Preferably it would move the original to the next album on the list if I wanted to delete the album that currently has the original file.

    That would save me lots of space.
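One way the symlink idea could look in Python is sketched below. It assumes duplicates are byte-identical files (two different encodes of the same song would not match), the function names are made up, and it does not cover re-homing the original when an album is deleted. It deletes files, so try it on a copy first.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def symlink_duplicates(root):
    """Replace byte-identical files under root with symlinks to the first copy."""
    originals = {}
    replaced = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk directories in a predictable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # already a symlink, leave it alone
            digest = file_digest(path)
            if digest in originals:
                os.remove(path)
                os.symlink(os.path.abspath(originals[digest]), path)
                replaced.append(path)
            else:
                originals[digest] = path
    return replaced
```

Calling `symlink_duplicates("/path/to/music")` returns the list of paths that were turned into symlinks.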

    Comment by Mårten W — 2009-8-31 @ 8:00 pm

  2. Other terms you might want to look for: “master-master replication” (or “multi-master replication”) and “conflict resolution”

    Another document store that I’ve been very excited about is MongoDB (http://www.mongodb.org). It has excellent Python bindings, its query interface is a little more traditional than CouchDB’s, and it’s a hell of a lot faster right now, too. Unfortunately, it’s probably not what you want for this project. It has only experimental support for master-master replication, so it’s ill suited to the kind of distributed, eventually consistent system you’re building.

    On the Django side of things, there is some experimental and hacky support for layering the Django ORM on top of CouchDB, but it really doesn’t seem worth it. A much better setup, IMO, would be to use a traditional relational database for things like user accounts, site settings, and administration and use CouchDB directly for storing your real data.

    I’ve looked at Hypertable a bit too. My feeling is that it’s not quite ready yet for “production” use. It probably also isn’t quite what you want: it’s emulating Google BigTable, and there’s the assumption there that the data is going to live on a cluster, sharing a distributed file system (like the Hadoop DFS). As you say you want a system in which the machines might not always be able to communicate by design, that requirement would seem to eliminate Hypertable.

    Comment by Joe Shaw — 2009-8-31 @ 9:21 pm

  3. CouchDB is awesome, you should definitely look at it. I’ve converted a few of my sites from PHP/MySQL to Python/CouchDB and it’s a much better fit for many of the problems in small community websites. CouchDB is very slick and focused (though maybe not as fast as MongoDB right now — but their document format seems exceedingly ugly).

    Comment by Dirkjan Ochtman — 2009-8-31 @ 10:02 pm

  4. Storing all data related to a track within a single document seems like a very good setup. CouchDB works primarily by writing map/reduce functions, which are actually quite simple, once you get the hang of it. For instance, if your track documents are structured as

    {_id: , title: …, artist: …, album: …, rating: …}

    then a view that returns all songs by an artist could be written as

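    // Map step: index every track document under its artist name.
    map = function(track) {
      if (track.artist) {
        emit(track.artist, track);
      }
    }

    // Reduce step: CouchDB calls this as (keys, values, rereduce).
    // It is optional here; querying the map with ?key="<artist>" already returns that artist's tracks.
    reduce = function(keys, tracks, rereduce) {
      return tracks;
    }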
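For readers more at home in Python, the map step above can be modeled in memory as follows. This is only a sketch of the idea (emit key/value rows, sort by key, look up by key) and not CouchDB’s API; `build_view`, `tracks_by_artist`, `query`, and the sample documents are all made up.

```python
def build_view(docs, map_fn):
    """Run map_fn over every document and collect the emitted rows, sorted by key."""
    rows = []

    def emit(key, value):
        rows.append((key, value))

    for doc in docs:
        map_fn(doc, emit)
    rows.sort(key=lambda row: row[0])  # a view keeps its rows ordered by key
    return rows


def tracks_by_artist(doc, emit):
    """Map function: emit one row per track, keyed by artist."""
    if "artist" in doc:
        emit(doc["artist"], doc["title"])


def query(rows, key):
    """Return the values of all rows whose key matches exactly."""
    return [value for row_key, value in rows if row_key == key]


docs = [
    {"_id": "t1", "title": "Song 1", "artist": "Artist A", "rating": 4},
    {"_id": "t2", "title": "Song 2", "artist": "Artist B", "rating": 2},
    {"_id": "t3", "title": "Song 3", "artist": "Artist A", "rating": 5},
    {"_id": "note", "text": "not a track"},
]
rows = build_view(docs, tracks_by_artist)
songs = query(rows, "Artist A")
```

The lookup by key is what replaces a `WHERE artist = ...` clause; no reduce step is needed for it.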

    Comment by Daniel Schierbeck — 2009-8-31 @ 10:03 pm

  5. Well, obviously I’m going to comment on this :-)

    Couch is a jolly nice way to do this sort of thing.

    In the next release of Ubuntu there’s Desktop Couch, where every user gets their own CouchDB to store data in (this is the stuff I’ve been writing about; more details are coming soon). Desktop Couch already handles sharing your data between machines; you just “pair” the two machines (say, your laptop and your netbook) and data saved into one ends up in the other without you having to do any work at all (either as a user or as developer of an application which saves data into CouchDB).

    I massively, massively want to share music metadata between the two moovida boxen I have and my laptop. When I create a playlist in one place it should appear in all the others too. Having this happen is one of my driving use-cases for desktop Couch :-)

    I’d really like to chat about this more; chase me up for an IRC chat or a beer or something!

    Comment by sil — 2009-8-31 @ 11:54 pm

  6. I suggest sticking with Django. Look up Django CouchDB on Google and there are good tutorials. Remember, though, that CouchDB doesn’t have permissions, and CouchDB does end up as a bit of a large prereq if you are planning for others to use it. SQLite is easy, with it being a single file.

    Comment by Joe Tennies — 2009-9-1 @ 12:34 am

  7. Hey hey

    Thought I’d chime in as I’ve spent most of my time in GNOME yak shaving about sync and replication :]

    The replication you’d get from Couch is arguably mostly a result of the RESTful API and the simple versioning features. I hacked up a replication server for Tracker 0.7 (i.e. SPARQL) that allows a Couch DB to pull from a Tracker store (over HTTP). It took a couple of hours to get the basics. The feature that Couch provides that would (probably) be difficult to emulate in your setup is that it can store multiple states of a document when conflicts occur. One is the winner, but the others are accessible and are replicated as well.

    If you don’t want to throw away much code: I imagine it’s pretty easy to represent your current schema as JSON, and so it would be pretty easy to expose your data over HTTP using the Couch protocol. With tracker-replicator, it was 250 lines of Vala to get Tracker accessible by a Couch pull.

    Neither the native Couch (as far as I know) nor the approach I’m playing with will automatically solve conflict problems; you’ll have to provide that logic.
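What ‘providing that logic’ might look like for a rating is sketched below in Python. It assumes the application has already fetched the winning revision and the conflicting ones as plain dicts; the `rating` and `rated_at` fields and the rule itself (keep the most recently set rating) are made up for illustration.

```python
def resolve_conflict(winner, conflicts):
    """Return a resolved copy of the winner carrying the most recent rating.

    'winner' is the revision the database picked; 'conflicts' are the other
    stored revisions of the same document. Both use hypothetical fields.
    """
    candidates = [winner] + list(conflicts)
    newest = max(candidates, key=lambda doc: doc["rated_at"])
    resolved = dict(winner)
    resolved["rating"] = newest["rating"]
    resolved["rated_at"] = newest["rated_at"]
    return resolved


winner = {"_id": "song-a", "title": "Song A", "rating": 3, "rated_at": 100}
conflicts = [
    {"_id": "song-a", "title": "Song A", "rating": 5, "rated_at": 140},
    {"_id": "song-a", "title": "Song A", "rating": 1, "rated_at": 120},
]
resolved = resolve_conflict(winner, conflicts)
```

The application would then save the resolved document and discard the losing revisions; that write-back step is left out here.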

    I have some more thoughts on your setup, but would like to know what your schema is right now and what your options are for things like the N800 and the iPod. Is the N800 expected to sync or stream? Will it be running custom software, or will software on a host PC scrape it at sync time? Those kinds of things.

    John

    Comment by John Carr — 2009-9-1 @ 12:40 am

  8. Maybe Midgard is for you?
    http://www.midgard-project.org/

    Comment by Opoho — 2009-9-1 @ 12:47 am

  9. Well, this is by no means a new problem, and it never ceases to amaze me that there is no proper solution.
    I got around the synchronizing issue by using central server software like http://www.fireflymediaserver.org, since DAAP support is quite common these days.

    Of course this means there is one database, but the metadata (ratings) need to be present in the files as well. Like this you can only replicate ratings from the server, but not to it. (Unless you update the music files on the server)

    Of course I encountered a couple of issues when I listened to my music on my digital audio player (rockbox’d iriver H320). Some ratings were there, others weren’t. This is of course due to different types of metadata in the files. Some were picked up by Rockbox, others weren’t.
    I’ve tried many music players since then and found none to reflect all the rating tags.

    I found that replication from the server like this works alright. If it was possible to rate a song using DAAP (or UPnP, …) and the server would also update the actual metadata in the file, this would be even better, I think. Too bad development has ceased on Firefly. Is a feature like this in the works for Banshee/Rhythmbox/Songbird/Amarok?

    Comment by fizze — 2009-9-1 @ 9:04 am

  10. While it’s cool that couchdb can sync with another couchdb instance, it will not run e.g. on your ipod, so you will have to implement some kind of integration yourself.

    You might have come across this article already, it was an eye-opener for me:
    http://blog.labnotes.org/2007/09/20/read-consistency-dumb-databases-smart-services/

    Also have a look at the google wave architecture:
    http://www.waveprotocol.org/whitepapers/google-wave-architecture

    This way the problem is reduced to two components:
    - how do you identify resources distributed across multiple locations (this track on my server corresponds to that track on my ipod)
    - how do you handle update feeds to get things synced up

    Comment by Greg — 2009-9-1 @ 4:01 pm

  11. Also check out Prophet: http://syncwith.us/prophet/download

    Comment by Anonymous — 2009-9-2 @ 3:38 pm
