thomas.apestaart.org

Adventures in fingerprinting

Filed under: DAD,Fedora,GStreamer — Thomas @ 20:55

2011-08-09
20:55

One of the key concepts in my rewrite of DAD is that it should be possible to relate the same track across different files and computers. I have copies of files, and different encodings of the same track, spread across machines. Various applications I use for playback seem to exist in isolation on each machine, and so I tend to rate only occasionally, knowing that my ratings aren't centralized. And I get annoyed when banshee detects three copies of an album, and then orders them by track number, playing each track three times before moving on to the next one.

The logical way to do this is through acoustic fingerprinting. These are algorithms that extract certain features from an audio file and calculate an algorithm-specific 'fingerprint' for it. Usually, these fingerprints are not identical across different encodings of the same file, so you can't look up twins in a list; but the fingerprints can be compared to each other and a 'difference' within a certain confidence interval calculated.
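To make the 'difference' idea concrete, here is a minimal sketch in Python. It assumes a fingerprint is a list of 32-bit integers and compares two of them by counting differing bits. The function names and the 0.15 threshold are illustrative assumptions, not what any particular fingerprinting library does, and no attempt is made to align fingerprints that start at different offsets.

```python
def bit_error_rate(fp_a, fp_b):
    """Return the fraction of differing bits between two fingerprints.

    Assumption: each fingerprint is a list of 32-bit unsigned integers.
    Only the overlapping prefix is compared; no alignment search is done.
    """
    length = min(len(fp_a), len(fp_b))
    if length == 0:
        raise ValueError("cannot compare empty fingerprints")
    # XOR leaves a 1 bit wherever the two values disagree.
    differing = sum(bin(a ^ b).count("1") for a, b in zip(fp_a, fp_b))
    return differing / (32.0 * length)


def looks_like_same_track(fp_a, fp_b, threshold=0.15):
    """Hypothetical decision rule: 'close enough' means few differing bits."""
    return bit_error_rate(fp_a, fp_b) <= threshold
```

Two encodings of the same track would be expected to give a low error rate, while unrelated tracks should hover around one half, since their bits agree only by chance.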

Most fingerprinting algorithms come with a library that calculates a fingerprint and then submits it to a complementary web service, which can quickly compare it against known fingerprints to find twins.

In the past, either the client library/application or the web service (or both) was not open enough to be of interest for most Free Software people.

But recently, someone in the #morituri channel mentioned acoustid which only consists of open components. So, that seemed interesting enough to try out!

The chromaprint client-side library consists of a library, a sample application (linked against FFMPEG), and a python module with some sample scripts.

There is also a gst-chromaprint GStreamer plug-in on github. (As a side note, amazing to see that GStreamer plug-ins these days come for free! I recall the days when we had to do the work ourselves to write GStreamer plug-ins for libraries.)

So, after giving them a quick test run, I packaged up the whole set and it's now available for Fedora 14 and 15 in my package repositories.

The chromaprint-tools package contains fpcalc and you need to enable rpmfusion-nonfree to get its ffmpeg dependency.
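For scripting around fpcalc, a small parser is usually enough. The sketch below assumes fpcalc prints KEY=VALUE lines such as DURATION and FINGERPRINT; the function name and the sample values in the usage note are made up for illustration.

```python
def parse_fpcalc_output(text):
    """Parse KEY=VALUE lines into a dict with lower-cased keys.

    Assumption: the tool prints one KEY=VALUE pair per line. The value is
    split off at the first '=' only, so a fingerprint string that itself
    contains '=' survives intact.
    """
    result = {}
    for line in text.splitlines():
        key, separator, value = line.partition("=")
        if separator:
            result[key.strip().lower()] = value.strip()
    if "duration" in result:
        # Durations are handled as whole seconds in this sketch.
        result["duration"] = int(float(result["duration"]))
    return result
```

The text could come from something like subprocess.check_output(["fpcalc", path]), assuming fpcalc is on the PATH. Feeding the function "DURATION=256" and "FINGERPRINT=AQAA" on separate lines yields a dict with an integer duration and the fingerprint string.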

And after that, I created a Task in DAD for chromaprint, and now I have:

$ dad analyze chromaprint /opt/davedina/audio/albums/Afghan\ Whigs\ -\ Gentlemen/Afghan\ Whigs\ -\ Debonair.ogg
** Message: pygobject_register_sinkfunc is deprecated (GstObject)
/opt/davedina/audio/albums/Afghan Whigs - Gentlemen/Afghan Whigs - Debonair.ogg:
Found 1 results
- Found 4 recordings.
  - musicbrainz id: 62b2952a-4605-4793-8b79-9f9745ea5da5
    - artist: The Afghan Whigs
    - title: Debonair
  - musicbrainz id: 8ff78e73-f8bd-4d78-b562-c3e939fb93fb
    - artist: The Afghan Whigs
    - title: Debonair
  - musicbrainz id: a0d5ced6-43e8-450a-bf11-94f1f4520b92
    - artist: The Afghan Whigs
    - title: Debonair
  - musicbrainz id: d01ac720-874c-48d6-95c6-a2cb66f9d5d0
    - artist: The Afghan Whigs
    - title: Debonair

Sweet...

Now it's time to dump that in the couchdb database backend, and start identifying duplicate tracks.
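One simple way to spot duplicates from lookup results like these is to invert the mapping: collect, per MusicBrainz recording id, all the files that matched it. This is only a sketch under the assumption that the results are available as a dict from file path to recording ids; it is not DAD's actual code, and the function name is hypothetical.

```python
from collections import defaultdict


def group_by_recording(results):
    """Map each recording id to the files that matched it.

    `results` is assumed to be a dict: file path -> iterable of
    MusicBrainz recording ids. Only ids shared by more than one file
    are returned, since those are the duplicate candidates.
    """
    by_recording = defaultdict(set)
    for path, recording_ids in results.items():
        for recording_id in recording_ids:
            by_recording[recording_id].add(path)
    return {
        recording_id: sorted(paths)
        for recording_id, paths in by_recording.items()
        if len(paths) > 1
    }
```

Since one file can match several recordings, the same pair of files may show up under more than one id; a later step would have to merge those groups.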

Acoustid seems to be a relatively young project, but its maintainer is very active on the mailing list and it's filling a hole in the open world that I'm happy to see filled! Thank you Lukas.


Step 1

Filed under: GStreamer,Hacking — Thomas @ 13:24

2011-08-06
13:24

[root@ana ~]# rpm -Uhv /home/thomas/rpm/RPMS/x86_64/gstreamer011-*
Preparing...                ########################################### [100%]
   1:gstreamer011           ########################################### [ 33%]
   2:gstreamer011-devel     ########################################### [ 67%]
   3:gstreamer011-debuginfo ########################################### [100%]

Sweet!


Digital Audio Database

Filed under: DAD — Thomas @ 01:07

01:07

Over the past few years I've been quietly exploring ideas for my ideal music application. When I lived together in that great house in Gent, we had a hacky set of PHP code that let us import music, rate it, and have it play back. It worked for our purposes, but it was a collection of hacky PHP code and hacky Perl code.

Now I'm not saying I got that much better at coding, but I'm sure I improved a little bit. I've always put off actually writing the damn code to replace it, and hence I have a bunch of separate music collections - the music I was listening to in that house (properly rated, but very outdated), random collections of downloads, and now the collection of CD's I bought ever since leaving that house that never quite made it into my computer and are now being imported by the Lego robot.

Over a year ago, I re-implemented the mixing backend on top of GNonLin, which for the most part works as long as I don't actually dereference tracks played - something to figure out at some point. I have ideas about a pure web-based mixing backend as well, but I need to learn modern stuff like JQuery first.

But the missing key really was something that handles the database part well enough, because my application should work distributed - it should manage my tracks on all my devices, including all my computers, and be able to figure out that some crappy mp3 of a song on my laptop is the same song as the flac version at home on my NAS. So if I rate that crappy mp3 on my laptop, I want that taken into account when my home machine creates a mix.

And for me, CouchDB promised to fill that niche. Except of course that I spent the last year figuring out how I can marry CouchDB's approach to replication with my natural desire to denormalize. It turns out that's possible with CouchDB, but it involves doing a lot of client-side caching (and invalidating/changing on change notifications) and is already pretty slow when I do it for my 14000 test tracks.

So, I've decided to experiment in a world where normalization is not needed, and I'm just going to pick one central concept (The 'track'), store as much related data into that document as possible (on each computer, the fragments of audio files that represent that track; its ratings; what album it's on; which artists made the track), treat some of those values as caches for the last known value from parent documents, and just go for speed first and see how that goes.
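A track-centric document along those lines could look roughly like the sketch below. Every field name here is a hypothetical choice made for illustration, not DAD's actual schema; the point is only that fragments, ratings, and cached album and artist values all live inside one JSON-serializable document.

```python
import json


def new_track(title, artists, album):
    """Create a track document; artist and album names are cached copies."""
    return {
        "type": "track",
        "title": title,
        "artists": list(artists),  # last known values from artist documents
        "album": album,            # last known value from the album document
        "fragments": [],           # where the audio lives, per machine
        "ratings": [],
    }


def add_fragment(track, host, path, start, end):
    """Record that seconds start..end of a file on `host` are this track."""
    track["fragments"].append(
        {"host": host, "path": path, "start": start, "end": end}
    )


def add_rating(track, user, score):
    track["ratings"].append({"user": user, "score": score})


def average_score(track):
    """Average of all ratings, or None when the track is unrated."""
    scores = [rating["score"] for rating in track["ratings"]]
    return sum(scores) / float(len(scores)) if scores else None
```

Because the whole thing is one plain dict, json.dumps(track) gives something that can be stored as a single document, and a rating made on the laptop travels with the track when the document replicates.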

Yes, I am going to relax about not having everything perfect on the inside, so I can move on and write some more code that I can actually use.

I enjoyed trying to shoehorn CouchDB into my relational worldview a lot, but I want to see what life is like on the other side.

Before I was also very focused on migrating my old data (from the music I had when I was in the house in Gent) and its ratings. That's still important to me, but I think right now I'd more enjoy having something that lets me listen to and rate new music. When I originally wrote DAD I didn't expect to be getting so much music that wasn't from CD's. That's obviously not the case anymore, and I'm probably one of the last maniacs still buying CD's and worrying about getting them sample-perfect onto my NAS. In today's reality I need to deal with having the same track fifteen times, in various qualities, and I wish my computer handled that for me.

As part of this shift in approach, both in how I use CouchDB and what music I now want to listen to, I'm going to build the code from the opposite end to the one I've been working from, focusing on smaller building blocks and getting the experience right. Step one will be collecting the right data about audio files, splitting them into individual fragments, and loading music in two passes into the databases. I'll focus on having small tools that show that the application can add tracks quickly and start playing them, filling in the more costly information later, and show that the GUI frontend can update these in realtime in the database view.

And, as usual, I like to shoehorn in a use for my python command class, so I'll be using that as a collection point for these little tools as I work my way up.

After plugging in the right plumbing, in twenty minutes I had this on top of my old code:

$ dad analyze level /mnt/nas/media/davedina/audio/albums/Nirvana\ -\ In\ Utero/Nirvana\ -\ All\ Apologies.ogg
** Message: pygobject_register_sinkfunc is deprecated (GstObject)
Successfully analyzed file /mnt/nas/media/davedina/audio/albums/Nirvana - In Utero/Nirvana - All Apologies.ogg.
2 fragment(s)
- fragment 0: 0:00:00.000000000 - 0:03:50.230204081
  - peak 0.240 dB (105.672 %)
  - rms -14.199868248342282 dB
  - peak rms -8.913940439528652 dB
  - 95 percentile rms -12.001385041642244 dB
  - weighted rms -14.202287606952533 dB
  - weighted from 0:00:01.205986394 to 0:03:39.612879818
- fragment 1: 0:23:59.107482993 - 0:31:32.227482993
  - peak 0.526 dB (112.876 %)
  - rms -14.742109190444983 dB
  - peak rms -8.729096757819718 dB
  - 95 percentile rms -11.56951163744373 dB
  - weighted rms -14.742603253857133 dB
  - weighted from 0:23:59.223582765 to 0:31:18.498684807

In case you were wondering, this shows the code correctly determining that the 'All Apologies' track on the In Utero CD contains in fact two songs. It always annoys the hell out of me when any of the music players I use doesn't play anything for 20 minutes just because Kurdt thought that would be amusing all those years ago.
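The general idea of splitting a file into fragments at long silences can be sketched as follows. This is not the actual DAD code: it assumes the levels are already available as one RMS value in dB per fixed-size window, and the silence threshold and minimum gap are arbitrary illustrative defaults.

```python
def find_fragments(levels, window, silence_db=-60.0, min_gap=10.0):
    """Return (start, end) times in seconds of the non-silent fragments.

    `levels` holds one RMS value in dB per consecutive window of `window`
    seconds. A stretch of silence only splits the file when it lasts at
    least `min_gap` seconds; shorter pauses stay inside the fragment.
    """
    fragments = []
    start = None      # start time of the fragment being built
    last_loud = None  # end time of the most recent non-silent window
    for index, level in enumerate(levels):
        if level <= silence_db:
            continue
        time = index * window
        if start is None:
            start = time
        elif time - last_loud >= min_gap:
            # The silence was long enough: close the previous fragment.
            fragments.append((start, last_loud))
            start = time
        last_loud = time + window
    if start is not None:
        fragments.append((start, last_loud))
    return fragments
```

With one-second windows, three loud windows followed by twenty silent ones and two more loud ones come out as two fragments, while a two-second pause in the middle of a song does not split it.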

(In case you were really astute, you may have noticed that this code claims that the peak of these fragments is over unity, which would be weird and wrong you would think. Monty could give you a long and interesting explanation on how that is in fact natural and every time I read it I still don't get it, even with my audio engineering background, and I still don't know if this apparent peak level is a bad thing, but in practice my playback code auto-levels anyway and consistently reduces volume on tracks, so I don't think it matters anyway...)
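As a side calculation, the percentages in the output above appear to be consistent with reading the dB figure as a power ratio (ten to the power of dB over ten) rather than an amplitude ratio. That is an observation from the numbers, not a statement about how the code computes them. A small sketch with hypothetical helper names:

```python
def decibels_to_power_percent(db):
    """Convert dB to a percentage, treating the value as a power ratio."""
    return 100.0 * 10.0 ** (db / 10.0)


def decibels_to_amplitude_percent(db):
    """Convert dB to a percentage, treating the value as an amplitude ratio."""
    return 100.0 * 10.0 ** (db / 20.0)
```

For the second fragment's peak of 0.526 dB, the power reading gives roughly 112.9 %, in line with the printed 112.876 %, while the amplitude reading would give only about 106 %. Either way, anything above 0 dB comes out over 100 %.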


Removing objects from running Python processes using GDB

Filed under: Flumotion,Python — Thomas @ 11:47

2011-08-05
11:47

This week at work we ran into a problem where one of our Python processes was consuming close to 3 GB of memory because it wasn't properly cleaning up a list. Because of other bugs this process could not be easily restarted without triggering other problems, so our core team asked for some suggestions and I told them "Why don't you try cleaning up the Python list using GDB and the Python C API?" I had a vague recollection of someone on our team doing something like this a few years ago.

I also asked them to blog about it, because there aren't that many resources readily findable on the subject.

So here is Andoni's take on the problem.

If any Pythonista can suggest how he could have avoided the segfault during garbage collection, please let us know!
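For readers who have not seen the general technique, one commonly described way to run Python code inside a live process is to attach GDB and call into the C API: take the GIL with PyGILState_Ensure, run a string with PyRun_SimpleString, then release the GIL. The sketch below only builds such a gdb command line. It is not necessarily what Andoni did, the module and list names in the usage note are hypothetical, and poking at a production process this way can crash it.

```python
def gdb_inject_command(pid, python_source):
    """Build a gdb argv that runs one line of Python inside process `pid`.

    Assumptions: the target is a CPython process whose C API symbols are
    visible to gdb, and `python_source` is a single line of code.
    """
    if "\n" in python_source:
        raise ValueError("only single-line snippets are supported here")
    # Escape the snippet so it survives as a C string literal in gdb.
    escaped = python_source.replace("\\", "\\\\").replace('"', '\\"')
    return [
        "gdb", "-p", str(pid), "-batch",
        # Take the GIL; the returned state lands in gdb's $1.
        "-ex", "call (int) PyGILState_Ensure()",
        "-ex", 'call (int) PyRun_SimpleString("%s")' % escaped,
        # Hand the saved state back to release the GIL.
        "-ex", "call (void) PyGILState_Release($1)",
    ]
```

A call such as gdb_inject_command(1234, "import mymodule; del mymodule.leaky_list[:]") returns the argument list; actually running it (for example with subprocess) is left out on purpose.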


Transcoding webinar

Filed under: Flumotion — Thomas @ 18:34

2011-06-30
18:34

I'm doing a webinar in a couple of minutes on Video In The Cloud: Live and On-demand Encoding and Delivery. A bit late to announce that if you want to catch it (I've been busy, sorry), but it will probably get recorded and be made available later on.

I'm there with people from Amazon, Sorenson and Zencoder, so that should give a good Q&A session.

It's a challenge doing that from Europe at the moment - I left the office past 23, and Neil even later - but it's a good opportunity to talk about the things we're working on.


Present Perfect
