Sometimes I get really interested in something and can’t sleep and end up spending all night working on it. This was one of those nights.
I wrote flacsync a while ago in order to make an MP3 with suitable tags for every FLAC I own. I do this so that I can actually play my favorite music on my iPhone and also so that I can keep a lightweight copy of my music database on my laptop, which has far less space than my desktop.
The idea was to keep a database containing an MD5 hash for every FLAC I have and whenever I run the script, check that hash to see if the FLAC has changed and needs to be re-transcoded. If so, transcode it and store the new hash to the database.
I made some remarkably stupid initial design choices. I knew that I wanted to thread it in order to maximize throughput, but I had some ridiculous bottlenecks. For example, for some reason, I thought it would be a good idea to use a SQLite database to store the hashes and then have a complicated DBWorker thread that would interface with all the processing threads. Although I got this original design working, it was slow. It took maybe 30 minutes to run through all my FLACs.
I later redesigned the script to just load a Python dictionary containing all the hashes from a file into memory. Then, I could update the table freely and wouldn’t even really need to worry about locking since only one thread acted on a track.
But this was still slow for a few reasons:
- I wasn’t ordering the list of files intelligently at all. It would make sense to try to process the most recently changed files first, wouldn’t it?
- Python’s [cci]threading[/cci] module isn’t actually capable of performing tasks on multiple processors. It can still only perform on at most one core.
- Running [cci]md5sum[/cci] on an entire FLAC is slow and therefore dumb.
So, I redesigned the whole script to use the [cci]multiprocessing[/cci] module’s [cci]Pool[/cci] abstraction where you can apply a function onto a list with a pool of workers and then gather the results. Now, each worker returns either an indication that the file didn’t change or a new hash for the file. The system tallies up all the new hashes at the end, updates the database, saves it to disk, and exits. Oh and when it first starts up and finds all the FLACs in my music directory, it sorts them so that the most recently changed files are first.
Moreover, I was lazy in that I was just calling the [cci]md5sum[/cci] program on the entire FLAC, so I used Python’s [cci]hashlib[/cci] module to only take the MD5 of the first 4096 bytes of the FLAC. This is pretty okay because the header information is almost always entirely contained there.
The result is that I can fly through my entire music library in like a second (okay some of that is coming from disk cache — I haven’t tried it on a cold boot yet). Transcoding on my computer (piping [cci]flac[/cci] to [cci]lame[/cci] and then copying tags over) takes about 15 seconds on average.
So 30 minutes just to check a already-sync’d database to a few seconds. Pretty good speedup.
I guess now I should go to work.