
How to Not Suck at Transcoding Music

Sometimes I get really interested in something and can’t sleep and end up spending all night working on it. This was one of those nights.

I wrote flacsync a while ago in order to make an MP3 with suitable tags for every FLAC I own. I do this so that I can actually play my favorite music on my iPhone and also so that I can keep a lightweight copy of my music database on my laptop, which has far less space than my desktop.

The idea was to keep a database containing an MD5 hash for every FLAC I have. Whenever I run the script, it checks that hash to see whether the FLAC has changed and needs to be re-transcoded. If so, it transcodes the file and stores the new hash in the database.

I made some remarkably stupid initial design choices. I knew that I wanted to thread it in order to maximize throughput, but I had some ridiculous bottlenecks. For example, for some reason, I thought it would be a good idea to use a SQLite database to store the hashes and then have a complicated DBWorker thread that would interface with all the processing threads. Although I got this original design working, it was slow. It took maybe 30 minutes to run through all my FLACs.

I later redesigned the script to just load a Python dictionary containing all the hashes from a file into memory. Then, I could update the dictionary freely and wouldn’t even really need to worry about locking, since each track was only ever handled by one thread.
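Here is a minimal sketch of what loading and saving such a dictionary could look like. It is not the actual flacsync code: the JSON format and the names `load_hashes` and `save_hashes` are assumptions for illustration, and it is written for Python 3 even though the script itself targets Python 2.

```python
import json
import os


def load_hashes(db_path):
    """Return the {flac_path: md5_hex} dict, or an empty dict on first run."""
    if not os.path.exists(db_path):
        return {}
    with open(db_path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def save_hashes(db_path, hashes):
    """Write the dict to a temporary file, then rename it over the old one."""
    db_dir = os.path.dirname(db_path)
    if db_dir:
        os.makedirs(db_dir, exist_ok=True)
    tmp_path = db_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(hashes, fh)
    # os.replace swaps the file in one step, so an interrupted run
    # leaves the previous database intact.
    os.replace(tmp_path, db_path)
```

Writing to a temporary file first is a design choice of this sketch, not something the original script is known to do.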

But this was still slow for a few reasons:

  • I wasn’t ordering the list of files intelligently at all. It would make sense to try to process the most recently changed files first, wouldn’t it?
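  • Python’s [cci]threading[/cci] module can’t actually spread Python code across multiple processors. Because of CPython’s global interpreter lock, only one thread executes Python bytecode at a time, so CPU-bound Python work is effectively limited to one core.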
  • Running [cci]md5sum[/cci] on an entire FLAC is slow and therefore dumb.

So, I redesigned the whole script to use the [cci]multiprocessing[/cci] module’s [cci]Pool[/cci] abstraction, which lets you map a function over a list using a pool of worker processes and then gather the results. Now, each worker returns either an indication that the file didn’t change or a new hash for the file. The system tallies up all the new hashes at the end, updates the database, saves it to disk, and exits. Oh, and when it first starts up and finds all the FLACs in my music directory, it sorts them so that the most recently changed files come first.
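The flow described above can be sketched roughly like this. It is an illustration under assumptions, not the real script: `find_flacs`, `check_track`, and `sync` are hypothetical names, the default of four workers is arbitrary, and the code is written for Python 3.

```python
import hashlib
import os
from multiprocessing import Pool


def find_flacs(music_dir):
    """Collect every .flac under music_dir, most recently modified first."""
    paths = []
    for root, _dirs, files in os.walk(music_dir):
        for name in files:
            if name.lower().endswith(".flac"):
                paths.append(os.path.join(root, name))
    paths.sort(key=os.path.getmtime, reverse=True)
    return paths


def check_track(job):
    """Worker: take (path, old_hash) and return (path, new_hash or None)."""
    path, old_hash = job
    with open(path, "rb") as fh:
        new_hash = hashlib.md5(fh.read(4096)).hexdigest()
    if new_hash == old_hash:
        return path, None  # unchanged, nothing to do
    # A real worker would transcode the FLAC here before reporting back.
    return path, new_hash


def sync(music_dir, hashes, workers=4):
    """Check every FLAC in parallel and fold the new hashes into the dict."""
    jobs = [(path, hashes.get(path)) for path in find_flacs(music_dir)]
    with Pool(workers) as pool:
        results = pool.map(check_track, jobs)
    for path, new_hash in results:
        if new_hash is not None:
            hashes[path] = new_hash
    return hashes
```

Only the parent process touches the dictionary, so no locking is needed. On platforms where `multiprocessing` spawns fresh interpreters, `sync` should be called from under an `if __name__ == "__main__":` guard.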

Moreover, I had been lazy and was just calling the [cci]md5sum[/cci] program on the entire FLAC, so I switched to Python’s [cci]hashlib[/cci] module and now take the MD5 of only the first 4096 bytes of the FLAC. This is pretty okay because the header information is almost always entirely contained there.
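A partial hash like that takes only a few lines with [cci]hashlib[/cci]. This is a sketch rather than the script's own code, and the names `header_md5` and `HEADER_BYTES` are made up for the example.

```python
import hashlib

HEADER_BYTES = 4096  # how much of the file to fingerprint


def header_md5(path, nbytes=HEADER_BYTES):
    """Return the MD5 hex digest of the first nbytes of the file."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read(nbytes)).hexdigest()
```

The trade-off is that a change confined to bytes past the first 4096 goes unnoticed, which is the price of not reading the whole file.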

The result is that I can fly through my entire music library in like a second (okay some of that is coming from disk cache — I haven’t tried it on a cold boot yet). Transcoding on my computer (piping [cci]flac[/cci] to [cci]lame[/cci] and then copying tags over) takes about 15 seconds on average.
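For reference, piping [cci]flac[/cci] into [cci]lame[/cci] from Python might look something like the sketch below. The function names are hypothetical, and the `-V 2` quality setting is an assumption, since the encoder settings flacsync actually uses aren't stated here. Tag copying is left out.

```python
import subprocess


def transcode_commands(flac_path, mp3_path, quality="2"):
    """Build the decoder and encoder command lines for one track."""
    # flac: decode to stdout, without progress output
    decode = ["flac", "--decode", "--stdout", "--silent", flac_path]
    # lame: read WAV data from stdin ("-"), VBR quality is an assumed setting
    encode = ["lame", "--quiet", "-V", quality, "-", mp3_path]
    return decode, encode


def transcode(flac_path, mp3_path):
    """Run flac | lame and return True if both processes exited cleanly."""
    decode, encode = transcode_commands(flac_path, mp3_path)
    decoder = subprocess.Popen(decode, stdout=subprocess.PIPE)
    encoder = subprocess.Popen(encode, stdin=decoder.stdout)
    # Close our copy of the pipe so flac gets SIGPIPE if lame exits early.
    decoder.stdout.close()
    encoder_rc = encoder.wait()
    decoder_rc = decoder.wait()
    return decoder_rc == 0 and encoder_rc == 0
```

Connecting the two processes directly means the decoded audio never has to be written to disk as an intermediate WAV file.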

So checking an already-synced library went from 30 minutes to a few seconds. Pretty good speedup.

I guess now I should go to work.

flacsync: Automatically Sync FLACs to MP3s

In an effort to make organizing my music collection suck less, I wrote a script to automate the process of converting my FLAC files to MP3s for when I want to listen to them with my MP3 player.

Previously, I had no automated way to do this. My music collection changes rapidly, so I needed a way to convert FLACs that haven’t yet been converted (or have changed since the last conversion) to MP3s. The first step of this process is to decode the FLAC and then pipe that to an MP3 encoder.

Next, I had to extract as much tag information as I could out of the FLAC and convert it into ID3 tags. Finally, if there are image files in the directory containing the FLACs, the script automatically embeds those images into the MP3 file (and tries to determine what type of image they are).
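To give a feel for the tag conversion, here is a small sketch of mapping FLAC (Vorbis comment) fields to ID3v2.4 text frame IDs, plus a filename-based guess at the ID3 picture type. All of it is illustrative: the field list is a small subset, the filename conventions are assumptions, and reading "type of image" as the ID3 picture type (rather than, say, the MIME type) is my interpretation. In the real script, Mutagen does the actual reading and writing of tags.

```python
import os

# Vorbis comment field -> ID3v2.4 text frame ID (a small, common subset)
VORBIS_TO_ID3 = {
    "title": "TIT2",
    "artist": "TPE1",
    "album": "TALB",
    "tracknumber": "TRCK",
    "date": "TDRC",
    "genre": "TCON",
}

# ID3 APIC picture type codes
PICTURE_OTHER = 0
PICTURE_FRONT_COVER = 3
PICTURE_BACK_COVER = 4


def map_tags(vorbis_tags):
    """Map {field: [values]} to {frame_id: [values]}, dropping unknown fields."""
    frames = {}
    for field, values in vorbis_tags.items():
        frame_id = VORBIS_TO_ID3.get(field.lower())
        if frame_id is not None:
            frames[frame_id] = list(values)
    return frames


def guess_picture_type(filename):
    """Guess the APIC picture type from the file name (assumed conventions)."""
    stem = os.path.splitext(os.path.basename(filename))[0].lower()
    if stem in ("cover", "front", "folder"):
        return PICTURE_FRONT_COVER
    if stem == "back":
        return PICTURE_BACK_COVER
    return PICTURE_OTHER
```

Fields without an obvious ID3 equivalent are simply dropped in this sketch, which is why the conversion can only carry over "as much as it can".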

It also has a configurable number of worker threads, so you can process the files in parallel. It keeps a database of hashed files in [cci]~/.flacsync/db[/cci], so when you re-run it, it will only re-transcode new or changed FLAC files.

You can find it here. It uses some Unix commands like [cci]find[/cci]. It requires Python 2, Mutagen, FLAC, and LAME.