Song Tagging and Metadata Blog

Saturday, 19 December 2009

Better matching in Jaikoz 3.4.2

Automatic matching is difficult but I've made a number of changes to improve the matching in Jaikoz in the latest release.

Jaikoz searches Musicbrainz for possible matches, then rescores them taking additional information into account to find the best match. It does this because the original Musicbrainz score only takes the search terms into account, but we need to consider more values. For example, we do not specify a duration in a search because some songs have no duration within Musicbrainz and so would never be returned, but once we have some potential results we want to give a higher score to those whose duration matches the original song. Musicbrainz uses Lucene for searching, with its own custom analyzer for deciding which songs are returned by a search, and this latest release of Jaikoz uses the exact same analyzer to ensure scoring is compatible. This is one advantage of working on both Musicbrainz and Jaikoz!
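As a rough illustration of the rescoring idea (not the actual Jaikoz code), the sketch below adds a bonus to candidates whose duration is close to the file's duration. The class name, the 3 second tolerance and the 10 point bonus are all assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical rescoring sketch; names and weights are assumptions, not Jaikoz code. */
final class RescoreSketch {

    /** A candidate returned by the search; durationMs is null when no duration is known. */
    static final class Candidate {
        final String title;
        final double searchScore; // score as returned by the search
        final Long durationMs;

        Candidate(String title, double searchScore, Long durationMs) {
            this.title = title;
            this.searchScore = searchScore;
            this.durationMs = durationMs;
        }
    }

    static final long TOLERANCE_MS = 3000;   // assumed tolerance
    static final double DURATION_BONUS = 10; // assumed bonus

    /** Search score plus a bonus when the candidate duration is close to the file duration. */
    static double rescore(Candidate c, long fileDurationMs) {
        double score = c.searchScore;
        if (c.durationMs != null && Math.abs(c.durationMs - fileDurationMs) <= TOLERANCE_MS) {
            score += DURATION_BONUS;
        }
        return score;
    }

    /** Returns the candidate with the highest rescored value. */
    static Candidate best(List<Candidate> candidates, long fileDurationMs) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Candidate c) -> rescore(c, fileDurationMs)).reversed());
        return sorted.get(0);
    }
}
```

A candidate without a duration is still considered, it just cannot earn the bonus, which mirrors the reason for leaving duration out of the search itself.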

When searching for a track we now consider more variations of the name, because songs entered into Musicbrainz are normalized. For example, We Have Explosive (Pt. 5) should be entered into Musicbrainz as We Have Explosive, Part 5, but it might not have been. This normalization is detailed in the Style Guidelines, and in Jaikoz we now check for the title as it appears in your metadata and also, as far as possible, as a normalized version.
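To make the idea of title variations concrete, here is a minimal sketch that produces both the title as found and a normalized form for the "(Pt. 5)" case. The regular expression covers only this one pattern and is an assumption for illustration; the real Style Guidelines cover far more cases.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch: generate the search variations for a title. */
final class TitleVariations {

    // Matches "(Pt. 5)", "(Part 5)", "(pt 5)" and similar; a simplified, assumed rule.
    private static final Pattern PART =
            Pattern.compile("\\s*\\((?:Pt|Part)\\.?\\s*(\\d+)\\)", Pattern.CASE_INSENSITIVE);

    /** Rewrites "(Pt. N)" as ", Part N". */
    static String normalize(String title) {
        return PART.matcher(title).replaceAll(", Part $1").trim();
    }

    /** The title as it appears in the metadata, plus the normalized form if it differs. */
    static Set<String> variations(String title) {
        Set<String> result = new LinkedHashSet<>();
        result.add(title);
        result.add(normalize(title));
        return result;
    }
}
```

Searching on every entry in the returned set means a track is found whether or not the Musicbrainz entry was normalized.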

We also work around common data entry errors. For example, Musicbrainz Issue #5538 shows that users usually enter song titles as 'No. 1', but in a large minority of cases enter 'No.1'. Jaikoz works around this issue.
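One way such a workaround could look, purely as an assumed sketch, is to derive both spellings from whatever the metadata contains so that either form can be searched for. The class and method names are hypothetical.

```java
/** Hypothetical sketch: derive both the 'No. 1' and 'No.1' spellings of a title. */
final class NumberAbbreviationSketch {

    /** "Symphony No.1" becomes "Symphony No. 1"; already spaced titles are unchanged. */
    static String spaced(String title) {
        return title.replaceAll("\\bNo\\.(?=\\d)", "No. ");
    }

    /** "Symphony No. 1" becomes "Symphony No.1"; already compact titles are unchanged. */
    static String compact(String title) {
        return title.replaceAll("\\bNo\\.\\s+(?=\\d)", "No.");
    }
}
```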

Cluster Albums finds albums by artists with the same name but a different Release Id and tries to move the songs so that they are all on the same Release Id. Note this is different to what 'Cluster' means in Musicbrainz Picard, and perhaps I should have called it something different. Previously it did this by matching title against title for each Release Id being used and picking the Release Id which had the most matches, but now this has been improved. Firstly, we use fuzzy matching on the title, allowing for normalization as explained earlier. Secondly, if all but a couple of tracks are successfully matched to one Release Id, we allow matches on Acoustic Id and song length to shoehorn the remaining tracks into a potential release. This is really useful when the same song exists on two albums but is radically renamed between the two.
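The "pick the Release Id with the most title matches" step can be sketched as below. This is an assumed simplification: the fuzzy match is stood in for by a comparison that ignores case and punctuation, and the Acoustic Id and song length fallback is left out.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of choosing the release that matches the most song titles. */
final class ClusterSketch {

    /** Crude stand-in for fuzzy matching: lower case and strip everything but letters and digits. */
    static String key(String title) {
        return title.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "");
    }

    /** Counts how many of the song titles have a matching track on the release. */
    static int matches(List<String> songTitles, List<String> releaseTracks) {
        int count = 0;
        for (String song : songTitles) {
            for (String track : releaseTracks) {
                if (key(song).equals(key(track))) {
                    count++;
                    break; // count each song at most once
                }
            }
        }
        return count;
    }

    /** Returns the Release Id with the most matches, or null if there are no releases. */
    static String bestRelease(List<String> songTitles, Map<String, List<String>> releases) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, List<String>> e : releases.entrySet()) {
            int count = matches(songTitles, e.getValue());
            if (count > bestCount) {
                bestCount = count;
                best = e.getKey();
            }
        }
        return best;
    }
}
```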

Tuesday, 15 December 2009

Musicbrainz Developments

I went to the yearly Musicbrainz summit in Nuremberg a couple of weeks ago. A good time was had by all, and the old town is certainly a lovely place.

We discussed future developments, and there is plenty of good news for Jaikoz users.

Musicbrainz were very happy with the work I'd done rewriting the existing search system, and I am the main developer of the search for the new NGS release being developed, so I'm able to understand how search works in Musicbrainz and improve it to suit Jaikoz better. This should ensure you get better results from Jaikoz than from any other tagger.

NGS has lots of new features, including recognising the same recording on different albums and handling multi-disc releases better - this will help with getting better matching.

Work is going to be done on creating a genre system from the existing folksonomy cloud, so this will help with a field that is currently quite poorly served by Musicbrainz. Knowing the attention to detail of Musicbrainz editors, I'm hopeful that in time Musicbrainz may create the definitive genre list.

Historically Musicbrainz have been very cautious about having APIs to allow data to be added to Musicbrainz. Sensibly they do not want it to end up a mess like freedb, but they have warmed to the idea now and I think this is a necessary move to keep on top of the explosion in music being recorded. So in the New Year I'm going to have a think about taking advantage of this loosening up of data entry.

And Musicbrainz have just hired their first full-time developer, Kuno Woudt, a well established Musicbrainz editor and developer, so this should speed up Musicbrainz development.

Friday, 13 November 2009

New version of Jaudiotagger

I've just released the new version of Jaudiotagger, the tagging library used by Jaikoz and quite a few other applications. I've been happily removing code from the project and we now have a crisper, simpler API without any loss of functionality, and for good measure the repository has been converted from CVS to MVN.

Next question is: should we move to Project Kenai?

Wednesday, 23 September 2009

Better Searching in Jaikoz

I've been working hard with Musicbrainz over the last couple of months rewriting the Search Server to give better results, and I'm pleased to say it's been a success and has now been released!

This will improve the results you get with any version of Jaikoz.

Wednesday, 19 August 2009

Fulfilling The Roadmap

Jaikoz 3.2.2 is out now, fixing the remaining problems in 3.2.0. 3.2 goes some way towards the roadmap with a simpler interface and simpler matching options. But I didn't manage the last item: more automated tests!

As a result there were a few issues with 3.2.0, so lesson learnt: I do need more automated tests, and I should always do a beta for these major releases!

I've also been busy rewriting the Musicbrainz Search Server. If all goes to plan this update should make its way onto Musicbrainz very soon.

Wednesday, 8 April 2009

Jaikoz Future Roadmap

Okay here are the priorities for the next few months.

1. Fix memory management so that memory consumption isn't tied to the number of songs loaded, by using a database. This should allow you to load your complete collection if you wish, AND subsequent restarts of Jaikoz should be able to use the cached data instead of reading from a file that hasn't changed, which will greatly speed up the initial file loading.

2. Simplify Matching for better results. Jaikoz has many options to change how the matching works, but they are difficult to understand and some only make sense for individual files. I'm going to drop some of these options and replace them with ones that make more sense, such as:

Match the album that best matches existing metadata OR always prefer original albums even if there is a better match with a compilation.

3. Improve results by improving Musicbrainz. I'm going to work on the Musicbrainz Search Server, fixing issues and improving performance.

4. Simplify Interface. New users find it difficult to understand the separate tag and analyse tasks and the user interface. I'm going to either simplify it further or provide a new Simple Mode, with the existing interface becoming Advanced Mode. I'm thinking along the lines of how Azureus retrofitted a new default interface when it became Vuze.

5. Implement more automated tests for the GUI. I have many for the reading and writing of files but not for the interface itself. I need to do this to prevent the occasional regression cropping up as it has in the past.
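Item 1 of the roadmap above relies on reusing cached metadata when a file hasn't changed. A minimal sketch of that check follows, assuming the file's last-modified time is used to detect changes and with an in-memory map standing in for the database. All names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch: only re-read a file's metadata when the file has changed. */
final class MetadataCacheSketch {

    private static final class Entry {
        final long lastModified;
        final String metadata;

        Entry(long lastModified, String metadata) {
            this.lastModified = lastModified;
            this.metadata = metadata;
        }
    }

    // Stand-in for the database; a real cache would survive restarts.
    private final Map<String, Entry> cache = new HashMap<>();

    /** Number of times the (slow) file reader was actually invoked. */
    int fileReads = 0;

    /** Returns cached metadata if the timestamp is unchanged, otherwise reads and caches it. */
    String load(String path, long lastModified, Supplier<String> fileReader) {
        Entry cached = cache.get(path);
        if (cached != null && cached.lastModified == lastModified) {
            return cached.metadata;
        }
        fileReads++;
        String metadata = fileReader.get();
        cache.put(path, new Entry(lastModified, metadata));
        return metadata;
    }
}
```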

Jaikoz 2.9.2 available

Fixed a couple of problems, including a regression in 2.9.0 whereby Jaikoz was incorrectly matching tracks with no album name against albums in the local cache, causing a slowdown for those of us with large caches.

Monday, 30 March 2009

Performance testing Musicbrainz Search Index

Musicbrainz have been looking at improving the performance of their search server. When search queries are sent to Musicbrainz they do not access the database. Instead they access a Search Index that is built from the database, contains the information to be searched, and allows full text search and other features not usually available from a database.

Currently Lucene is used, but it is accessed from Python via PyLucene. In an attempt to boost performance a simplified version of the search code was developed using pure Java, available from here, but this still didn't seem to be giving the required performance enhancements.

I've tried some tuning mechanisms and code changes to see if I can make some improvements, or at least get some benchmark figures.

The tests were performed on a MacBook Pro 2.66GHz Core 2 Duo with 4GB of 1067 DDR3 RAM running OSX 10.5, so it is a good laptop but doesn't compare so well with a desktop or server. (I used a MacBook because it was 64 bit, so I could use 64 bit Java to address more than 2GB of memory, whereas my Windows PC is not, and I have no native Linux desktop.)

Summary of results:
Track OR query, single threaded test with insufficient memory, index on hard drive: 1.91 queries/sec
Track OR query, single threaded test with enough memory, index on hard drive: 24.66 queries/sec
Track OR query, best multi threaded test: 43.96 queries/sec
Track query, best multi threaded test: 59.47 queries/sec
Release query, threaded test: 252.75 queries/sec

I created a test set of 10,011 track titles and 10,011 releases from the database, and I rebuilt the indexes with the StopWords Filter removed to solve the stop words bug. I then created a test program to fire requests at Jetty (the servlet searcher) from multiple threads, but I was getting inconsistent results, and since Jetty was not the thing being tested I removed it from the equation.

I restarted by creating a test program that loaded pairs of track/release records into a queue, and then created a number of threads that each read the next pair off the queue and sent it directly to a Search class, which performed the search, found the best hits and returned. The test results are for this setup. I found myself running a few tests, then making some more code changes to see if they made any difference. This is what I found:
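The shape of that test program can be sketched as follows. This is not the original benchmark code: the class name is made up, and the search itself is passed in as a callback so the sketch runs without a Lucene index.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical sketch of a queue-fed, multi-threaded search benchmark. */
final class SearchBenchmarkSketch {

    /** Feeds every track/release pair to `search` using `threads` workers; returns the number completed. */
    static int run(List<String[]> pairs, int threads, Consumer<String[]> search) {
        BlockingQueue<String[]> queue = new LinkedBlockingQueue<>(pairs);
        AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                String[] pair;
                // Each worker keeps taking the next pair until the queue is empty.
                while ((pair = queue.poll()) != null) {
                    search.accept(pair);
                    completed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        }
        return completed.get();
    }

    /** Throughput in queries per second for a run that took elapsedNanos. */
    static double queriesPerSecond(int queries, long elapsedNanos) {
        return queries / (elapsedNanos / 1_000_000_000.0);
    }
}
```

Timing a run with System.nanoTime() before and after run(...) and passing the difference to queriesPerSecond gives a figure comparable in kind to those in the summary.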

Index Directory Location:
The MacBook has an SSD drive and an external hard drive, and of course the index could also be loaded into memory. I expected that loading the index into memory would give vastly superior performance, but when I tried loading the Track Index (about 2.5 GB) the performance was terrible. I concluded that it must be swapping memory because of the memory required by the OS itself. I did some tests with the smaller Release Index as well, and here the RAM Directory performed as well as the others but not any better. The SSD performed much better than the hard drive when I ran tests with insufficient memory, but when enough memory was allocated there was little difference between the two.

Memory allocated to JVM: With the default (of 64MB) it performed very badly on the hard drive, but when I increased it to 2GB there was a big improvement, but further adjustments (up and down) didn't make much difference. So it seems you need a reasonable but not ridiculous memory for decent results.

Code Improvements: I posted a few questions on the Lucene mailing list and was given some optimizations for how the index is opened
new IndexSearcher(IndexReader.open(new NIOFSDirectory(new File(indexDir + "track_Index"), null), true));
and for iterating through the results.
Opening the index using this new method doubled the number of queries that could be processed by reducing contention on the index searcher. I haven't yet tried the iterating query improvements.

Number of Threads: I ran tests using just a single thread at first, then increased the number of threads to find the optimum throughput. With the code improvement in place I found you needed at least 30 threads for the best results, but additional threads didn't give further improvements. If the tests were performed on a quad core CPU or better I expect more threads would give more gains.

Query type: I tried an OR query and a simple query against the track index
type=track&query=track:"trackname" OR release:"releasename"
type=track&query=track:"trackname"
Of course the OR query was slower, but not exponentially so - it was about 30% slower.

I then tried querying the release index. Because it is much smaller the results were much better, with a 400% improvement in speed.
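For anyone wanting to check the percentages against the summary figures, the arithmetic is sketched below. It assumes the comparisons are between the best multi threaded runs listed in the summary, which the post doesn't state explicitly. On those figures the OR query comes out roughly a quarter slower than the simple query, and the release query a little over four times faster.

```java
/** Hypothetical sketch: relate the summary throughput figures to each other. */
final class ThroughputRatios {

    /** How much slower `slower` is than `faster`, as a percentage of `faster`. */
    static double percentSlower(double slower, double faster) {
        return (1 - slower / faster) * 100;
    }

    /** How many times faster `fast` is than `baseline`. */
    static double speedup(double fast, double baseline) {
        return fast / baseline;
    }

    public static void main(String[] args) {
        // Figures taken from the summary of results (queries/sec).
        System.out.printf("Track OR query vs track query: %.1f%% slower%n", percentSlower(43.96, 59.47));
        System.out.printf("Release query vs track query: %.2fx%n", speedup(252.75, 59.47));
    }
}
```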

The full results are available here

and the amended zip of the code can be found here.

Wednesday, 25 March 2009

Unable to find 'is this it' album bug solved - almost

There is a longstanding bug in Musicbrainz that makes it difficult to find songs that contain a number of common stop words such as the, is, that, a. This is because these stop words have been removed from the search index so do not count towards a match. Albums such as 'is this it?' by the Strokes have a real problem because they contain ONLY stop words.

I suggested a fix some time ago which didn't get acted on. I've now implemented the fix successfully on a pure Java development server, with results at http://www.jthink.net/jaikoz/scratch/isthatitsearch.jpg. I need to reimplement it in the existing code base, then hopefully this will prove the fix and get it rolled out.
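The effect is easy to reproduce outside Lucene. The sketch below uses a plain tokenizer and an illustrative stop word list (an assumption, the list used by the real index may differ) to show that nothing is left of 'is this it?' once stop words are removed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of why a title made only of stop words cannot be matched. */
final class StopWordSketch {

    // Illustrative list only; not taken from the Musicbrainz search index.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "that", "a", "this", "it"));

    /** Lower-cases and splits the text into words, optionally dropping stop words. */
    static List<String> tokens(String text, boolean removeStopWords) {
        List<String> result = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (removeStopWords && STOP_WORDS.contains(token)) {
                continue;
            }
            result.add(token);
        }
        return result;
    }
}
```

With stop word removal switched on, tokens("Is This It?", true) is empty, so there is nothing to match on. With it switched off all three words survive.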

Jaikoz 2.9 released

The Export feature has been added, and I think this could be very useful for some of you power users. Remember you can use it for:
backing up your metadata
editing of metadata within a spreadsheet
moving metadata from one file to another
sharing your song list with other applications or users

I await feedback on what you think of it and what uses you make of it.

I've also been trawling through the bugs list trying to solve some long standing bugs that may have been forgotten. I'm happy with the results and aim to have Jaikoz essentially bug free within a couple of releases. Of course there is still the ever increasing enhancements list....

Tuesday, 17 February 2009

Export Songs to Spreadsheet

I'm working on an Export feature which is simple and effective. It will allow you to export the details of your loaded Songs to a comma separated file. You can then use this file as an archive of your metadata AND you can also open and edit values within a proper spreadsheet application and then import the changes back into Jaikoz.

So it gives you tag backup and mega editing capabilities in one go, and you can also use the file created to share your songs list with friends or to create playlists.

There are a few decisions to be made yet though.
1. The export feature only works on the editable fields common to all formats, so fields not supported by Jaikoz, or only supported in the ID3 tabs view, are not exported.
2. Artwork is not exported. It's not appropriate to store large binary fields in this sort of file, but I know artwork is very important to people, so maybe it can be shoehorned into this feature somehow. Or should I just have a separate 'Export Artwork' that could export artwork either on a folder by folder basis, so it's kept with the files, or all lumped into a single folder?
3. The export only supports single instances of fields, so for example it would only export one genre per song.
4. Because it is not desirable to load all songs into Jaikoz in one go, if you select a file that already exists Jaikoz should append the new entries, but would need to overwrite an entry if it already existed for the same file.
5. The first column of the created file would be the full filename, so that Import can work by matching the filename with a file open in Jaikoz and then update accordingly.
6. You might have two versions of a song, a flac version and an mp3 version, and want to import metadata from the mp3 version to the flac file. You would just have to edit the filename in the csv file.
7. If when exporting you are replacing existing songs, should they be replaced in the same place, or afterwards? Would it be better to always sort the file alphabetically?
8. The data is encoded using utf8. This fully supports Unicode so all characters can be encoded, and it is also economical with memory - only one byte is used for ascii chars. The only problem is that it is only the default encoding on Linux, so it might not be the default choice when the csv file is opened with some applications. For example Open Office on Windows Vista assumes that the encoding is windows-1252, and you have to tell it to use utf8.
9. In the future I would also like to allow export to an xml format, but xml is not terribly useful for editing. This would also provide a solution for (3).
10. Could also create native spreadsheet formats such as .xls or .ods, which would be slightly more user friendly, but I don't think the extra effort involved is worth it at the moment.
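Several of the points above (filename as the first column, overwriting an existing entry for the same file, utf8 encoding) can be illustrated with a small sketch. It is an assumption of how such an export might be structured, not the Jaikoz implementation. In particular it replaces an existing entry in its original position, which is only one of the options raised in point 7.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a csv export keyed by filename. */
final class CsvExportSketch {

    // Keyed by full filename; LinkedHashMap keeps an overwritten entry in its original position.
    private final Map<String, String[]> rows = new LinkedHashMap<>();

    /** Adds a song, or overwrites the existing entry for the same file. */
    void put(String filename, String... fields) {
        rows.put(filename, fields);
    }

    /** Quotes a value if it contains a comma, a quote or a newline. */
    static String escape(String value) {
        if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    /** One line per song, with the filename as the first column. */
    String toCsv() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String[]> row : rows.entrySet()) {
            sb.append(escape(row.getKey()));
            for (String field : row.getValue()) {
                sb.append(',').append(escape(field));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    /** The bytes that would be written to disk, encoded as utf8. */
    byte[] toUtf8Bytes() {
        return toCsv().getBytes(StandardCharsets.UTF_8);
    }
}
```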

Bugs, Enhancements or Testing

Jaikoz 2.8.4 is now out with a few enhancements and a host of bug fixes, some for bugs introduced in earlier versions, and therein lies the rub. I've been concentrating on bug fixes and small enhancements recently rather than new features, but the regressions came about due to not enough testing.

It is very difficult to get the balance right. I have automated tests that cover the reading and writing of metadata, but no automated tests for the Jaikoz GUI itself.

So do I spend my time writing more tests, fixing problems or adding new features?

I think the correct answer is to continue with all three, and ensure I do beta releases of all major releases, but has anyone else got any other views?

Monday, 16 February 2009

Opening Playlist Contents

In Jaikoz 2.8.2 I added support for opening Winamp Playlists, the most widely accepted playlist format. But I knew that what many of you really wanted is to drag and drop iTunes playlists, well you can - nearly.

The difficulties with iTunes playlists are twofold. Firstly, there is no playlist file, so you have to communicate with iTunes instead of just reading the file. Secondly, talking to iTunes is very different on Windows than on OSX, so the work effort doubles.

I now have dragging playlists from iTunes working beautifully on Windows and this will be available in Jaikoz 2.8.4 out today or tomorrow.

But OSX does not allow the playlist to be dragged anywhere except iTunes itself (please, anyone, correct me if I'm wrong), so it is impossible to even initiate a playlist drop. Instead I'll be creating some scripts for the iTunes Script Menu to allow you to send playlists to Jaikoz.