Tuesday, July 14, 2009

3 stages of comparison and 4 different algorithms (project status)

The big picture: when the document comparison function is called it performs 3 main steps:

1) It compares paragraphs and finds the ones that are identical in both documents. The current algorithm for this is quite good.

2) Then we deal with the paragraphs that are not matched (in between two sequences of matched paragraphs). So we have to match them in some way and compare them by character. Currently, they are just compared in order, which is highly inefficient and in some cases leads to auful results. I changed this to lcs of rsids (see previous post(s)). However, we can't count on always having rsids so another approach is necessary. Right now I'm thinking of running a "faster common sequence" algorithm on the new and old paragraphs in which the "==" is not only equivalent to "equal rsids" but to "some degree of character similarity", which in turn will be found by another(or the same) algorithm run on the characters.

Here the "faster common sequence" algorithms will be one (or two) of the following:
- The algorithm described in Heckel's article "A Technique for Isolating Differences Between Files" - linear
- A divide and conquer algorithm that is linearithmic in the worst case and linear in the average case

3) Finally, character lcs is run on the paragraphs matched in step 2.

I have almost completed writing up the algorithms and integrating them in the comparison function so now what remains is testing...lots of it... It turns out it is not that easy to find good tests on the internet... so writing essays for hum classes wasn't completely worthless after all.

When I have some nice test results (hopefully soon) I will publish them here.


  1. you have a nice site. thanks for sharing this site. you can download lots of ebook from here