Saturday, July 18, 2009

A bit of statistics

In the previous post I showed why it is important to have a measure of string similarity that is both accurate and fast to compute - without it the second step would fail and we wouldn't be able to match anything but paragraphs that haven't changed at all.

The following ratios measure, to some degree, the similarity between two paragraphs:

1) average length of the two paragraphs / length of lcs
2) average length / length of longest common substring
3) squared average length / sum of squared lengths of common substrings
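These ratios can be sketched in Python. One caveat: the post doesn't spell out how the "common substrings" of ratio 3 are enumerated, so as an assumption I take them to be the maximal runs of matches, contiguous in both strings, recovered from the lcs backtrace:

```python
def lcs_table(a, b):
    # Standard O(n*m) dynamic-programming table for the
    # longest common subsequence of strings a and b.
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = d[i-1][j-1] + 1 if a[i-1] == b[j-1] \
                      else max(d[i-1][j], d[i][j-1])
    return d

def lcs_length(a, b):
    return lcs_table(a, b)[len(a)][len(b)]

def longest_common_substring_length(a, b):
    # O(n*m) DP for the longest common substring, keeping one row.
    best, prev = 0, [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i-1] == b[j-1]:
                cur[j] = prev[j-1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def common_run_lengths(a, b):
    # Backtrace the lcs table and collect the maximal runs of
    # matches that are contiguous in BOTH strings (my assumed
    # reading of the "common substrings" in ratio 3).
    d = lcs_table(a, b)
    i, j, runs, cur = len(a), len(b), [], 0
    while i > 0 and j > 0:
        if a[i-1] == b[j-1]:
            cur += 1
            i, j = i - 1, j - 1
        else:
            if cur:
                runs.append(cur)
                cur = 0
            if d[i-1][j] >= d[i][j-1]:
                i -= 1
            else:
                j -= 1
    if cur:
        runs.append(cur)
    return runs

def ratios(a, b):
    # The three candidate similarity ratios; smaller means more
    # similar, float('inf') when there is nothing in common.
    avg = (len(a) + len(b)) / 2
    lcs = lcs_length(a, b)
    sub = longest_common_substring_length(a, b)
    sq = sum(r * r for r in common_run_lengths(a, b))
    inf = float('inf')
    return (avg / lcs if lcs else inf,
            avg / sub if sub else inf,
            avg * avg / sq if sq else inf)
```

For two identical paragraphs all three ratios come out as 1, and they grow as the paragraphs diverge.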

Choosing the best among them, as well as a reference constant (the border value below), required a bit of statistics which, unable to find anything suitable online, I did on my own.

I used paragraphs of various lengths (6 to 2268 characters) and various edit distances (0.29% to 56% of the contents changed, between pairs of paragraphs meant to be matched).

I computed the above ratios for each pair of paragraphs (blue are the pairs of similar paragraphs, red are the different ones).

Here is the data; the second row histograms are zoomed-in views only including the leftmost 8% of the different paragraphs.
1) if we choose border value 2:
23% of the similar paragraphs are less than 25% under the limit
41% of the different paragraphs are less than 25% above the limit

2) if we choose border value 8:
8% of the similar paragraphs are less than 25% under the limit
1% of the different paragraphs are less than 25% above the limit

3) if we choose border value 30:
0% of the similar paragraphs are less than 25% under the limit
0% of the different paragraphs are less than 25% above the limit

The conclusion is:

1) inaccurate and slower
2) quite accurate and the fastest
3) the most accurate but slower

So a good trade-off between time and accuracy is to use 2) for step 2 and 3) for step 3 (see the previous post for the "steps").
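As a sketch, the step-2 check with border value 8 for ratio 2 might look like this (the function name `paragraphs_match` and the guard for paragraphs with nothing in common are my own, not the production code):

```python
def longest_common_substring_length(a, b):
    # O(n*m) DP for the longest common substring, one row at a time.
    best, prev = 0, [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i-1] == b[j-1]:
                cur[j] = prev[j-1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

BORDER = 8  # border value for ratio 2, taken from the histograms above

def paragraphs_match(a, b):
    # Ratio 2 below the border => treat the paragraphs as a
    # changed pair, worth comparing character by character later.
    sub = longest_common_substring_length(a, b)
    return sub > 0 and (len(a) + len(b)) / 2 / sub < BORDER
```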

Tuesday, July 14, 2009

3 stages of comparison and 4 different algorithms (project status)

The big picture: when the document comparison function is called it performs 3 main steps:

1) It compares paragraphs and finds the ones that are identical in both documents. The current algorithm for this is quite good.

2) Then we deal with the paragraphs that were not matched (those in between two sequences of matched paragraphs). We have to match them in some way and compare them character by character. Currently they are just compared in order, which is highly inefficient and in some cases leads to awful results. I changed this to an lcs of rsids (see previous post(s)). However, we can't count on always having rsids, so another approach is necessary. Right now I'm thinking of running a "faster common sequence" algorithm on the new and old paragraphs in which "==" means not "equal rsids" but "some degree of character similarity", which in turn will be found by another (or the same) algorithm run on the characters.

Here the "faster common sequence" algorithms will be one (or two) of the following:
- The algorithm described in Heckel's article "A Technique for Isolating Differences Between Files" - linear
- A divide and conquer algorithm that is linearithmic in the worst case and linear in the average case
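A minimal sketch of the first option, Heckel's unique-anchor technique: lines that occur exactly once in each sequence serve as anchors, and matches are then extended to the neighbours of each anchor. (This is a simplification for illustration - the article's actual algorithm works in several passes over a symbol table rather than with the dictionaries used here.)

```python
from collections import Counter

def heckel_matches(old, new):
    # Map index in `new` -> index in `old` for matched lines.
    oc, nc = Counter(old), Counter(new)
    old_pos = {line: i for i, line in enumerate(old)}
    new_to_old = {}
    # Anchors: lines unique in both sequences.
    for j, line in enumerate(new):
        if oc[line] == 1 and nc[line] == 1:
            new_to_old[j] = old_pos[line]
    # Extend each anchor downward while neighbours still agree.
    for j in sorted(new_to_old):
        i = new_to_old[j]
        while (j + 1 < len(new) and i + 1 < len(old)
               and new[j+1] == old[i+1] and (j + 1) not in new_to_old):
            j, i = j + 1, i + 1
            new_to_old[j] = i
    # Extend each anchor upward the same way.
    for j in sorted(new_to_old, reverse=True):
        i = new_to_old[j]
        while (j - 1 >= 0 and i - 1 >= 0
               and new[j-1] == old[i-1] and (j - 1) not in new_to_old):
            j, i = j - 1, i - 1
            new_to_old[j] = i
    return new_to_old
```

Note how the extension step lets repeated lines (which can never be anchors themselves) ride along with a unique neighbour.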

3) Finally, character lcs is run on the paragraphs matched in step 2.
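The character lcs of step 3 could be sketched like this - the `(op, char)` output format is purely illustrative, not the Writer-internal representation:

```python
def char_diff(a, b):
    # Character-level diff of two matched paragraphs via lcs.
    # Returns a list of (op, char) pairs, op in {'=', '-', '+'}.
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = d[i-1][j-1] + 1 if a[i-1] == b[j-1] \
                      else max(d[i-1][j], d[i][j-1])
    # Backtrace, collecting edit operations in reverse order.
    ops, i, j = [], n, m
    while i > 0 and j > 0:
        if a[i-1] == b[j-1]:
            ops.append(('=', a[i-1])); i -= 1; j -= 1
        elif d[i-1][j] >= d[i][j-1]:
            ops.append(('-', a[i-1])); i -= 1
        else:
            ops.append(('+', b[j-1])); j -= 1
    while i > 0:
        ops.append(('-', a[i-1])); i -= 1
    while j > 0:
        ops.append(('+', b[j-1])); j -= 1
    return ops[::-1]
```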

I have almost finished writing the algorithms and integrating them into the comparison function, so what remains now is testing... lots of it... It turns out it is not that easy to find good tests on the internet... so writing essays for humanities classes wasn't completely worthless after all.

When I have some nice test results (hopefully soon) I will publish them here.

Friday, July 3, 2009

Comparing paragraphs (lines)

Introducing the longest common subsequence into the ChangesInLine() method (which compares characters to find similarities between two paragraphs) was certainly an improvement to the document comparison function, but there is still the question of comparing lines (paragraphs). The current implementation is pretty good at finding the lines that are identical in the two documents (the unchanged lines). However, it's not so good at matching the rest of the lines (the changed lines), which includes finding which lines were entirely inserted or deleted and which were changed only a bit (the latter are then compared character by character with ChangesInLine()). This is where the rsid is very helpful.

The idea is to use some comparison algorithm, such as lcs, on the paragraph rsids of two arrays of lines (paragraphs) that were inserted/deleted/changed. The ones that fall in the lcs are treated as changed, and we call ChangesInLine() on them; the rest are simply inserted or deleted. This happens in the CheckForChangesInLine() method. Currently, this method calls ChangesInLine() for the first lines in each array, then for the second lines, and so on, stopping when two lines are completely different. Thus, if someone makes the following list of names:

Theresa
Bernice
Kristin
Elsie
Lois

and then someone adds three more names as well as family names:

Theresa Maldonado
Bernice Cervantes
Michelle Rosario
Kristin Wiley
Elsie Howell
Amber Weiss
Laura Dickerson
Lois Burnett

the result of comparing the new document with the old is:

Theresa Maldonado
Bernice Cervantes
Michelle Rosario
Kristin Wiley
Elsie Howell
Amber Weiss
Laura Dickerson
Lois Burnett
Kristin
Elsie
Lois

On the other hand, the improved lcs&rsid comparison gives:

Theresa Maldonado
Bernice Cervantes
Michelle Rosario
Kristin Wiley
Elsie Howell
Amber Weiss
Laura Dickerson
Lois Burnett
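The rsid-driven matching above can be sketched as follows - assuming, for illustration only, that every paragraph carries a single rsid that survives edits (real rsids mark revision sessions, so this data layout is a simplification, not the Writer-internal one):

```python
def match_by_rsid(old, new):
    # old and new are lists of (rsid, text) pairs. An lcs over the
    # rsid sequences decides which paragraphs are "changed" (to be
    # compared char by char later) and which are plain
    # insertions/deletions.
    a = [r for r, _ in old]
    b = [r for r, _ in new]
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = d[i-1][j-1] + 1 if a[i-1] == b[j-1] \
                      else max(d[i-1][j], d[i][j-1])
    # Backtrace: pairs in the lcs are changed, the rest are
    # deleted (old only) or inserted (new only).
    changed, deleted, inserted = [], [], []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i-1] == b[j-1]:
            changed.append((i - 1, j - 1)); i -= 1; j -= 1
        elif d[i-1][j] >= d[i][j-1]:
            deleted.append(i - 1); i -= 1
        else:
            inserted.append(j - 1); j -= 1
    deleted += range(i - 1, -1, -1)
    inserted += range(j - 1, -1, -1)
    return changed[::-1], deleted[::-1], inserted[::-1]
```

On the name-list example above, the five original paragraphs keep their rsids through the edits, so they pair up with their new versions and only the three genuinely new names come out as insertions - exactly the improved result shown.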