Summer of Code is nearing its end, so is my project.
After the improvements of the algorithms, comparison needed a bit of user interface. This was mainly prompted by the dilemma of what the basic unit of comparison should be - word or char. In the end, we decided to give the user a choice between the two and an auto mode (chosen by default). The user can also turn on/off the rsid functionality, as well as tune the output of the comparison by ignoring matched pieces of a given length.
Here is a brief documentation of the project and the latest patch.
Friday, August 14, 2009
Saturday, July 18, 2009
A bit of statistics
In the previous post I showed why having a measurment of similarity of strings that is both accurate and easy to compute fast is very important - without it the second step would fail and we wouldn't be able to match anything but paragraphs that haven't changed at all.
The following ratios measure to some degree the similiary between two paragraphs:
1) average length of the two paragraphs / length of lcs
2) average length / length of longest common substring
3) squared average length/ sum of squares of common substrings
Choosing the best among them as well as a reference constant required a bit of statistics which I, unable to find it online, did on my own.
I used paragraphs of various lengths (6 to 2268 characters), and various edit distances (0.29% to 56% contents changed - between couples of paragraphs meant to be matched)
I computed the above ratios for each pair of paragraphs (blue are the pairs of similar paragraphs, red are the different ones)
Here is the data; the second row histograms are zoomed-in views only including the leftmost 8% of the different paragraphs.
1) if we choose border value 2:
23% of the similar paragraphs are less than 25% under the limit
41% of the different paragraphs are less than 25% above the limit
2) if we choose border value 8:
8% of the similar paragraphs are less than 25% under the limit
1% of the different paragraphs are less than 25% above the limit
3) if we choose border value 30:
0% of the similar paragraphs are less than 25% under the limit
0% of the different paragraphs are less than 25% above the limit
The conclusion is:
1) inaccurate and slower
2) quite accurate and fastest
3) most accurate but slower
So a good trade off between time and accuracy is to use 2) for step 2 and 3) for step 3 (see previous post for the "steps")
The following ratios measure to some degree the similiary between two paragraphs:
1) average length of the two paragraphs / length of lcs
2) average length / length of longest common substring
3) squared average length/ sum of squares of common substrings
Choosing the best among them as well as a reference constant required a bit of statistics which I, unable to find it online, did on my own.
I used paragraphs of various lengths (6 to 2268 characters), and various edit distances (0.29% to 56% contents changed - between couples of paragraphs meant to be matched)
I computed the above ratios for each pair of paragraphs (blue are the pairs of similar paragraphs, red are the different ones)
Here is the data; the second row histograms are zoomed-in views only including the leftmost 8% of the different paragraphs.
1) if we choose border value 2:
23% of the similar paragraphs are less than 25% under the limit
41% of the different paragraphs are less than 25% above the limit
2) if we choose border value 8:
8% of the similar paragraphs are less than 25% under the limit
1% of the different paragraphs are less than 25% above the limit
3) if we choose border value 30:
0% of the similar paragraphs are less than 25% under the limit
0% of the different paragraphs are less than 25% above the limit
The conclusion is:
1) inaccurate and slower
2) quite accurate and fastest
3) most accurate but slower
So a good trade off between time and accuracy is to use 2) for step 2 and 3) for step 3 (see previous post for the "steps")
Tuesday, July 14, 2009
3 stages of comparison and 4 different algorithms (project status)
The big picture: when the document comparison function is called it performs 3 main steps:
1) It compares paragraphs and finds the ones that are identical in both documents. The current algorithm for this is quite good.
2) Then we deal with the paragraphs that are not matched (in between two sequences of matched paragraphs). So we have to match them in some way and compare them by character. Currently, they are just compared in order, which is highly inefficient and in some cases leads to auful results. I changed this to lcs of rsids (see previous post(s)). However, we can't count on always having rsids so another approach is necessary. Right now I'm thinking of running a "faster common sequence" algorithm on the new and old paragraphs in which the "==" is not only equivalent to "equal rsids" but to "some degree of character similarity", which in turn will be found by another(or the same) algorithm run on the characters.
Here the "faster common sequence" algorithms will be one (or two) of the following:
- The algorithm described in Heckel's article "A Technique for Isolating Differences Between Files" - linear
- A divide and conquer algorithm that is linearithmic in the worst case and linear in the average case
3) Finally, character lcs is run on the paragraphs matched in step 2.
I have almost completed writing up the algorithms and integrating them in the comparison function so now what remains is testing...lots of it... It turns out it is not that easy to find good tests on the internet... so writing essays for hum classes wasn't completely worthless after all.
When I have some nice test results (hopefully soon) I will publish them here.
1) It compares paragraphs and finds the ones that are identical in both documents. The current algorithm for this is quite good.
2) Then we deal with the paragraphs that are not matched (in between two sequences of matched paragraphs). So we have to match them in some way and compare them by character. Currently, they are just compared in order, which is highly inefficient and in some cases leads to auful results. I changed this to lcs of rsids (see previous post(s)). However, we can't count on always having rsids so another approach is necessary. Right now I'm thinking of running a "faster common sequence" algorithm on the new and old paragraphs in which the "==" is not only equivalent to "equal rsids" but to "some degree of character similarity", which in turn will be found by another(or the same) algorithm run on the characters.
Here the "faster common sequence" algorithms will be one (or two) of the following:
- The algorithm described in Heckel's article "A Technique for Isolating Differences Between Files" - linear
- A divide and conquer algorithm that is linearithmic in the worst case and linear in the average case
3) Finally, character lcs is run on the paragraphs matched in step 2.
I have almost completed writing up the algorithms and integrating them in the comparison function so now what remains is testing...lots of it... It turns out it is not that easy to find good tests on the internet... so writing essays for hum classes wasn't completely worthless after all.
When I have some nice test results (hopefully soon) I will publish them here.
Friday, July 3, 2009
Comparing paragraphs (lines)
Introducing the longest common subsequence in the ChangesInLine() method (which compares chars to find similarities between two paragraphs) was certainly an improvement to the document comparison function but there is still the question about comparing lines(paragraphs). The current implementation is pretty good at finding the lines that are identical in the two documents (the unchanged lines). However, it's not so good at matching the rest of the lines (the changed lines) which includes finding which lines were entirely inserted or deleted, and which were changed only a bit (then they are compared by char with ChangesInLine()). This's where the rsid is very helpful.
The idea is to use some comparison algorithm, such as lcs, on the paragraph rsids of two arrays of lines(paragraphs) that are inserted/deleted/changed. The ones that fall in the lcs will then be changed and we call ChangesInLine(). The rest are simply inserted or deleted. This happes in the CheckForChangesInLine() method. Currently, this method calls ChangesInLine() for the first lines in each array, then for the second lines and so on, and stops when two lines are completely different. Thus, if someone makes the following list of names:
and then someone adds three more names as well as family names:
the result of comparing the new document with the old is:
On the other hand, the improved lcs&rsid comparison gives:
The idea is to use some comparison algorithm, such as lcs, on the paragraph rsids of two arrays of lines(paragraphs) that are inserted/deleted/changed. The ones that fall in the lcs will then be changed and we call ChangesInLine(). The rest are simply inserted or deleted. This happes in the CheckForChangesInLine() method. Currently, this method calls ChangesInLine() for the first lines in each array, then for the second lines and so on, and stops when two lines are completely different. Thus, if someone makes the following list of names:
Theresa
Bernice
Kristin
Elsie
Lois
and then someone adds three more names as well as family names:
Theresa Maldonado
Bernice Cervantes
Michelle Rosario
Kristin Wiley
Elsie Howell
Amber Weiss
Laura Dickerson
Lois Burnett
the result of comparing the new document with the old is:
Theresa Maldonado
Bernice Cervantes
Michelle Rosario
Kristin Wiley
Elsie Howell
Amber Weiss
Laura Dickerson
Lois BurnettKristinElsieLois
On the other hand, the improved lcs&rsid comparison gives:
Theresa Maldonado
Bernice Cervantes
Michelle Rosario
Kristin Wiley
Elsie Howell
Amber Weiss
Laura Dickerson
Lois Burnett
Tuesday, June 30, 2009
Paragraph rsid?
Working on the comparison made me realize it would be good to also be able to compare paragraphs by rsid (independently of the rsid of text they contain) and I also noticed that the MS Word xml contains something like a paragraph rsid.
After a few days of exploring how the Ts and Ps in the ODF xml work, I wasn't able to set the rsid number independently on the paragraph and its text (some factors like automatic merging at saving and loading were preventing me from doing this, but I believe it is the actual purpose of the Ps and Ts that is making it impossible in the first place) So, the best option (for now) is to make a new rsid attribute and set it only on entire paragraphs:
The xml doesn't look (as beautifully) as I would want it to (if I manage to set the text rsid only on text) but it contains all the necessary information.
After a few days of exploring how the Ts and Ps in the ODF xml work, I wasn't able to set the rsid number independently on the paragraph and its text (some factors like automatic merging at saving and loading were preventing me from doing this, but I believe it is the actual purpose of the Ps and Ts that is making it impossible in the first place) So, the best option (for now) is to make a new rsid attribute and set it only on entire paragraphs:
[...]
<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="001a97ee" style:paragraph-rsid="001a97ee"/>
</style:style>
<style:style style:name="P2" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="001a97ee" style:paragraph-rsid="001b9628"/>
</style:style>
<style:style style:name="T1" style:family="text">
<style:text-properties style:rsid="001b9628"/>
</style:style>
</office:automatic-styles>
<office:body>
<office:text>
[...]
<text:p text:style-name="P1">1111111111111</text:p>
<text:p text:style-name="P2">
<text:span text:style-name="T1">22222222</text:span>
The xml doesn't look (as beautifully) as I would want it to (if I manage to set the text rsid only on text) but it contains all the necessary information.
Wednesday, June 24, 2009
continued
I use the rsid in a similar way when comparing chars in the ChangesInLine method, which looks for similarities between two lines (paragrphs) which are not identical. Currently it checks for identical chars in the beginning and end of the two lines, so if you have the text:
To deal with this I decided to use the longest common subsequence of the two paragraphs, which allows for more accurate tracking of changes/similarities. The above comparison now looks like that:
This is a paragraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, and does nothing in the 'middle'.and add "short" in the beginning, and delete "does" in the end, the result of comparing will be:
This is a short paragraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, andIf you just delete the first and last chars the function will say that there are no similarities, while if you have two completely different paragraphs with the same first letter it will not mark as them completely different as (i.e. ChangesInLine will return true ).paragraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, and doesnothing in the 'middle'.
To deal with this I decided to use the longest common subsequence of the two paragraphs, which allows for more accurate tracking of changes/similarities. The above comparison now looks like that:
This is a short paragraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, andIf there aren't enough similarities (the lcs will always contain some letters) I want the two paragraphs to be marked as different (i.e. ChangesInLine to return false) so I check if the length of the lcs is at least half of the length of the shorter paragraph or that the length of the longest continuous chunk of text in the lcs is at least ~10% of the length of the shorter paragraph.doesnothing in the 'middle'.
Tuesday, June 23, 2009
comparing with rsids
Now that the rsid is almost done (up to slight modifications and bugs fixing that will probably be necessary in the course of time) I turned to the document comparison itself.
The first more or less obvious use of the rsid is whenever two lines (text nodes) or two chars are compared. In addition to the condition that the two things being compared are the same I add the condition that their rsids are the same. This will ensure that the two things are not the same just by chance (which can easilly be the case when comparing chars or short and repetitive lines in the document) but because they were created in the same edit session.
Here is an example of how it goes in the CompareTxtNd function:
The rsid check is applicable only if the two documents originate from the same document i.e. if their root rsids are the same, which is easily checked with the handy getRsidRoot() method;
If the root rsids are not the same or the documents don't have rsids, the two rsids will be 0;
To get the rsids I made a method of SwTxtNode, GetRsid(start,end);
The first more or less obvious use of the rsid is whenever two lines (text nodes) or two chars are compared. In addition to the condition that the two things being compared are the same I add the condition that their rsids are the same. This will ensure that the two things are not the same just by chance (which can easilly be the case when comparing chars or short and repetitive lines in the document) but because they were created in the same edit session.
Here is an example of how it goes in the CompareTxtNd function:
A few things to note:
BOOL SwCompareLine::CompareTxtNd( const SwTxtNode& rDstNd,
const SwTxtNode& rSrcNd )
{
BOOL bRet = FALSE;
const SwDoc* pDestDoc = rDstNd.GetDoc();
const SwDoc* pSrcDoc = rSrcNd.GetDoc();
UINT32 nDstRsid = 0;
UINT32 nSrcRsid = 0;
// Only compare rsids if the rsidRoots are the same (docs have the same origin)
if ( pSrcDoc->getRsidRoot() == pDestDoc->getRsidRoot() )
{
nDstRsid = rDstNd.GetRsid( 0, rDstNd.Len() );
nSrcRsid = rSrcNd.GetRsid( 0, rSrcNd.Len() );
}
// erstmal ganz einfach!
if( rDstNd.GetTxt() == rSrcNd.GetTxt() && nDstRsid == nSrcRsid )
{
// der Text ist gleich, aber sind die "Sonderattribute" (0xFF) auch
// dieselben??
bRet = TRUE;
}
return bRet;
}
The rsid check is applicable only if the two documents originate from the same document i.e. if their root rsids are the same, which is easily checked with the handy getRsidRoot() method;
If the root rsids are not the same or the documents don't have rsids, the two rsids will be 0;
To get the rsids I made a method of SwTxtNode, GetRsid(start,end);
Subscribe to:
Posts (Atom)