Google Summer of Code: go-oo Compare: 2009

Friday, August 14, 2009

Summer of Code is nearing its end, so is my project.

After the improvements of the algorithms, comparison needed a bit of user interface. This was mainly prompted by the dilemma of what the basic unit of comparison should be - word or char. In the end, we decided to give the user a choice between the two and an auto mode (chosen by default). The user can also turn on/off the rsid functionality, as well as tune the output of the comparison by ignoring matched pieces of a given length.

Here is a brief documentation of the project and the latest patch.

Saturday, July 18, 2009

A bit of statistics

In the previous post I showed why having a measurment of similarity of strings that is both accurate and easy to compute fast is very important - without it the second step would fail and we wouldn't be able to match anything but paragraphs that haven't changed at all.

The following ratios measure to some degree the similiary between two paragraphs:

1) average length of the two paragraphs / length of lcs
2) average length / length of longest common substring
3) squared average length/ sum of squares of common substrings

Choosing the best among them as well as a reference constant required a bit of statistics which I, unable to find it online, did on my own.

I used paragraphs of various lengths (6 to 2268 characters), and various edit distances (0.29% to 56% contents changed - between couples of paragraphs meant to be matched)

I computed the above ratios for each pair of paragraphs (blue are the pairs of similar paragraphs, red are the different ones)

Here is the data; the second row histograms are zoomed-in views only including the leftmost 8% of the different paragraphs.

1) if we choose border value 2:
23% of the similar paragraphs are less than 25% under the limit
41% of the different paragraphs are less than 25% above the limit

2) if we choose border value 8:
8% of the similar paragraphs are less than 25% under the limit
1% of the different paragraphs are less than 25% above the limit

3) if we choose border value 30:
0% of the similar paragraphs are less than 25% under the limit
0% of the different paragraphs are less than 25% above the limit

The conclusion is:

1) inaccurate and slower
2) quite accurate and fastest
3) most accurate but slower

So a good trade off between time and accuracy is to use 2) for step 2 and 3) for step 3 (see previous post for the "steps")

Tuesday, July 14, 2009

3 stages of comparison and 4 different algorithms (project status)

The big picture: when the document comparison function is called it performs 3 main steps:

1) It compares paragraphs and finds the ones that are identical in both documents. The current algorithm for this is quite good.

2) Then we deal with the paragraphs that are not matched (in between two sequences of matched paragraphs). So we have to match them in some way and compare them by character. Currently, they are just compared in order, which is highly inefficient and in some cases leads to auful results. I changed this to lcs of rsids (see previous post(s)). However, we can't count on always having rsids so another approach is necessary. Right now I'm thinking of running a "faster common sequence" algorithm on the new and old paragraphs in which the "==" is not only equivalent to "equal rsids" but to "some degree of character similarity", which in turn will be found by another(or the same) algorithm run on the characters.

Here the "faster common sequence" algorithms will be one (or two) of the following:
- The algorithm described in Heckel's article "A Technique for Isolating Differences Between Files" - linear
- A divide and conquer algorithm that is linearithmic in the worst case and linear in the average case

3) Finally, character lcs is run on the paragraphs matched in step 2.

I have almost completed writing up the algorithms and integrating them in the comparison function so now what remains is testing...lots of it... It turns out it is not that easy to find good tests on the internet... so writing essays for hum classes wasn't completely worthless after all.

When I have some nice test results (hopefully soon) I will publish them here.

Friday, July 3, 2009

Comparing paragraphs (lines)

Introducing the longest common subsequence in the ChangesInLine() method (which compares chars to find similarities between two paragraphs) was certainly an improvement to the document comparison function but there is still the question about comparing lines(paragraphs). The current implementation is pretty good at finding the lines that are identical in the two documents (the unchanged lines). However, it's not so good at matching the rest of the lines (the changed lines) which includes finding which lines were entirely inserted or deleted, and which were changed only a bit (then they are compared by char with ChangesInLine()). This's where the rsid is very helpful.

The idea is to use some comparison algorithm, such as lcs, on the paragraph rsids of two arrays of lines(paragraphs) that are inserted/deleted/changed. The ones that fall in the lcs will then be changed and we call ChangesInLine(). The rest are simply inserted or deleted. This happes in the CheckForChangesInLine() method. Currently, this method calls ChangesInLine() for the first lines in each array, then for the second lines and so on, and stops when two lines are completely different. Thus, if someone makes the following list of names:

Theresa
Bernice
Kristin
Elsie
Lois

and then someone adds three more names as well as family names:

Theresa Maldonado
Bernice Cervantes
Michelle Rosario
Kristin Wiley
Elsie Howell
Amber Weiss
Laura Dickerson
Lois Burnett

the result of comparing the new document with the old is:

Theresa Maldonado
Bernice Cervantes
Michelle Rosario
Kristin Wiley
Elsie Howell
Amber Weiss
Laura Dickerson
Lois Burnett
~~Kristin~~
~~Elsie~~
~~Lois~~

On the other hand, the improved lcs&rsid comparison gives:

Theresa Maldonado
Bernice Cervantes
Michelle Rosario
Kristin Wiley
Elsie Howell
Amber Weiss
Laura Dickerson
Lois Burnett

Tuesday, June 30, 2009

Paragraph rsid?

Working on the comparison made me realize it would be good to also be able to compare paragraphs by rsid (independently of the rsid of text they contain) and I also noticed that the MS Word xml contains something like a paragraph rsid.

After a few days of exploring how the Ts and Ps in the ODF xml work, I wasn't able to set the rsid number independently on the paragraph and its text (some factors like automatic merging at saving and loading were preventing me from doing this, but I believe it is the actual purpose of the Ps and Ts that is making it impossible in the first place) So, the best option (for now) is to make a new rsid attribute and set it only on entire paragraphs:


[...]
<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="001a97ee" style:paragraph-rsid="001a97ee"/>
</style:style>
<style:style style:name="P2" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="001a97ee" style:paragraph-rsid="001b9628"/>
</style:style>
<style:style style:name="T1" style:family="text">
<style:text-properties style:rsid="001b9628"/>
</style:style>
</office:automatic-styles>
<office:body>
<office:text>
[...]
<text:p text:style-name="P1">1111111111111</text:p>
<text:p text:style-name="P2">
<text:span text:style-name="T1">22222222</text:span>

The xml doesn't look (as beautifully) as I would want it to (if I manage to set the text rsid only on text) but it contains all the necessary information.

Wednesday, June 24, 2009

continued

I use the rsid in a similar way when comparing chars in the ChangesInLine method, which looks for similarities between two lines (paragrphs) which are not identical. Currently it checks for identical chars in the beginning and end of the two lines, so if you have the text:

This is a paragraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, and does nothing in the 'middle'.

and add "short" in the beginning, and delete "does" in the end, the result of comparing will be:

This is a short paragraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, andparagraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, and does nothing in the 'middle'.

If you just delete the first and last chars the function will say that there are no similarities, while if you have two completely different paragraphs with the same first letter it will not mark as them completely different as (i.e. ChangesInLine will return true ).

To deal with this I decided to use the longest common subsequence of the two paragraphs, which allows for more accurate tracking of changes/similarities. The above comparison now looks like that:

This is a short paragraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, and ~~does~~ nothing in the 'middle'.

If there aren't enough similarities (the lcs will always contain some letters) I want the two paragraphs to be marked as different (i.e. ChangesInLine to return false) so I check if the length of the lcs is at least half of the length of the shorter paragraph or that the length of the longest continuous chunk of text in the lcs is at least ~10% of the length of the shorter paragraph.

Tuesday, June 23, 2009

comparing with rsids

Now that the rsid is almost done (up to slight modifications and bugs fixing that will probably be necessary in the course of time) I turned to the document comparison itself.

The first more or less obvious use of the rsid is whenever two lines (text nodes) or two chars are compared. In addition to the condition that the two things being compared are the same I add the condition that their rsids are the same. This will ensure that the two things are not the same just by chance (which can easilly be the case when comparing chars or short and repetitive lines in the document) but because they were created in the same edit session.

Here is an example of how it goes in the CompareTxtNd function:


BOOL SwCompareLine::CompareTxtNd( const SwTxtNode& rDstNd,
        const SwTxtNode& rSrcNd )
{
BOOL bRet = FALSE;

const SwDoc* pDestDoc = rDstNd.GetDoc();
const SwDoc* pSrcDoc = rSrcNd.GetDoc();

UINT32 nDstRsid = 0;
UINT32 nSrcRsid = 0;

// Only compare rsids if the rsidRoots are the same (docs have the same origin)
if ( pSrcDoc->getRsidRoot() == pDestDoc->getRsidRoot() )
{
 nDstRsid = rDstNd.GetRsid( 0, rDstNd.Len() );
 nSrcRsid = rSrcNd.GetRsid( 0, rSrcNd.Len() );
}

// erstmal ganz einfach!
if( rDstNd.GetTxt() == rSrcNd.GetTxt() && nDstRsid == nSrcRsid )
{
// der Text ist gleich, aber sind die "Sonderattribute" (0xFF) auch
// dieselben??
bRet = TRUE;
}
return bRet;
}

A few things to note:
The rsid check is applicable only if the two documents originate from the same document i.e. if their root rsids are the same, which is easily checked with the handy getRsidRoot() method;
If the root rsids are not the same or the documents don't have rsids, the two rsids will be 0;
To get the rsids I made a method of SwTxtNode, GetRsid(start,end);

Thursday, June 18, 2009

More rsid

The main functionality of rsid is now ready.You can download a patch with the current changes here.

I had to handle the various ways text can be inserted in the document since every time different functions are called (e.g. when just writing text, pasting it from the same document, pasting from an external source and splitting a paragraph in two.).

Rsids are now being saved as 32 bit hexadecimal numbers in the content.xml of the document, and, following the OOXML standard, each rsid is greater than the previous (the one generated in the previous session). The session rsid as well as the root rsid (the one generated when the original document was created) are saved as document settings in the settings.xml (they are not yet in hex).

Here is an example of how it works:

I created a document with the following text:

This is the original text of a document that has three paragraphs.
Here is the second paragraph of the original text.
Here is the third paragraph of the original text.

And this is the content.xml:


[...]
<office:automatic-styles>
<style:style style:name="T1" style:family="text">
<style:text-properties style:rsid="000134cc"/>
</style:style>
</office:automatic-styles>
<office:body>
<office:text>
[...]
<text:p text:style-name="Standard">
<text:span text:style-name="T1">
This is the original text of a document that has three paragraphs.
</text:span>
</text:p>
<text:p text:style-name="Standard">
<text:span text:style-name="T1">Here is the second paragraph of the original text.</text:span>
</text:p>
<text:p text:style-name="Standard">
<text:span text:style-name="T1">Here is the third paragraph of the original text.</text:span>
</text:p>
</office:text>

in the settings.xml I have the following two lines:

<config:config-item config:name="Rsid" config:type="int">79052</config:config-item>
<config:config-item config:name="RsidRoot" config:type="int">79052</config:config-item>

Then I split the third paragraph:

This is the original text of a document that has three paragraphs.
Here is the second paragraph and some text from the second session of the original text.
Here is the third paragraph
of the original text.


[...]
<office:automatic-styles>
<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="000134cc"/>
</style:style>
<style:style style:name="T1" style:family="text">
<style:text-properties style:rsid="00202baa"/>
</style:style>
</office:automatic-styles>
<office:body>
<office:text>
[...]
<text:p text:style-name="P1">
This is the original text of a document that has three paragraphs.
</text:p>
<text:p text:style-name="P1">
Here is the second paragraph
<text:span text:style-name="T1">and some text from the second session</text:span>
of the original text.
</text:p>
<text:p text:style-name="P1">Here is the third paragraph </text:p>
<text:p text:style-name="P1">
<text:span text:style-name="T1">of the original text.</text:span>
</text:p>
</office:text>

and in the settings:

<config:config-item config:name="Rsid" config:type="int">2108330</config:config-item>
<config:config-item config:name="RsidRoot" config:type="int">79052</config:config-item>

Finally, I copied some text from the first two paragraphs and pasted it in the middle of the third:

This is the original text of a document that has three paragraphs.
Here is the second paragraph and some text from the second session of the original text.
Here is the original text of a document that has three paragraphs.
Here is the third paragraph
of the original text.

[...]
<office:automatic-styles>
<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="000134cc"/>
</style:style>
<style:style style:name="P2" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="00202baa"/>
</style:style>
<style:style style:name="P3" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="000134cc"/>
</style:style>
<style:style style:name="T1" style:family="text">
<style:text-properties style:rsid="00202baa"/>
</style:style>
<style:style style:name="T2" style:family="text">
<style:text-properties style:rsid="003eb852"/>
</style:style>
</office:automatic-styles>
<office:body>
<office:text>
[...]
<text:p text:style-name="P1">
This is the original text of a document that has three paragraphs.
</text:p>
<text:p text:style-name="P1">
Here is the second paragraph
<text:span text:style-name="T1">and some text from the second session</text:span>
of the original text.
</text:p>
<text:p text:style-name="P1">
Here is the
<text:span text:style-name="T2">
original text of a document that has three paragraphs.
</text:span>
</text:p>
<text:p text:style-name="P1">
<text:span text:style-name="T2">Here is the</text:span>
third paragraph
</text:p>
<text:p text:style-name="P2">of the original text.</text:p>
</office:text>

and

<config:config-item config:name="Rsid" config:type="int">4110418</config:config-item>
<config:config-item config:name="RsidRoot" config:type="int">79052</config:config-item>

Monday, June 15, 2009

Saving works!

I managed to save the new rsid attribute in ODF i.e. now, I am able to see the rsid values in the content.xml of my .odt files. It is now saved as a signed decimal style but this will be revised soon. Here is a simple example:


<office:document-content office:version="1.2">
<office:scripts/>

<office:font-face-decls>
<style:font-face style:name="Liberation Serif" svg:font-family="'Liberation Serif'" style:font-family-generic="roman" style:font-pitch="variable"/>
<style:font-face style:name="Liberation Sans" svg:font-family="'Liberation Sans'" style:font-family-generic="swiss" style:font-pitch="variable"/>
<style:font-face style:name="Arial" svg:font-family="Arial" style:font-family-generic="system" style:font-pitch="variable"/>
</office:font-face-decls>

<office:automatic-styles>

<style:style style:name="T1" style:family="text">
<style:text-properties style:rsid="-965818964"/>
</style:style>
</office:automatic-styles>

<office:body>

<office:texture-mode>

<text:sequence-decls>
<text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
<text:sequence-decl text:display-outline-level="0" text:name="Table"/>
<text:sequence-decl text:display-outline-level="0" text:name="Text"/>
<text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
</text:sequence-decls>

<text:p text:style-name="Standard">
<text:span text:style-name="T1">Original text.</text:span>
</text:p>
</office:texture-mode>
</office:body>
</office:document-content>

And after editing the file:


<office:document-content office:version="1.2">
<office:scripts/>

<office:font-face-decls>
<style:font-face style:name="Liberation Serif" svg:font-family="'Liberation Serif'" style:font-family-generic="roman" style:font-pitch="variable"/>
<style:font-face style:name="Arial" svg:font-family="Arial" style:font-family-generic="system" style:font-pitch="variable"/>
</office:font-face-decls>

<office:automatic-styles>

<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="-965818964"/>
</style:style>

<style:style style:name="T1" style:family="text">
<style:text-properties style:rsid="2141787107"/>
</style:style>
</office:automatic-styles>

<office:body>

<office:texture-mode>

<text:sequence-decls>
<text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
<text:sequence-decl text:display-outline-level="0" text:name="Table"/>
<text:sequence-decl text:display-outline-level="0" text:name="Text"/>
<text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
</text:sequence-decls>
<text:p text:style-name="P1">Original text.</text:p>

<text:p text:style-name="P1">
<text:span text:style-name="T1">Added text.</text:span>
</text:p>
</office:texture-mode>
</office:body>
</office:document-content>

However, in most of the cases, pretty strange behavior occurs, so now, I'm trying to fix that. I suspect the problems come from the calls to Insert() when an existing document is opened and from having multiple rsid values for one chunk of text.

Saturday, June 13, 2009

Hi all! It's been a few weeks of GSOC and here is what I've been doing.

I spent some time exploring how the current document comparison function works as well as its implementation. It's about a pretty straight forward algorithm seemingly working well in some cases but being not quite satisfactory in other. My initial intention was to deal directly with the algorithm but I was suggested a rather different approach using the rsid (revision save id) functionallity.

Rsid is a unique number attached to each edit session of a document. It helps tracking changes and merging documents and will give an opportunity for a different comparison algorithm. (more info about rsid)

Viewing the rsid as a char attribute (e.g. "bold") I implemented an SvxRsidItem class (like SvxWeightItem for "bold" ) which extends the SfxUInt32Item.

Every time a document is opened/created, the SwDoc constructor initializes the session number (I added a member of the class to store that - nRsid) to a random number. The idea is to use that number every time the content of the document is changed. Currently, I do that in the SwDoc::Insert() method which is called every time new text is inserted. However, this method is called with the last character of every paragraph when a document is opened, which is a bit of a problem but I hope I will resolve that soon. Also, as far as I noticed, it isn't called when text is copy/pasted from the same document...

Currently I am working on saving the rsids in a an ODF document.

Wednesday, May 27, 2009

Welcome!

This is my Google Summer of Code 2009 blog. I will be working on a project to improve Open Office Writer's document comparison function and I will be sharing my progress, thoughts, other here.

Google Summer of Code: go-oo Compare

Followers

Blog Archive

About Me