Tuesday, June 30, 2009

Paragraph rsid?

Working on the comparison made me realize that it would be good to also compare paragraphs by rsid, independently of the rsid of the text they contain. I also noticed that the MS Word XML contains something like a paragraph rsid.

After a few days of exploring how the P and T styles in the ODF XML work, I wasn't able to set the rsid independently on a paragraph and on its text. Factors such as the automatic merging at saving and loading got in the way, but I believe it is the actual purpose of the P and T styles that makes this impossible in the first place. So the best option (for now) is to add a new paragraph-rsid attribute and set it only on entire paragraphs:

[...]
<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="001a97ee" style:paragraph-rsid="001a97ee"/>
</style:style>
<style:style style:name="P2" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="001a97ee" style:paragraph-rsid="001b9628"/>
</style:style>
<style:style style:name="T1" style:family="text">
<style:text-properties style:rsid="001b9628"/>
</style:style>
</office:automatic-styles>
<office:body>
<office:text>
[...]
<text:p text:style-name="P1">1111111111111</text:p>
<text:p text:style-name="P2">
<text:span text:style-name="T1">22222222</text:span>

The XML doesn't look as nice as I would like (it would if I managed to set the text rsid only on text), but it contains all the necessary information.

Wednesday, June 24, 2009

continued

I use the rsid in a similar way when comparing chars in the ChangesInLine method, which looks for similarities between two lines (paragraphs) that are not identical. Currently it checks for identical chars at the beginning and end of the two lines, so if you have the text:

This is a paragraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, and does nothing in the 'middle'.
and add "short" near the beginning and delete "does" near the end, the result of the comparison will be (inserted text is marked [+like this+], deleted text [-like this-]):

This is a [+short paragraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, and+][-paragraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, and does-] nothing in the 'middle'.
If you just delete the first and last chars, the function will say that there are no similarities, while if you have two completely different paragraphs with the same first letter, it will not mark them as completely different (i.e. ChangesInLine will return true).
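
To make the limitation concrete, here is a minimal standalone sketch of a prefix/suffix check. It is not the actual SwCompareLine code; the helper name commonEnds is made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper (not the real ChangesInLine implementation):
// returns the lengths of the common prefix and the common suffix of
// two strings, making sure the two regions never overlap.
std::pair<std::size_t, std::size_t> commonEnds(const std::string& a,
                                               const std::string& b)
{
    const std::size_t limit = std::min(a.size(), b.size());

    // Count equal characters from the front.
    std::size_t prefix = 0;
    while (prefix < limit && a[prefix] == b[prefix])
        ++prefix;

    // Count equal characters from the back, stopping before the prefix.
    std::size_t suffix = 0;
    while (suffix < limit - prefix &&
           a[a.size() - 1 - suffix] == b[b.size() - 1 - suffix])
        ++suffix;

    return {prefix, suffix};
}
```

With "xhellox" and "hello" both lengths are 0 although the strings share five characters in the middle, and with "apple" and "avocado" the prefix length is 1 although the words are otherwise unrelated.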

To deal with this I decided to use the longest common subsequence (LCS) of the two paragraphs, which allows for more accurate tracking of changes and similarities. The above comparison now looks like this (insertions marked [+like this+], deletions [-like this-]):

This is a [+short +]paragraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, and [-does -]nothing in the 'middle'.
If there aren't enough similarities (the LCS will almost always contain some letters), I want the two paragraphs to be marked as different (i.e. ChangesInLine should return false). So I check whether the length of the LCS is at least half the length of the shorter paragraph, or whether the longest continuous chunk of text in the LCS is at least ~10% of the length of the shorter paragraph.
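
The threshold check can be sketched roughly as follows. This is a standalone illustration, not the code from the patch: the names LcsStats, lcsStats and similarEnough are invented here, the 10% bound is taken literally, and the "longest chunk" is measured on the one LCS that the walk happens to pick.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct LcsStats {
    std::size_t length = 0;      // total length of the LCS
    std::size_t longestRun = 0;  // longest LCS chunk contiguous in both strings
};

LcsStats lcsStats(const std::string& a, const std::string& b)
{
    const std::size_t n = a.size(), m = b.size();
    // table[i][j] = length of the LCS of the suffixes a[i..] and b[j..]
    std::vector<std::vector<std::size_t>> table(
        n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = n; i-- > 0;)
        for (std::size_t j = m; j-- > 0;)
            table[i][j] = (a[i] == b[j])
                              ? table[i + 1][j + 1] + 1
                              : std::max(table[i + 1][j], table[i][j + 1]);

    LcsStats stats;
    stats.length = table[0][0];

    // Walk one LCS from the front and track runs of consecutive matches.
    std::size_t i = 0, j = 0, run = 0;
    while (i < n && j < m) {
        if (a[i] == b[j]) {
            ++run; ++i; ++j;
            stats.longestRun = std::max(stats.longestRun, run);
        } else {
            run = 0;
            if (table[i + 1][j] >= table[i][j + 1]) ++i; else ++j;
        }
    }
    return stats;
}

// True if the paragraphs are similar enough to be diffed against each other.
bool similarEnough(const std::string& a, const std::string& b)
{
    const std::size_t shorter = std::min(a.size(), b.size());
    if (shorter == 0)
        return false;
    const LcsStats stats = lcsStats(a, b);
    return stats.length * 2 >= shorter || stats.longestRun * 10 >= shorter;
}
```

The table needs memory proportional to the product of the two paragraph lengths, which is acceptable for a sketch but may need a leaner variant for very long paragraphs.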

Tuesday, June 23, 2009

comparing with rsids

Now that the rsid is almost done (up to slight modifications and bug fixing that will probably be necessary in the course of time) I turned to the document comparison itself.

The first more or less obvious use of the rsid is whenever two lines (text nodes) or two chars are compared. In addition to the condition that the two things being compared are the same, I add the condition that their rsids are the same. This ensures that the two things are not the same just by chance (which can easily be the case when comparing chars or short and repetitive lines in the document) but because they were created in the same edit session.

Here is an example of how it goes in the CompareTxtNd function:

BOOL SwCompareLine::CompareTxtNd( const SwTxtNode& rDstNd,
                                  const SwTxtNode& rSrcNd )
{
    BOOL bRet = FALSE;

    const SwDoc* pDestDoc = rDstNd.GetDoc();
    const SwDoc* pSrcDoc = rSrcNd.GetDoc();

    UINT32 nDstRsid = 0;
    UINT32 nSrcRsid = 0;

    // Only compare rsids if the rsidRoots are the same (docs have the same origin)
    if ( pSrcDoc->getRsidRoot() == pDestDoc->getRsidRoot() )
    {
        nDstRsid = rDstNd.GetRsid( 0, rDstNd.Len() );
        nSrcRsid = rSrcNd.GetRsid( 0, rSrcNd.Len() );
    }

    // Keep it simple for now!
    if( rDstNd.GetTxt() == rSrcNd.GetTxt() && nDstRsid == nSrcRsid )
    {
        // The text is the same, but are the "special attributes" (0xFF)
        // the same as well??
        bRet = TRUE;
    }
    return bRet;
}

A few things to note:
The rsid check is applicable only if the two documents originate from the same document, i.e. if their root rsids are the same, which is easily checked with the handy getRsidRoot() method;
If the root rsids are not the same or the documents don't have rsids, both rsids stay 0, so only the text is compared;
To get the rsids I added a method to SwTxtNode, GetRsid(start, end).
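
The effect of these rules can be tried out in isolation with a small standalone model. The TextNode struct and the sameLine function below are hypothetical stand-ins, not the real SwTxtNode or SwCompareLine API.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for a text node (not the real SwTxtNode).
struct TextNode {
    std::string text;
    std::uint32_t rsid;      // rsid of the text, 0 meaning "no rsid"
    std::uint32_t rsidRoot;  // root rsid of the document that owns the node
};

// Two lines count as equal if their text matches and, when both documents
// share the same root rsid, their rsids match as well.
bool sameLine(const TextNode& dst, const TextNode& src)
{
    std::uint32_t dstRsid = 0;
    std::uint32_t srcRsid = 0;

    // Only look at the rsids if the documents have the same origin.
    if (dst.rsidRoot == src.rsidRoot) {
        dstRsid = dst.rsid;
        srcRsid = src.rsid;
    }
    return dst.text == src.text && dstRsid == srcRsid;
}
```

Two nodes with the same text and the same root but different rsids compare as different, while the same text in documents with unrelated roots still compares as equal, because the rsids are ignored in that case.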

Thursday, June 18, 2009

More rsid

The main functionality of rsid is now ready. You can download a patch with the current changes here.

I had to handle the various ways text can be inserted in the document, since different functions are called each time (e.g. when just writing text, pasting it from the same document, pasting from an external source, or splitting a paragraph in two).

Rsids are now saved as 32-bit hexadecimal numbers in the content.xml of the document and, following the OOXML standard, each rsid is greater than the previous one (the one generated in the previous session). The session rsid as well as the root rsid (the one generated when the original document was created) are saved as document settings in the settings.xml (they are not yet in hex, so the decimal values there correspond to the hex values in content.xml).
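
As a rough standalone illustration of the two points above (zero-padded hex formatting and increasing session numbers), here is a small sketch. The function names rsidToHex and nextRsid and the size of the random step are assumptions made for this example, not what the patch does.

```cpp
#include <cstdint>
#include <cstdio>
#include <random>
#include <string>

// Formats an rsid as 8 lowercase, zero-padded hex digits, the form that
// appears in the style:rsid attribute.
std::string rsidToHex(std::uint32_t rsid)
{
    char buf[9];
    std::snprintf(buf, sizeof buf, "%08lx", static_cast<unsigned long>(rsid));
    return buf;
}

// Hypothetical generator: adds a random positive step to the previous
// session's rsid. The step range is an assumption; a real implementation
// would also have to decide what to do when the 32-bit value wraps around.
std::uint32_t nextRsid(std::uint32_t previous, std::mt19937& rng)
{
    std::uniform_int_distribution<std::uint32_t> step(1, 1u << 21);
    return previous + step(rng);
}
```

For instance, rsidToHex(79052) gives "000134cc", which is how the decimal value in settings.xml and the hex value in content.xml of the example relate.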

Here is an example of how it works:

I created a document with the following text:

This is the original text of a document that has three paragraphs.
Here is the second paragraph of the original text.
Here is the third paragraph of the original text.

And this is the content.xml:

[...]
<office:automatic-styles>
<style:style style:name="T1" style:family="text">
<style:text-properties style:rsid="000134cc"/>
</style:style>
</office:automatic-styles>
<office:body>
<office:text>
[...]
<text:p text:style-name="Standard">
<text:span text:style-name="T1">
This is the original text of a document that has three paragraphs.
</text:span>
</text:p>
<text:p text:style-name="Standard">
<text:span text:style-name="T1">Here is the second paragraph of the original text.</text:span>
</text:p>
<text:p text:style-name="Standard">
<text:span text:style-name="T1">Here is the third paragraph of the original text.</text:span>
</text:p>
</office:text>
In the settings.xml I have the following two lines:
<config:config-item config:name="Rsid" config:type="int">79052</config:config-item>
<config:config-item config:name="RsidRoot" config:type="int">79052</config:config-item>
Then I added some text to the second paragraph and split the third paragraph:

This is the original text of a document that has three paragraphs.
Here is the second paragraph and some text from the second session of the original text.
Here is the third paragraph
of the original text.

[...]
<office:automatic-styles>
<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="000134cc"/>
</style:style>
<style:style style:name="T1" style:family="text">
<style:text-properties style:rsid="00202baa"/>
</style:style>
</office:automatic-styles>
<office:body>
<office:text>
[...]
<text:p text:style-name="P1">
This is the original text of a document that has three paragraphs.
</text:p>
<text:p text:style-name="P1">
Here is the second paragraph
<text:span text:style-name="T1">and some text from the second session</text:span>
of the original text.
</text:p>
<text:p text:style-name="P1">Here is the third paragraph </text:p>
<text:p text:style-name="P1">
<text:span text:style-name="T1">of the original text.</text:span>
</text:p>
</office:text>
and in the settings:
<config:config-item config:name="Rsid" config:type="int">2108330</config:config-item>
<config:config-item config:name="RsidRoot" config:type="int">79052</config:config-item>
Finally, I copied some text from the first two paragraphs and pasted it in the middle of the third:

This is the original text of a document that has three paragraphs.
Here is the second paragraph and some text from the second session of the original text.
Here is the original text of a document that has three paragraphs.
Here is the third paragraph
of the original text.
[...]
<office:automatic-styles>
<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="000134cc"/>
</style:style>
<style:style style:name="P2" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="00202baa"/>
</style:style>
<style:style style:name="P3" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="000134cc"/>
</style:style>
<style:style style:name="T1" style:family="text">
<style:text-properties style:rsid="00202baa"/>
</style:style>
<style:style style:name="T2" style:family="text">
<style:text-properties style:rsid="003eb852"/>
</style:style>
</office:automatic-styles>
<office:body>
<office:text>
[...]
<text:p text:style-name="P1">
This is the original text of a document that has three paragraphs.
</text:p>
<text:p text:style-name="P1">
Here is the second paragraph
<text:span text:style-name="T1">and some text from the second session</text:span>
of the original text.
</text:p>
<text:p text:style-name="P1">
Here is the
<text:span text:style-name="T2">
original text of a document that has three paragraphs.
</text:span>
</text:p>
<text:p text:style-name="P1">
<text:span text:style-name="T2">Here is the</text:span>
third paragraph
</text:p>
<text:p text:style-name="P2">of the original text.</text:p>
</office:text>
and in the settings:
<config:config-item config:name="Rsid" config:type="int">4110418</config:config-item>
<config:config-item config:name="RsidRoot" config:type="int">79052</config:config-item>

Monday, June 15, 2009

Saving works!

I managed to save the new rsid attribute in ODF, i.e. I am now able to see the rsid values in the content.xml of my .odt files. For now it is saved as a signed decimal number, but this will be revised soon. Here is a simple example:



<office:document-content office:version="1.2">
<office:scripts/>

<office:font-face-decls>
<style:font-face style:name="Liberation Serif" svg:font-family="'Liberation Serif'" style:font-family-generic="roman" style:font-pitch="variable"/>
<style:font-face style:name="Liberation Sans" svg:font-family="'Liberation Sans'" style:font-family-generic="swiss" style:font-pitch="variable"/>
<style:font-face style:name="Arial" svg:font-family="Arial" style:font-family-generic="system" style:font-pitch="variable"/>
</office:font-face-decls>

<office:automatic-styles>

<style:style style:name="T1" style:family="text">
<style:text-properties style:rsid="-965818964"/>
</style:style>
</office:automatic-styles>

<office:body>

<office:text>

<text:sequence-decls>
<text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
<text:sequence-decl text:display-outline-level="0" text:name="Table"/>
<text:sequence-decl text:display-outline-level="0" text:name="Text"/>
<text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
</text:sequence-decls>

<text:p text:style-name="Standard">
<text:span text:style-name="T1">Original text.</text:span>
</text:p>
</office:text>
</office:body>
</office:document-content>


And after editing the file:



<office:document-content office:version="1.2">
<office:scripts/>

<office:font-face-decls>
<style:font-face style:name="Liberation Serif" svg:font-family="'Liberation Serif'" style:font-family-generic="roman" style:font-pitch="variable"/>
<style:font-face style:name="Arial" svg:font-family="Arial" style:font-family-generic="system" style:font-pitch="variable"/>
</office:font-face-decls>

<office:automatic-styles>

<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
<style:text-properties style:rsid="-965818964"/>
</style:style>

<style:style style:name="T1" style:family="text">
<style:text-properties style:rsid="2141787107"/>
</style:style>
</office:automatic-styles>

<office:body>

<office:text>

<text:sequence-decls>
<text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
<text:sequence-decl text:display-outline-level="0" text:name="Table"/>
<text:sequence-decl text:display-outline-level="0" text:name="Text"/>
<text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
</text:sequence-decls>
<text:p text:style-name="P1">Original text.</text:p>

<text:p text:style-name="P1">
<text:span text:style-name="T1">Added text.</text:span>
</text:p>
</office:text>
</office:body>
</office:document-content>




However, in most cases some pretty strange behavior occurs, which I am now trying to fix. I suspect the problems come from the calls to Insert() when an existing document is opened and from having multiple rsid values for one chunk of text.

Saturday, June 13, 2009

Hi all! It's been a few weeks of GSOC and here is what I've been doing.

I spent some time exploring how the current document comparison function works and how it is implemented. It is a pretty straightforward algorithm that seems to work well in some cases but is not quite satisfactory in others. My initial intention was to work directly on the algorithm, but a rather different approach was suggested to me: using the rsid (revision save id) functionality.

Rsid is a unique number attached to each edit session of a document. It helps with tracking changes and merging documents, and opens the way for a different comparison algorithm. (more info about rsid)

Viewing the rsid as a char attribute (e.g. "bold") I implemented an SvxRsidItem class (like SvxWeightItem for "bold" ) which extends the SfxUInt32Item.

Every time a document is opened or created, the SwDoc constructor initializes the session number to a random number (I added a class member, nRsid, to store it). The idea is to use that number every time the content of the document is changed. Currently, I do that in the SwDoc::Insert() method, which is called every time new text is inserted. However, this method is also called with the last character of every paragraph when a document is opened, which is a bit of a problem, but I hope to resolve that soon. Also, as far as I noticed, it isn't called when text is copied and pasted within the same document...

Currently I am working on saving the rsids in an ODF document.