Wednesday, June 24, 2009

continued

I use the rsid in a similar way when comparing chars in the ChangesInLine method, which looks for similarities between two lines (paragrphs) which are not identical. Currently it checks for identical chars in the beginning and end of the two lines, so if you have the text:

This is a paragraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, and does nothing in the 'middle'.
and add "short" in the beginning, and delete "does" in the end, the result of comparing will be:

This is a short paragraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, andparagraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, and does nothing in the 'middle'.
If you just delete the first and last chars the function will say that there are no similarities, while if you have two completely different paragraphs with the same first letter it will not mark as them completely different as (i.e. ChangesInLine will return true ).

To deal with this I decided to use the longest common subsequence of the two paragraphs, which allows for more accurate tracking of changes/similarities. The above comparison now looks like that:

This is a short paragraph to test the ChangesInLine() method of the class SwCompareLine. It doesn't work very well because it only checks for equal characters in the beginning and end of the paragraphs that are being compared, and does nothing in the 'middle'.
If there aren't enough similarities (the lcs will always contain some letters) I want the two paragraphs to be marked as different (i.e. ChangesInLine to return false) so I check if the length of the lcs is at least half of the length of the shorter paragraph or that the length of the longest continuous chunk of text in the lcs is at least ~10% of the length of the shorter paragraph.

No comments:

Post a Comment