Friday, July 3, 2009

Comparing paragraphs (lines)

Introducing the longest common subsequence (LCS) in the ChangesInLine() method, which compares characters to find similarities between two paragraphs, was certainly an improvement to the document comparison function, but there is still the question of comparing lines (paragraphs). The current implementation is pretty good at finding the lines that are identical in the two documents (the unchanged lines). However, it is not so good at matching the rest of the lines (the changed lines), which means finding which lines were entirely inserted or deleted and which were changed only a bit (those are then compared character by character with ChangesInLine()). This is where the rsid is very helpful.

The idea is to run a comparison algorithm, such as LCS, on the paragraph rsids of the two arrays of lines (paragraphs) that were inserted, deleted or changed. The lines that fall in the LCS are then treated as changed, and we call ChangesInLine() on them. The rest are simply inserted or deleted. This happens in the CheckForChangesInLine() method. Currently, this method calls ChangesInLine() for the first lines in each array, then for the second lines and so on, and stops when two lines are completely different. Thus, if someone makes the following list of names:

Theresa
Bernice
Kristin
Elsie
Lois

and then someone adds three more names as well as family names:

Theresa Maldonado
Bernice Cervantes
Michelle Rosario
Kristin Wiley
Elsie Howell
Amber Weiss
Laura Dickerson
Lois Burnett

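the result of comparing the new document with the old one is the following, where only the first two lines are paired as changed, the next six are treated as inserted, and the last three (Kristin, Elsie, Lois) as deleted: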

Theresa Maldonado
Bernice Cervantes
Michelle Rosario
Kristin Wiley
Elsie Howell
Amber Weiss
Laura Dickerson
Lois Burnett
Kristin
Elsie
Lois
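
The pairwise behaviour described above can be sketched roughly as follows. This is only an illustration, not the actual CheckForChangesInLine() code: the function names are made up, and the test for "completely different" is an assumption (here two lines count as related when one is a prefix of the other).

```python
def related(old_line, new_line):
    # Hypothetical stand-in for "not completely different":
    # one line is a prefix of the other.
    return new_line.startswith(old_line) or old_line.startswith(new_line)


def pairwise_compare(old_lines, new_lines):
    """Pair lines by position and stop at the first unrelated pair."""
    changed = []
    k = 0
    while k < min(len(old_lines), len(new_lines)) and related(old_lines[k], new_lines[k]):
        changed.append((old_lines[k], new_lines[k]))
        k += 1
    # Everything after the stopping point is reported wholesale.
    inserted = new_lines[k:]
    deleted = old_lines[k:]
    return changed, inserted, deleted


old_doc = ["Theresa", "Bernice", "Kristin", "Elsie", "Lois"]
new_doc = [
    "Theresa Maldonado", "Bernice Cervantes", "Michelle Rosario",
    "Kristin Wiley", "Elsie Howell", "Amber Weiss",
    "Laura Dickerson", "Lois Burnett",
]

pw_changed, pw_inserted, pw_deleted = pairwise_compare(old_doc, new_doc)
print(pw_changed)
print(pw_inserted)
print(pw_deleted)
```

The comparison stops at the third pair (Kristin against Michelle Rosario), so the remaining old lines are never matched with their edited versions.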

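On the other hand, the improved LCS and rsid comparison gives the following, where the five original lines are matched with their edited versions and only the three new names are treated as inserted lines: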

Theresa Maldonado
Bernice Cervantes
Michelle Rosario
Kristin Wiley
Elsie Howell
Amber Weiss
Laura Dickerson
Lois Burnett
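
A minimal sketch of matching lines through an LCS over paragraph rsids might look like this. It is not the real implementation: the helper name lcs_pairs and the rsid values are invented for illustration, and it assumes that an edited paragraph keeps the rsid it had in the old document while a newly added paragraph gets a different one.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) forming a longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover one optimal alignment.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


old_lines = ["Theresa", "Bernice", "Kristin", "Elsie", "Lois"]
new_lines = [
    "Theresa Maldonado", "Bernice Cervantes", "Michelle Rosario",
    "Kristin Wiley", "Elsie Howell", "Amber Weiss",
    "Laura Dickerson", "Lois Burnett",
]
# Invented rsid values: 0x11 for the original paragraphs, 0x22 for added ones.
old_rsids = [0x11] * 5
new_rsids = [0x11, 0x11, 0x22, 0x11, 0x11, 0x22, 0x22, 0x11]

pairs = lcs_pairs(old_rsids, new_rsids)
matched_old = {i for i, _ in pairs}
matched_new = {j for _, j in pairs}

# Lines in the LCS are "changed" and would go on to ChangesInLine().
changed = [(old_lines[i], new_lines[j]) for i, j in pairs]
inserted = [line for j, line in enumerate(new_lines) if j not in matched_new]
deleted = [line for i, line in enumerate(old_lines) if i not in matched_old]

print(changed)
print(inserted)
print(deleted)
```

Under these assumptions every old line is paired with its edited version, the three added names come out as inserted, and nothing is reported as deleted.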
