In the previous post I showed why a string-similarity measure that is both accurate and fast to compute is essential: without it the second step would fail, and we would only be able to match paragraphs that haven't changed at all.
The following ratios each measure, to some degree, the similarity between two paragraphs:
1) average length of the two paragraphs / length of the longest common subsequence (LCS)
2) average length / length of the longest common substring
3) squared average length / sum of the squared lengths of the common substrings
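The three ratios can be sketched as follows. This is a minimal illustration, not the implementation from the series: it assumes "average length" means the mean of the two paragraph lengths, and it uses `difflib`'s matching blocks to stand in for the set of common substrings in ratio 3, since the post does not specify how those substrings are found.

```python
# Sketch of the three similarity ratios. Assumes the two paragraphs share
# at least one common character (otherwise the denominators are zero).
from difflib import SequenceMatcher


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (classic DP, O(len(a)*len(b)))."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, ch_b in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == ch_b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous common substring (also O(n*m) DP)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, 1):
            if ch == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def ratios(a: str, b: str) -> tuple[float, float, float]:
    avg = (len(a) + len(b)) / 2
    r1 = avg / lcs_length(a, b)
    r2 = avg / longest_common_substring(a, b)
    # Ratio 3: difflib's non-overlapping matching blocks approximate the
    # "common substrings"; sum their squared lengths.
    sq = sum(m.size ** 2 for m in SequenceMatcher(None, a, b).get_matching_blocks())
    r3 = avg ** 2 / sq
    return r1, r2, r3
```

For identical paragraphs all three ratios are 1; the more the paragraphs differ, the larger the ratios grow, which is what makes a fixed border value usable as a decision threshold.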
Choosing the best among them, as well as a reference constant (the border value), required a bit of statistics which I couldn't find online, so I gathered the data myself.
I used paragraphs of various lengths (6 to 2268 characters) and various edit distances (0.29% to 56% of the contents changed between pairs of paragraphs meant to be matched).
I computed the above ratios for each pair of paragraphs (blue marks the pairs of similar paragraphs, red the different ones).
Here is the data; the second-row histograms are zoomed-in views that include only the leftmost 8% of the different paragraphs.
1) if we choose border value 2:
23% of the similar paragraphs fall within 25% below the limit
41% of the different paragraphs fall within 25% above the limit
2) if we choose border value 8:
8% of the similar paragraphs fall within 25% below the limit
1% of the different paragraphs fall within 25% above the limit
3) if we choose border value 30:
0% of the similar paragraphs fall within 25% below the limit
0% of the different paragraphs fall within 25% above the limit
The conclusion is:
1) inaccurate and slower
2) quite accurate and fastest
3) most accurate but slower
So a good trade-off between time and accuracy is to use ratio 2) for step 2 and ratio 3) for step 3 (see the previous post for the "steps").
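The resulting two-stage test might look like the sketch below. The function and constant names are mine, the border values are the ones chosen above, and `difflib` again stands in for the common-substring machinery (its `find_longest_match` junk heuristic can underestimate the longest common substring on long inputs, so treat this as an approximation).

```python
# Two-stage match test: the cheap longest-common-substring ratio (border 8)
# filters candidates, then the stricter sum-of-squares ratio (border 30)
# confirms them. Names and structure are illustrative, not from the post.
from difflib import SequenceMatcher

FAST_BORDER = 8     # border value for ratio 2 (longest common substring)
STRICT_BORDER = 30  # border value for ratio 3 (sum of squared block lengths)


def paragraphs_match(a: str, b: str) -> bool:
    avg = (len(a) + len(b)) / 2
    sm = SequenceMatcher(None, a, b)
    # Stage 1: cheap filter on the longest common substring.
    longest = sm.find_longest_match(0, len(a), 0, len(b)).size
    if longest == 0 or avg / longest >= FAST_BORDER:
        return False
    # Stage 2: stricter check on the sum of squared common-block lengths.
    sq = sum(m.size ** 2 for m in sm.get_matching_blocks())
    return avg ** 2 / sq < STRICT_BORDER
```

Running the expensive check only on pairs that survive the cheap one is what makes the combination both fast and accurate.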