[rproxy-devel] rdiff deltas not very good compared to pysync, why?

Shirish H. Phatak shirish@tacitnetworks.com
Thu, 18 Apr 2002 10:14:43 -0400


   Can I have a look at the test files? I believe my patch which 
improves the size of deltas for sequences of matches is already in 
0.9.4+. This patch also fixed a rolling checksum bug.

    I am not very familiar with pysync, but I could definitely take a 
look at the rdiff output to see if anything is obviously wrong.


Donovan Baarda wrote:

>Just been doing some work on librsync for a Python extension, and noticed
>that it is producing deltas more than twice as big as pysync produces, using
>the same block size. I'm using the released v0.9.5 code.
>I'm using some test files I generated for pysync testing. These consist of a
>256K random data "oldfile.bin", and a slightly larger "newfile.bin" that
>includes random edits (insert,replace,delete,copy) of "oldfile.bin". Because
>this is all random data, it doesn't compress.
>pyproxy can produce both rsync and xdelta style deltas. The xdelta results
>should be pretty close to optimal, so they make a good basis for comparison.
>The default block size for pyproxy is 1024, so I used "-b 1024" when running
>rdiff to force the same block size. The results I got were:
>Operation              size
>source oldfile.bin   262144
>target newfile.bin   325316
>rdiff signature        3084
>pyproxy sig            8090
>pyproxy xdelta       103463
>pyproxy rdelta       131389
>rdiff delta          319252
>As you can see, the "rdiff signature" was less than half the size of the
>"pysync sig". This is understandable, as pysync uses a Python pickled dict
>of dicts for its sigfile format.
>However, the "rdiff delta" is more than two times the size of "pysync
>rdiff", and more than three times the optimal "pyproxy xdelta". Since pysync
>uses a pickled Python list of (offset,length) tuples and insert strings, I
>find this very surprising. None of these are faulty, as a "patch" by any of
>the tools produces the correct result.
>Note that pysync does use gzip context compression (compressing the whole
>data stream, including hits, but only including the compressed output of
>misses), and I don't think rdiff does. However, in this case the input data
>was all random so compression has no effect. Compressing any of the inputs
>or outputs yields negligible change.
>I haven't examined the librsync code to figure out why yet, but I suspect
>that there might be a bug in the rolling checksums. There is certainly
>something wrong.
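
Since both messages point at the rolling checksum, here is a minimal Python sketch of how one might sanity-check one. It assumes the textbook rsync-style weak checksum (two 16-bit sums); it is not librsync's or pysync's actual code, and the names weak_checksum, roll and check_rolling are made up for illustration. The idea is that a checksum rolled forward one byte at a time must equal a checksum recomputed from scratch at every offset.

```python
MOD = 1 << 16  # assumption: both running sums are kept modulo 2^16


def weak_checksum(block):
    """Compute the pair (a, b) directly over a whole block of bytes."""
    n = len(block)
    a = sum(block) % MOD
    b = sum((n - i) * x for i, x in enumerate(block)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte: drop out_byte, append in_byte."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return a, b


def check_rolling(data, block_len):
    """Return True if the rolled checksum matches a fresh one at every offset."""
    a, b = weak_checksum(data[:block_len])
    for start in range(1, len(data) - block_len + 1):
        a, b = roll(a, b, data[start - 1], data[start + block_len - 1], block_len)
        if (a, b) != weak_checksum(data[start:start + block_len]):
            return False
    return True
```

Running check_rolling over a few kilobytes of data with the block size under test (for example 1024) should return True. A False result would mean the roll step has drifted away from the direct computation, which would cause matches at unaligned offsets to be missed and could inflate the delta.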