[rproxy-devel] rdiff deltas not very good compared to pysync, why?

Shirish H. Phatak shirish@tacitnetworks.com
Wed, 24 Apr 2002 12:52:45 -0400

This is a multi-part message in MIME format.
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit

Hi all,

      I have been really backed up and could not get to this until 
today. Sorry for the delay.

      I am attaching a patch against anonymous cvs that should fix the 
patch size problem. The patch will also cleanly apply to the 0.9.5 tree. 
The problem was my fault, I forgot to roll out bytes I was skipping over 
as an optimization in rs_delta_scan to avoid recomputing the weak 
checksum. With this patch the size of the rdiff delta with Donovan's 
samples was 131091 bytes. Please let me know if this solves the problem. 
Incidently bugs with rolling checksums are most easily detected by using 
the --paranoia option in rdiff which will abort on the first mismatched 
sum;  thanks to Markus for pointing this out.

      This patch also includes a memory leak fix for a scoop buffer leak 
(upto 16k per job with default parameters). I had submitted the fix 
earlier but it probably got lost. This will only affect those who are 
using librsync as a library in long running programs.

       Since there appears to be a dedicated group of users and lots of 
activity, maybe we can convince Martin to roll in these patches and make 
a new release?


Donovan Baarda wrote:

>Just been doing some work on librsync for a Python extension, and noticed
>that it is producing deltas more than twice as big as pysync produces, using
>the same block size. I'm using the released v0.9.5 code.
>I'm using some test files I generated for pysync testing. These consist of a
>256K random data "oldfile.bin", and a slightly larger "newfile.bin" that
>includes random edits (insert,replace,delete,copy) of "oldfile.bin". Because
>this is all random data, it doesn't compress.
>pyproxy can produce both rsync and xdelta style deltas. The xdelta results
>should be pretty close to optimal, so they make a good basis to compare
>The default block size for pyproxy is 1024, so I used "-b 1024" when running
>rdiff to force the same block size. The results I got were;
>Operation     	       size
>source oldfile.bin     262144
>target newfile.bin     325316
>rdiff signature	         3084
>pyproxy sig	         8090
>pyproxy xdelta	       103463
>pyproxy rdelta	       131389
>rdiff delta	       319252
>As you can see, the "rdiff signature" was less than half the size of than
>"pysync sig". This is understandable, as pysync uses a Python pickled dict
>of dicts for it's sigfile format.
>However, the "rdiff delta" is more than two times the size of "pysync
>rdiff", and more than three times the optimal "pyproxy xdelta". Since pysync
>uses a pickled Python list of (offset,length) tupples and insert strings, I
>find this very surprising. None of these are faulty, as a "patch" by any of
>the tools uses the correct result.
>Note that pysync does use gzip context compression (compressing the whole
>data stream, including hits, but only including the compressed output of
>misses), and I don't thing rdiff does. However, in this case the input data
>was all random so compression has no effect. Compressing any of the inputs
>or outputs yeilds negligable change.
>I haven't examined the librsync code to figure out why yet, but I suspect
>that there might be a bug in the rolling checksums. There is certainly
>something wrong.

Content-Type: text/plain;
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;

? COPYING_delta
Index: delta.c
RCS file: /cvsroot/librsync/delta.c,v
retrieving revision 1.29
diff -u -r1.29 delta.c
--- delta.c	8 Aug 2001 04:58:17 -0000	1.29
+++ delta.c	24 Apr 2002 16:19:25 -0000
@@ -164,8 +164,8 @@
     rs_long_t            match_where;
     int                  search_pos, end_pos;
     unsigned char        *inptr = (unsigned char *) p;
-    uint32_t             s1 = job->weak_sig & 0xFFFF;
-    uint32_t             s2 = job->weak_sig >> 16;
+    unsigned             s1 = job->weak_sig & 0xFFFF;
+    unsigned             s2 = job->weak_sig >> 16;
     if (job->basis_len) {
         rs_log(RS_LOG_ERR, "somehow got nonzero basis_len");
@@ -190,8 +190,18 @@
     for (search_pos = 0; search_pos <= end_pos; search_pos++) {
         size_t this_len = job->block_len;
-        /* Did we inherit the signature from rs_delta_match?*/
+        /* Did we inherit the signature from rs_delta_match, if
+         * so we know this block won't match and we should simply
+         * skip the first char.
+         */
         if (job->have_weak_sig < 0) {
+            /* advance by one; roll out the byte we just skipped over. */
+            unsigned char a = inptr[search_pos];
+            unsigned shift = a + RS_CHAR_OFFSET;
+            s1 -= shift;
+            s2 -= this_len * shift;
+            job->weak_sig = (s1 & 0xffff) | (s2 << 16);
             job->have_weak_sig = 1;
             /* We already know that this block won't match!*/
Index: job.c
RCS file: /cvsroot/librsync/job.c,v
retrieving revision 1.28
diff -u -r1.28 job.c
--- job.c	9 Apr 2001 15:02:40 -0000	1.28
+++ job.c	24 Apr 2002 16:19:25 -0000
@@ -84,6 +84,9 @@
 rs_result rs_job_free(rs_job_t *job)
+    if (job->scoop_buf)
+            free(job->scoop_buf);
     rs_bzero(job, sizeof *job);