[rproxy-devel] rdiff deltas not very good compared to pysync, why?

Donovan Baarda abo@minkirri.apana.org.au
Fri, 19 Apr 2002 01:18:28 +1000

On Thu, Apr 18, 2002 at 10:14:43AM -0400, Shirish H. Phatak wrote:
> Hi,
>   Can I have a look at the test files? I believe my patch which 
> improves the size of deltas for sequences of matches is already in 
> 0.9.4+. This patch also fixed a rolling checksum bug.
>    I am not very familliar with pysync, but I could definitely take a 
> look at the rdiff output to see if anything is obviously wrong.

I've sent the files plus pysync itself to you directly.

I have a few theories about things that might contribute to pysync doing

pysync uses pure adler32 checksums for the rolling checksum, which may or
may not be better than what rdiff uses. The signature is an adler32 keyed
dictionary of md5sum keyed dictionarys containing offsets. This means that
when there are rolling checksum collisions (two different blocks with same
rolling checksum), pysync will compare md5sums against all blocks with that
rolling checksum. 

rdiff might use what pysync's xdelta implementation uses instead; a
mapping of rolling checksums to a single offset (and md4sum). What this
means is rolling checksum collisions result in the coliding block
"disapearing" as a potential match. A high level of rolling checksum
collisions results in a high number of "misses" that could have been

The reason pysync's xdelta implentation uses this simplified approach is
xdelta uses a much smaller block size, and it automatically byte extends any
match forwards and backwards as far as it can go. This means a few false
misses have less impact, because they are picked up by the byte extending

There is a useful heuristic of favoring the earliest block when there is a
collision, on the basis that extending matches forwards from the earlier
match gives longer match sequences. 

ABO: finger abo@minkirri.apana.org.au for more info, including pgp key