The procedure described above assumes that an existing backup is checked for identical files before a file is backed up again. This applies to files in the previous backup as well as to files in the backup currently being created. Of course it would make no sense to compare every file to be backed up directly, byte by byte, with the previous backup. Instead, the md5 sums of the previous backup are compared with the md5 sum of the file to be backed up via a hash table.
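The idea can be illustrated with a short sketch (this is not storeBackup's actual Perl code; the names `md5sum`, `known` and `store` are made up for the example). A hash table keyed by md5 sum detects duplicates, which are then stored as hard links:

```python
import hashlib
import os
import shutil

def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 sum of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# md5 sum -> path of the copy already stored in the backup
known = {}

def store(src, dst):
    """Store src at dst; if a file with the same md5 sum is already
    in the backup, set a hard link instead of copying again."""
    digest = md5sum(src)
    if digest in known:
        os.link(known[digest], dst)   # duplicate: just set a hard link
    else:
        shutil.copy2(src, dst)        # new content: store it once
        known[digest] = dst
```

With this scheme a duplicate costs one md5 computation and one hard link, no matter how large the file is.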

Computing the md5 sum is fast, but for a large amount of data it is still not fast enough. For this reason storeBackup first checks whether the file has changed at all since the last backup (path + file name, ctime, mtime and size are identical). If nothing has changed, the md5 sum of the last backup is adopted and the hard link is set. If this initial check detects a difference, the md5 sum is computed and storeBackup checks whether another file with the same md5 sum exists. (Comparison against several backup series uses an extended but similarly efficient process.) With this approach only a few md5 sums have to be calculated per backup. If you want to tune storeBackup, especially when saving via NFS, there are two things you can do, both shown in the measurements below: use option lateLinks and use ``blocked files''.
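The metadata shortcut can be sketched like this (a simplified illustration, assuming the entry for the same path in the previous backup is available as a dict; the function names are hypothetical):

```python
import os

def unchanged(path, prev):
    """True if path looks untouched since the previous backup:
    same size, mtime and ctime as recorded back then."""
    st = os.stat(path)
    return (st.st_size == prev["size"]
            and st.st_mtime == prev["mtime"]
            and st.st_ctime == prev["ctime"])

def md5_for_backup(path, prev, md5sum):
    """Return the md5 sum to record for path, avoiding a full read
    of the file whenever the metadata check succeeds."""
    if prev is not None and unchanged(path, prev):
        return prev["md5"]   # adopt the md5 sum of the last backup
    return md5sum(path)      # new or modified file: compute it
```

Only files that fail the cheap stat comparison are actually read, which is why only a few md5 sums have to be calculated per backup.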

The following performance measurements only show the duration of the backup itself (without the subsequent call of storeBackupUpdateBackup.pl). They were made with a beta version of storeBackup 2.0.
Some background on the following numbers: the backup was run on an Athlon X2, 2.3 GHz, 4 GB RAM. The NFS server was an Athlon XP, 1.84 GHz, 1.5 GB RAM. The network ran at 100 MBit/s, and storeBackup was used with its standard parameters. Times are given as hours:minutes:seconds or minutes:seconds. The size of sourceDir was 12 GB; the backup created by storeBackup was 9.2 GB, comprising 4769 directories and 38499 files, of which 5038 files were linked internally, i.e. they were duplicates. The source data were my files and the ``Desktop'' of my Windows XP laptop, i.e. ``real'' data.
The first table shows the time for copying the data to the NFS server with standard programs. The NFS server is mounted with option async, which is a performance optimization and not the standard configuration.

command      duration   size of backup
cp -a           28:46   12 GB
tar jcf      01:58:20   9.4 GB
tar cf          21:06   12 GB

Everything is as expected: tar with compression is much slower than the others, and cp is slower than tar because it has to create lots of individual files. One number is surprising: the backup file created by tar jcf is 9.4 GB, while the backup created by storeBackup is only 9.2 GB. The reason lies in the 5038 internally linked files - storeBackup stores the duplicates only once.

This benchmark does not show the effect of comparing file contents, but it makes a big difference in performance and especially in disk space used: if only the time stamp of a file has changed, traditional backup software will store the file again in an incremental backup - storeBackup will only create a hard link.

Now let's run storeBackup on the same data. The NFS server is still mounted with option async. Nothing in the source directory changed between the first, second and third backup.

            1.19, standard   2.0, standard   2.0, lateLinks   (mount with async)
1. backup    49:51  100%      49:20   99%     31:14  63%
2. backup    02:45  100%      02:25   88%     00:42  25%      file system read cache empty
3. backup    01:51  100%      01:54  100%     00:26  23%      file system read cache filled

We can see that storeBackup 2.0 with standard parameters performs about like version 1.19, while option lateLinks cuts the time of the first (full) backup by roughly a third and reduces the repeated backups of unchanged data to a quarter of the former time.

Now let's do the same with an NFS mount without ``tricks'' like the option async:

command      duration   size of backup
cp -a           37:51   12 GB
tar jcf      02:02:01   9.4 GB
tar cf          25:05   12 GB

            1.19, standard   2.0, standard   2.0, lateLinks   (mount with sync)
1. backup    53:35  100%      49:20  100%     38:53  63%
2. backup    05:36  100%      05:24   96%     00:43  13%      file system read cache empty
3. backup    05:10  100%      04:54   95%     00:27   9%      file system read cache filled

We can see that with a sync mount the advantage of lateLinks is even larger: the second and third backup drop to 13% and 9% of the time needed by version 1.19.

Conclusion: if you back up over NFS, you can make it really fast using option lateLinks. See section 7.6 for how to configure it.
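As a rough sketch of what that might look like (the exact keywords must be taken from section 7.6; the paths here are illustrative):

```
# excerpt from a storeBackup configuration file
sourceDir = /home/jim
backupDir = /mnt/nfs-backup
lateLinks = yes
```

With lateLinks the client mainly writes new data during the backup; the postponed hard links are created afterwards by running storeBackupUpdateBackup.pl, preferably locally on the backup server.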

Using ``blocked files'' also improves performance a lot, because only a small percentage of an image file has to be copied or compressed. See the description of blocked files for the influence of this option on performance and required space.
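The idea behind blocked files can be sketched as follows (a simplified illustration, not storeBackup's actual code; the block size is just an example): the large file is read in fixed-size blocks, each block gets its own md5 sum, and only blocks whose sum differs from the previous backup have to be stored.

```python
import hashlib

BLOCK_SIZE = 1 << 20  # illustrative; the real block size is configurable

def changed_blocks(path, old_hashes):
    """Yield (index, data) for each block of the file that differs from
    the previous backup, whose per-block md5 sums are in `old_hashes`."""
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(BLOCK_SIZE)
            if not data:
                break
            digest = hashlib.md5(data).hexdigest()
            if index >= len(old_hashes) or old_hashes[index] != digest:
                yield index, data  # new or modified block: must be stored
            index += 1
```

For a multi-gigabyte image file in which only a few blocks changed, this reduces the work from copying or compressing the whole file to handling just those blocks.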

Heinz-Josef Claes 2014-04-20