Saving Image Files / raw Devices / Blocked Files

The scope of blocked files

Saving big image files completely with every backup, although only parts of them change, is inefficient: it consumes a lot of time and space. Examples used later in this section are VMware images, Outlook .pst files and TrueCrypt images.

How it works

If you specify that a file is to be saved as a blocked file (see below how to do this), storeBackup.pl will do the following:

[Figure: blockedFile]
  1. Create a directory in the backup with the same path and file name as the original image file in the source directory.
  2. Split the source file into blocks and check whether any of these blocks already exists anywhere in a backup (see option otherBackupSeries of storeBackup.pl). If a block already exists, a hard link is generated; if it does not exist, the block is copied or stored compressed.
  3. The MD5 sums of all these blocks are stored in a special file called .md5BlockCheckSums.bz2 in that directory.
  4. storeBackup.pl also calculates the MD5 sum of the whole file and stores it in .md5CheckSum.
Because references to existing files are realized via hard links, every backup is a full backup.
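
The steps above can be illustrated with a minimal Python sketch. It is not storeBackup's implementation: the function name store_blocked, the dictionary pool (standing in for the lookup of blocks in existing backups), the numbered block file names and the plain-text checksum files are assumptions made for this illustration, and compression is left out.

```python
import hashlib
import os


def store_blocked(src_path, dest_dir, pool, block_size=1024 * 1024):
    """Split src_path into blocks inside dest_dir, hard-linking known blocks.

    `pool` maps an MD5 hex digest to the path of a block that already
    exists in some backup (hypothetical stand-in for storeBackup's lookup).
    Returns the list of per-block digests in file order.
    """
    os.makedirs(dest_dir, exist_ok=True)
    whole = hashlib.md5()
    digests = []
    with open(src_path, "rb") as src:
        index = 0
        while True:
            block = src.read(block_size)
            if not block:
                break
            whole.update(block)
            digest = hashlib.md5(block).hexdigest()
            # Hypothetical block naming; the real naming scheme may differ.
            target = os.path.join(dest_dir, f"{index:08d}")
            if digest in pool:
                os.link(pool[digest], target)  # known content: hard link
            else:
                with open(target, "wb") as out:  # new content: copy
                    out.write(block)
                pool[digest] = target
            digests.append(digest)
            index += 1
    # Hypothetical plain-text checksum files (not the real file formats).
    with open(os.path.join(dest_dir, "block_md5.txt"), "w") as f:
        f.write("\n".join(digests) + "\n")
    with open(os.path.join(dest_dir, "file_md5.txt"), "w") as f:
        f.write(whole.hexdigest() + "\n")
    return digests
```

Calling store_blocked twice with the same pool, once per backup run, leaves unchanged blocks as hard links to the first backup while each backup directory still contains every block.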

If you use the option lateLinks, the links will be set later. If you also use the option lateCompress, the compression will also be done later.

How to save image files

There are two ways to configure which files storeBackup.pl should treat as blocked files:

  1. The easiest way is using the following options:
    checkBlocksSuffix
    The configuration is similar to exceptSuffix: a list of suffixes which are checked for a match, e.g., \.vmdk for VMware images. A file matches if the last part of its name matches one of the suffixes you define here.
    The next options described here are only used if checkBlocksSuffix is set.
    checkBlocksMinSize
    Only files with this minimum size will be treated as blocked files. You can use the same shortcuts as described in defining rules, e.g., 50M means 50 megabytes. The default value is 100M.
    checkBlocksBS
    Defines the size of the blocks into which storeBackup.pl splits the matching files. The format is the same as for checkBlocksMinSize. The default value is 1M. The minimal value is 10k.
    checkBlocksCompr
    Defines whether the blocks are compressed. Possible values are yes, no or check. On the command line, use --checkBlocksCompr.
    This flag only affects files selected with checkBlocksSuffix.
    Example:
    You want to back up all your VMware images and also some Outlook .pst files. storeBackup will use the blocked file feature for files with a minimum size of 50 megabytes ending with .vmdk or .pst. The block size chosen is 500k and the resulting blocks in the backup will be compressed:

    checkBlocksSuffix = '\.vmdk' '\.pst'
    checkBlocksMinSize = 50M
    checkBlocksBS = 500k
    checkBlocksCompr = yes
    

  2. The more flexible way to specify the handling of blocked files is to use rules as described in defining rules. The following options are available five times, so there is a checkBlocksRule0, checkBlocksRule1, checkBlocksRule2, checkBlocksRule3 and checkBlocksRule4:
    checkBlocksRulei
    The ith rule specifying files to treat as blocked files in the backup.
    checkBlocksBSi
    The corresponding block size for the blocks in the backup. The default value is 1 megabyte. The minimal value is 10k.
    checkBlocksCompri
    If set to yes, the blocks will be compressed. If set to no, they will not be compressed. If set to check, storeBackup decides by itself whether to compress each block. This may result in a mix of compressed and copied blocks.
    checkBlocksReadi
    Defines a filter for reading the specified file, e.g., gunzip or gzip -d. This option may be useful if you have to save an already compressed image file. (Using the "blocked file" feature of storeBackup on files that are already compressed as a whole does not make sense.)
    Example:
    Let's assume you have a TrueCrypt image on your disk and want to back it up each time you start storeBackup.pl. You chose the unremarkable name myPics.iso; the block size is 1M, without compression. So you define rule 0:

    checkBlocksRule0= '$file =~ m#/myPics\.iso$#'
    #checkBlocksBS0=
    #checkBlocksCompr0=
    checkBlocksRule1= '$size > &::SIZE("50M")' and
            ( '$file =~ m#\.pst$#' or '$file =~ m#windows_D/Outlook/#' )
    checkBlocksBS1=200k
    checkBlocksCompr1=check
    

    You also defined rule 1, which matches all files bigger than 50 megabytes that end with .pst or are located in the relative path windows_D/Outlook/ in the backup. (I'm using this to back up the data of my dual boot laptop.) If you are not familiar with rules in storeBackup, you should read section 7.4.

You can use checkBlocksSuffix and checkBlocksRulei at the same time in one configuration file. storeBackup evaluates checkBlocksRulei first (in ascending order of i) and then checkBlocksSuffix.
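
The evaluation order can be sketched in Python as follows. This is only an illustration under stated assumptions: the names parse_size and block_settings are hypothetical, the rules are modelled as Python predicates instead of storeBackup's rule syntax, the suffixes are treated as regular expressions anchored at the end of the name, and the shortcuts k, M and G are assumed to be binary multiples.

```python
import re

# Assumption: size shortcuts are binary multiples (1k = 1024 bytes).
UNITS = {"k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert strings such as '50M' or '500k' to a number of bytes."""
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def block_settings(path, size, rules, suffixes, min_size="100M", suffix_bs="1M"):
    """Return the block size in bytes for `path`, or None for a normal file.

    `rules` is a list of (predicate, block_size) pairs standing in for
    checkBlocksRule0..4 and checkBlocksBS0..4, tried in ascending order.
    The suffix list is consulted only if no rule matched.
    """
    for predicate, bs in rules:
        if predicate(path, size):
            return parse_size(bs)
    if size >= parse_size(min_size):
        for pattern in suffixes:
            if re.search(pattern + "$", path):
                return parse_size(suffix_bs)
    return None


# The two rules and the suffix list from the examples in this section.
rules = [
    (lambda p, s: p.endswith("/myPics.iso"), "1M"),
    (lambda p, s: s > parse_size("50M")
        and (p.endswith(".pst") or "windows_D/Outlook/" in p), "200k"),
]
suffixes = [r"\.vmdk", r"\.pst"]
```

With these definitions, a 60 megabyte .pst file is handled by rule 1 (200k blocks) and never reaches the suffix check, while a 60 megabyte .vmdk file matches no rule and falls through to the suffix settings.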

How to save mass storage devices

Backing up a mass storage device (like /dev/sdc or /dev/sdc1) works in the same way as saving an image file with storeBackup. You choose the device(s) with checkDevicesi, the block size in the backup with checkDevicesBSi and switch compression on or off with checkDevicesCompri. Additionally, you have to specify with checkDevicesDiri the relative path in the backup where the contents of the devices will be stored.

The blocks in the backup resulting from image files or devices are hard linked if storeBackup finds the same contents.

The options are in detail:

checkDevicesi
List of devices (e.g., /dev/sdd2 /dev/sde1) to backup.
checkDevicesDiri
Directory in the backup where the devices are stored (relative path). The image file will also be restored to that directory if you restore the backup with storeBackupRecover.pl (using default parameters). In this directory, storeBackup creates a subdirectory whose name is generated from the parameters of option checkDevices, e.g., /dev/sdc results in dev_sdc.
checkDevicesBSi
Defines the size of the blocks into which storeBackup.pl splits the specified devices. The format is the same as for checkBlocksMinSize. The default value is 1M. The minimal value is 10k.
checkDevicesCompri
Defines whether the blocks are compressed. Possible values are yes, no or check; the default value is no.
This option only affects devices selected with checkDevicesi. If you set this option to check, storeBackup decides for every block whether to compress it.
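
To make the resulting location in the backup concrete, here is a small Python sketch of the naming described for checkDevicesDiri. The only documented case is /dev/sdc becoming dev_sdc; the general rule used here (drop the leading slash, replace the remaining slashes by underscores) and the function names are assumptions for illustration.

```python
import posixpath


def device_subdir(device):
    """Derive a subdirectory name from a device path (assumed rule)."""
    return device.strip("/").replace("/", "_")


def device_backup_path(check_devices_dir, device):
    """Relative path in the backup where the blocks of `device` would go."""
    return posixpath.join(check_devices_dir, device_subdir(device))
```

For example, with a hypothetical checkDevicesDiri value of devices, the device /dev/sdc would end up below devices/dev_sdc.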

Choosing the block size

There is no fixed rule for the "best" block size. I made some measurements of block size versus used space. The second backup was done with lateLinks (see section 7.6), so I could use df again to see how much space was really needed. The file system used was reiserfs with tail packing. If you use a file system without tail packing (like ext2, ext3 or ext4), the overhead will be bigger and small block sizes are less attractive (the same applies if you use compression). The results also depend on the application writing to your source image file.
All examples were done without compression (for performance reasons) and with real data. Naturally, I use compression in my real backups. The row "2. backup" shows the space needed for the changed data. The percentage row below it shows the size of the second backup relative to the first one. The row "sum" shows the sum of the first and second backup; the next row (1x) shows that sum relative to the sum for 5M (5 megabyte) blocks. The last row (10x) shows the same ratio for the size of the first backup plus 10 times the second one (extrapolating to 10 backups), so it should be the most interesting value.

The first example shows the results when storing a big Outlook.pst file of 1.2GB with the changes I had from one day to the other:

BlockSize             50k     100k     200k       1M       5M
1. backup [kB]    1219253  1172263  1172863  1173801  1173724
2. backup [kB]       7692    13445    22720    73826   240885
2. / 1. backup      0.63%    1.15%    1.94%    6.29%   20.52%
sum [kB]          1226945  1185708  1195583  1247627  1414609
1x                 86.73%   83.82%   84.52%   88.20%  100.00%
10x                36.18%   36.47%   39.08%   53.37%  100.00%
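
The derived rows of this table can be recomputed from the two measured rows. The following Python sketch does this for the values above; the function name table_rows and the dictionary layout are chosen for this illustration only.

```python
def table_rows(first, second):
    """Recompute the derived rows from the sizes (in kB) of two backups.

    `first` and `second` map a block-size label to the space used by the
    first and second backup; the last key is the reference column (5M).
    """
    def pct(part, whole):
        return round(100.0 * part / whole, 2)

    ref = list(first)[-1]
    rows = {}
    for bs in first:
        total = first[bs] + second[bs]
        rows[bs] = {
            "second/first": pct(second[bs], first[bs]),
            "sum": total,
            # sum relative to the sum of the reference column
            "1x": pct(total, first[ref] + second[ref]),
            # first backup plus ten times the second, same reference
            "10x": pct(first[bs] + 10 * second[bs],
                       first[ref] + 10 * second[ref]),
        }
    return rows


first = {"50k": 1219253, "100k": 1172263, "200k": 1172863,
         "1M": 1173801, "5M": 1173724}
second = {"50k": 7692, "100k": 13445, "200k": 22720,
          "1M": 73826, "5M": 240885}
rows = table_rows(first, second)
```

The same function can be fed with your own df measurements to compare block sizes for your data.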

The second example was done with a smaller Outlook file of 117 megabytes. This is the one for the input folder. The numbers show a different behavior than in the first example.

BlockSize             50k     100k     200k       1M       5M
1. backup [kB]     122487   118221   118891   119184   119181
2. backup [kB]      33400    51240    74424   107632   119181
2. / 1. backup     27.27%   43.34%   62.60%   90.31%  100.00%
sum [kB]           155887   169461   193315   226816   238362
1x                 65.40%   71.09%   81.10%   95.16%  100.00%
10x                34.82%   48.10%   65.84%   91.19%  100.00%

The third example shows the results when storing a VMware image of 2.1 GB. Between the first and the second backup, the VM was booted, a program for updating my navigation system was updated, and I connected the navigation system to update it as well.

BlockSize             50k     100k     200k       1M       5M
1. backup [kB]    2162595  2106781  2112547  2117178  2117094
2. backup [kB]      53656    80609   131701   438241  1112652
2. / 1. backup      2.48%    3.83%    6.23%   20.70%   52.56%
sum [kB]          2216251  2187390  2244248  2555419  3229746
1x                 68.62%   67.73%   69.49%   79.12%  100.00%
10x                20.38%   21.99%   25.90%   49.08%  100.00%

In all these examples you can see in the last line that, at some point, smaller block sizes no longer reduce the space needed. The optimum seems to lie between 50k and 200k (when using tail packing).

There is one additional important aspect about the block size: If you choose a small block size, the performance will also go down. To be able to achieve acceptable performance, the following optimizations are implemented:

It is best to make your own tests to get a feeling for useful block sizes in your use cases.
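
One way to start such tests without running full backups is to estimate how many bytes of a changed file consist of blocks that did not occur in the previous version. The Python sketch below does this for two copies of a file; the function names are hypothetical, and the estimate ignores file system overhead, compression and blocks shared with other files, so it is only a rough guide.

```python
import hashlib


def block_digests(path, block_size):
    """Yield (md5 digest, length) for each block of the file at `path`."""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                return
            yield hashlib.md5(block).hexdigest(), len(block)


def new_bytes(old_path, new_path, block_size):
    """Bytes of `new_path` in blocks that occur nowhere in `old_path`."""
    known = {digest for digest, _ in block_digests(old_path, block_size)}
    total = 0
    for digest, length in block_digests(new_path, block_size):
        if digest not in known:
            total += length
            known.add(digest)  # a repeated new block is counted only once
    return total
```

Calling new_bytes for the same pair of files with block sizes such as 50k, 200k and 1M (in bytes) gives a first impression of how the space for the second backup grows with the block size.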

Heinz-Josef Claes 2014-04-20