Hard links

Ben Escoto bescoto@stanford.edu
Sun, 10 Mar 2002 00:28:15 -0800

Recently a user requested that rdiff-backup preserve hard links.  It
seems that, in general, keeping the full disk space savings of rdiff
is incompatible with keeping those of hard links.  For example,
consider this change:

    A    B
1   F1   F1
2   F2   F3

where A and B are filenames, 1 and 2 are states at a certain time, and
Fx is a file (individuated by inode).  So the above means that we had
a file which was hard linked into two places (A and B) at the time of
one backup.  Later, A and B diverged and now refer to two different
files.  So under the current diffing system, the increments directory
will hold, after run 2, a diff F2->F1 and a diff F3->F1.  The space
saving benefits of hard linking have disappeared.

    So if hard links are to be supported, I think that any hardlinked
file should only be snapshotted, and never diffed.  So in the above
case, the increments directory would then just contain one copy of F1
(hardlinked itself into two places) and no diffs.  Of course, this
isn't a free lunch, because if F2 or F3 is similar to F1, we have
wasted some disk space.  But I'm guessing that the hardlinking benefit
outweighs the only-snapshot disadvantage, for the people who want
hard linking.
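
    As a minimal sketch of this rule, assuming hard-linked files are
recognized by a link count above one: the helper name
choose_increment_type is hypothetical and is not part of rdiff-backup.

```python
import os
import stat


def choose_increment_type(path):
    """Return "snapshot" for hard-linked regular files, else "diff".

    Hypothetical helper for illustration only; not rdiff-backup code.
    """
    st = os.lstat(path)
    # st_nlink counts the directory entries that point at this inode.
    if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
        return "snapshot"
    return "diff"
```

Because the decision uses only the file's own stat result, it can be
made the first time the file is seen, without a second pass.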

    But what, you ask, about cases like:

    A    B
1   F1   F1
2   F2   F2

In these cases, we can have our cake and eat it too, by producing one
diff F2->F1, and hard linking it into two places in the increments
directory.  This is true, but I don't want rdiff-backup to make
multiple passes.  In a single pass, rdiff-backup won't know whether to
snapshot or diff the first file until after it encounters the second
file.  So I think one-pass is incompatible with handling this case in
the ideal way.

    Hardlink preservation is also incompatible with the current
no-data-files property of rdiff-backup.  Consider this case:

    A    B
1   F1   F1
2   F1   --

Filename A won't have a counterpart in the increments directory,
because it hasn't changed.  File B, on the other hand, should be
snapshotted.  But how do we signal that B was hardlinked to A?  I
want to avoid any hardlinking between the mirror directory and the
increments directory, because it makes sense to put these on different
volumes, and I don't think it would be too hard even to modify
rdiff-backup to put the increments directory on a different computer
using the current ssh-piping scheme.

    So it seems the only way to mark the link between old B and old A
is to write this down in an rdiff-backup specific file.  I'd imagine
that the file would consist of a big dictionary, the values of which
would be lists of filenames which were hardlinked together.  Each
list would have a key for each of its members, e.g. list = ['a','b'],
dictionary = {'a':list, 'b':list}.  To keep things simple I think a
new dictionary could just be written at each backup.  If this takes up
too much disk space later on, some diffing scheme could be added so
older dictionaries can be recovered from the newest ones plus diffs.
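
    A rough sketch of how such a dictionary might be built in one walk
of the source tree.  The function name build_hardlink_dict and the
choice to key on (st_dev, st_ino) are assumptions for illustration, not
existing rdiff-backup code, and the on-disk file format is left open.

```python
import os
import stat
from collections import defaultdict


def build_hardlink_dict(root):
    """Map each hard-linked filename under root to its shared name list.

    Hypothetical helper for illustration only; not rdiff-backup code.
    """
    by_inode = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                # Names with the same device and inode are the same file.
                key = (st.st_dev, st.st_ino)
                by_inode[key].append(os.path.relpath(path, root))

    links = {}
    for names in by_inode.values():
        # A lone name means the other links live outside root; skip it.
        if len(names) > 1:
            for name in names:
                links[name] = names  # every member shares one list object
    return links
```

Every member of a group points at the same list object, matching the
{'a':list, 'b':list} layout described above.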

    To summarize:  If a file is hardlinked, don't use rdiff and just
take snapshots.  When preserving hardlinks, just write a big data file
each backup which lists all the files which are hardlinks and what
they are hardlinked to.

Does this seem like the right way to handle hardlinks?  Did I miss any
possibilities above?  Would hardlink support still be useful given
these limitations?  Thanks for any input.

Ben Escoto
