Safety of the lzip format

Abstract

There is little information about the safety of compressed file formats. Therefore, the designers of new compressed formats tend to make arbitrary decisions with respect to safety, ranging from using overkill check sequences "just in case" to making integrity checking optional. This article measures the safety of lzip's integrity checking and explains why lzip achieves a high accuracy in the detection of errors. The results show that the lzip format is safe enough to be used in safety-critical avionics systems while maintaining high availability. Lzip is a top choice for reliable transmission, distribution, and archiving of important data.

Disclosure statement: The author of this article is also the author of the lzip format.

Acknowledgements: The author would like to thank Nissanka Gooneratne for the thousands of hours he has spent testing lzip and for his help in the improvement of lziprecover. Most of the source tarballs used to test lzip were downloaded from the Slackware repository and from the GNU FTP site.

This page is a work in progress. It will be updated from time to time with new data in order to refine the results. Please report any errors or inaccuracies to the author at the lzip mailing list lzip-bug@nongnu.org or in private at antonio@gnu.org.

1 Introduction

Exert yourself in your job as if on each detail you think, on each word you say, on each piece you place, on each hammer blow you give, the salvation of humankind depended. Because it does depend, believe me.
-- Joan Maragall (Eulogy of living)

Our civilization depends critically on software; it had better be quality software.
-- Bjarne Stroustrup

Lzip is an advanced compressed format offering useful features like efficient parallel compression and decompression, unlimited index size, and random access to any member in a file. The simplicity of lzip is the result of a careful design process, because using a simple and well designed format maximizes the probability of being able to access the data in the future.

The most important task of a decompressor is producing the correct decompressed data from valid compressed data. Another task just as important is warning the user when the correct decompressed data can't be produced (for example because the compressed data are corrupt). Failing to perform this second task causes undetected corruption, which may cause irreparable damage. Because of this, one of the goals of the lzip format is to make undetected corruption nearly impossible.

Additionally, the recovery capabilities of lziprecover are possible because of the highly accurate integrity checking of the lzip format. The lzip format improves the probability of recovering from corruption without any special measures beyond standard backup practices.

Unfortunately, integrity checking is sometimes neglected by designers of compressed formats in order to gain speed. For example, the format of the POSIX tool 'compress' does not include a check sequence, and integrity checking is optional in formats like xz or zstd. The use of such formats for short-term operations may be acceptable under some circumstances, but using them for long-term archiving would be foolish.

This article measures the safety of lzip's integrity checking and explains why lzip achieves such a high accuracy in the detection of errors. (False negatives are practically zero, and false positives are minimized).

Since 2008 I have tested lzip with unzcrash and have published some results, for example here. This article presents the latest results. Millions of test decompressions performed since 2015 on files compressed with lzip 1.16 or newer are examined, and the interaction between LZMA compression and CRC32 in the detection of errors is analyzed. The results show that the lzip format is safe enough to be used in safety-critical avionics systems while maintaining high availability.

2 Significance of these tests

Supposing that the lzip files being tested here were single-member files with an average compressed size of 1 MiB (256 sectors) and an average uncompressed size of 6 MB, that they were read from a hard drive with a sector size of 4096 bytes and an uncorrectable error rate of one read error per 1e14 bits (1.25e13 bytes) read, and that the decompression rate were 60 MB/s of decompressed data, we would get the following equivalences:

These numbers seem large, but a format adequate to become the standard compressed format for long-term archiving needs to be very safe and extensively tested. 1000 million computers decompressing just 17 files per day, from drives about as reliable as the one described above, would find one corrupt file per minute. Corruption in downloaded files is much more frequent than that.
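
The arithmetic behind this claim can be checked directly from the drive and file parameters assumed above. The following C++ sketch (an illustration written for this article, not part of lzip) prints the expected number of read errors per day and the average time between corrupt files:

#include <cstdio>

int main()
{
  const double computers = 1e9;                  // 1000 million computers
  const double files_per_day = 17;               // decompressed files per computer per day
  const double file_size = 1048576;              // 1 MiB compressed per file
  const double bytes_per_read_error = 1.25e13;   // one read error per 1e14 bits read

  const double bytes_read_per_day = computers * files_per_day * file_size;
  const double read_errors_per_day = bytes_read_per_day / bytes_per_read_error;
  const double minutes_between_errors = ( 24 * 60 ) / read_errors_per_day;

  std::printf( "corrupt files per day = %.0f\n", read_errors_per_day );            // about 1426
  std::printf( "minutes between corrupt files = %.2f\n", minutes_between_errors ); // about 1.01
  return 0;
}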

3 Accuracy of error detection is a tradeoff

"There can be safety tradeoffs with the addition of an error-detection scheme. As with almost all fault tolerance mechanisms, there is a tradeoff between availability and integrity. That is, techniques that increase integrity tend to reduce availability and vice versa. Employing error detection by adding a check sequence to a dataword increases integrity, but decreases availability. The decrease in availability happens through false-positive detections. These failures preclude the use of some data that otherwise would not have been rejected had it not been for the addition of error-detection coding". ([Koopman], p. 33).

As larger check sequences always increase the number of false positives, the optimal accuracy in the detection of errors is achieved by using the smallest check sequence that reduces the number of false negatives below the desired value. This value is frequently expressed as the mean time between false negatives (MTBF) for a given working rate. Once the MTBF reaches a large enough value (e.g., the 1e9 hours, about 114_079 years, set by FAA AC 25.1309-1A, page 10, for extremely improbable failure conditions), it does not make sense to continue increasing the size of the check sequence.

Lzip achieves a high accuracy of error detection by keeping the format small and simple. The lzip format is formed by the compressed data, preceded by a header containing the parameters needed for decompression, and followed by a trailer containing integrity information. This design minimizes both overhead and false positives. The lzip format is free from any source of false positives beyond the header and trailer.

The probability of a bit flip in a lzip file resulting in a false positive is equal to 26 * number_of_members / file_size, where file_size is measured in bytes and 26 is the combined size in bytes of the header and trailer of each member. One good thing about the false positives reported by lzip is that, because of its 3-factor integrity checking (CRC32, uncompressed data size, and member size), in most cases they can be identified as false positives, and the data can be fully recovered without special tools.
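
As a quick illustration of the formula above, the following C++ sketch computes this probability for a hypothetical single-member file of 1 MiB compressed; the member count and file size used here are assumptions, not measurements:

#include <cstdio>

int main()
{
  // Hypothetical single-member file; the 26 bytes of header plus trailer per
  // member are the only source of false positives in the lzip format.
  const double members = 1;
  const double file_size = 1048576;             // compressed size in bytes (assumed)
  const double p_false_positive = 26 * members / file_size;
  std::printf( "P(bit flip causes a false positive) = %.3e\n", p_false_positive );  // about 2.5e-5
  return 0;
}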

In any case, it is important to always check the integrity of the decompressed data because decompression is not a trivial task and the check not only guards against corruption in the compressed file, but also against memory errors, undetected bugs in the decompressor, etc.

3.1 Why CRC32 instead of CRC32-C or CRC64?

Because a CRC32 offers by itself an MTBF longer than 1e9 hours for a decompression rate of up to 4 corrupt files per hour, which is perfectly adequate for most uses. The CRC32-C (Castagnoli) is slightly better for small uncompressed sizes than the CRC32 (IEEE 802.3 Ethernet) used by lzip, but for uncompressed sizes between 2 and 4 GiB, the CRC32 used by lzip is slightly better. Here "better" means "has a larger Hamming distance". Moreover, the slight advantage of CRC32-C for small sizes is probably nullified by the error multiplication of decompression.
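
The MTBF figure quoted above follows from MTBF = 1 / (Pudc * rate). The following C++ sketch (illustrative only) reproduces it for a Pudc of 2^-32 and a rate of 4 corrupt files per hour:

#include <cstdio>

int main()
{
  const double pudc = 1.0 / 4294967296.0;        // 2^-32, about 2.328e-10
  const double corrupt_files_per_hour = 4;
  const double mtbf_hours = 1 / ( pudc * corrupt_files_per_hour );
  std::printf( "MTBF = %.3e hours\n", mtbf_hours );   // about 1.074e9 hours, above 1e9
  return 0;
}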

The check sequence (for example a CRC) is not some pixie dust that magically detects corruption in the data. It is also data, and therefore it may also become corrupt, causing the unfortunate situation called "false positive", where the real data is intact but the decoder discards it as if it were corrupt. CRC64 is overkill for lzip because it doubles the probability of a false positive compared with CRC32 without significantly reducing the (already insignificant in the case of lzip) probability of a false negative.

4 Interaction between compression and check sequence

Verification of data integrity in compressed files is different from other cases (like Ethernet packets) because the data that can become corrupted are the compressed data, but the data that are verified (the dataword) are the decompressed data. Decompression can cause error multiplication; even a single-bit error in the compressed data may produce an arbitrary number of errors in the decompressed data, or even modify the size of the decompressed data.

Because of the error multiplication caused by decompression, the error model seen by the check sequence is one of unconstrained random data corruption. (Remember that the check sequence verifies the integrity of the decompressed data). This means that the choice of error-detection code (CRC or hash) is largely irrelevant, and that the probability of an error being undetected by the check sequence (Pudc) is 1 / (2^n) for a check sequence of n bits. (See [Koopman], p. 5). Note that if some errors do not produce error multiplication, a CRC is then preferable to a hash of the same size because of the burst error detection capabilities of the CRC.

Decompression algorithms are usually able to detect some errors in the compressed data (for example a backreference to a point before the beginning of the data). Therefore, if the errors not detected by the decoder are not "special" for the check sequence (neither guaranteed to be caught nor guaranteed to be missed), then the total probability of an undetected error (Pud = false negative) is the product of the probability of the error being undetected by the decoder (Pudd) and the probability of the error being undetected by the check sequence (Pudc): Pud = Pudd * Pudc.

It is also possible that a small error in the compressed data does not alter the decompressed data at all. Therefore, for maximum availability, only the decompressed data should be tested for errors. Testing the compressed data beyond what is needed to perform the decompression increases the number of false positives much more than it can reduce the number of undetected errors.

4.1 Interaction between LZMA compression and CRC32

The LZMA data stream provides embedded error detection. Any distance larger than the dictionary size acts as a forbidden symbol, allowing the decoder to detect the approximate position of errors, and leaving very little work for the check sequence in the detection of errors.

Two kinds of errors not detected by the LZMA decompressor are the alteration of a single byte and the replacement of several occurrences of a particular byte value by a different byte value. (For example, the replacement of some 'a' characters by 'o' characters). The first kind of error is guaranteed to be detected by the CRC. The second kind of error is guaranteed to be detected by the CRC except when the number of bytes replaced is a multiple of 2^32 - 1, which is impossible in files smaller than 4 GiB uncompressed and very improbable in files larger than that. (The same value must be fed to a CRC32 2^32 - 1 times for the CRC value to repeat itself).
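
The first guarantee can be illustrated with a toy program. The C++ sketch below (written for this article; the buffer size and contents are arbitrary assumptions) computes a CRC32 with the IEEE 802.3 polynomial used by lzip over a buffer, alters a single byte, and shows that the CRC changes:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Bitwise CRC32 (IEEE 802.3 polynomial, reflected), the same CRC used by lzip.
uint32_t crc32( const uint8_t * buf, int size )
{
  uint32_t crc = 0xFFFFFFFFU;
  for( int i = 0; i < size; ++i )
    {
    crc ^= buf[i];
    for( int k = 0; k < 8; ++k )
      crc = ( crc & 1 ) ? ( crc >> 1 ) ^ 0xEDB88320U : crc >> 1;
    }
  return crc ^ 0xFFFFFFFFU;
}

int main()
{
  uint8_t data[1000];
  std::memset( data, 'a', sizeof data );        // arbitrary test data
  const uint32_t good_crc = crc32( data, sizeof data );

  data[500] = 'o';                              // alter a single byte
  std::printf( "single-byte alteration detected: %s\n",
               crc32( data, sizeof data ) != good_crc ? "yes" : "no" );
  return 0;
}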

Because of the above, the real Pud of lzip is probably lower (better) than the current estimate based on a model of unconstrained random data corruption.

5 Error models

Two error models have been used to measure the accuracy in the detection of errors. The first model consists of random bit flips in the compressed file, just one bit flip per trial. The second model consists of zeroed 512-byte blocks, simulating a whole-sector I/O error, just one zeroed block per trial. The blocks are tried at every byte position because testing only blocks aligned to a 512-byte boundary would require an impractically large testing corpus. The first model is considered the most important because bit flips are the most difficult errors to detect (in the compressed data), and they happen even in the most expensive hardware [MSL].

Trial decompressions were performed using lziprecover and the 'unzcrash' tool included in the lziprecover package. In October 2020 the option '--unzcrash' was added to lziprecover, allowing it to run the bit flip tests much faster than the 'unzcrash' tool. The exact commands used for the two models are shown below; the first two commands implement the bit flip model, while the third implements the zeroed-block model:

lziprecover --unzcrash file.lz | lzip -9 -o file.lz.1nht.lz
unzcrash -b1 -p7 -s-20 'lzip -t' file.lz 2>&1 | lzip -9 -o file.lz.1nht.lz
unzcrash -B -d1 -p7 -s-24 'lzip -t' file.lz 2>&1 | lzip -9 -o file.lz.512-00d.lz

5.1 Estimated Pud of lzip for single-bit errors

2763 million trial decompressions have been done by running the commands shown in the previous section on real lzip files, mainly downloaded from Slackware and GNU. (See the section Links to test results for listings of the results files). All the errors introduced were detected by the decoder except 79, which were then detected by the CRC.

For the single-bit error model, the Pudd of a LZMA decoder like the one in lzip is estimated from the data above to be about 2.859e-8. (But it increases with file size, so it may be different for other files). This additional detection capability is expected to reduce the Pud by the same factor with respect to the Pudc of the CRC32 alone (about 2.328e-10). (But see Interaction between LZMA compression and CRC32).

Therefore, the estimated Pud for a bit flip in a lzip file, based on the data above, is about 2.859e-8 * 2.328e-10 = 6.656e-18.
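
The following C++ sketch (illustrative only) reproduces the estimates of Pudd, Pudc, and Pud from the raw counts reported above:

#include <cstdio>

int main()
{
  const double trials = 2763e6;           // trial decompressions, one bit flip each
  const double missed_by_decoder = 79;    // errors that reached the CRC
  const double pudd = missed_by_decoder / trials;     // about 2.859e-8
  const double pudc = 1.0 / 4294967296.0;              // 2^-32, about 2.328e-10
  const double pud = pudd * pudc;                      // about 6.656e-18
  std::printf( "Pudd = %.3e  Pudc = %.3e  Pud = %.3e\n", pudd, pudc, pud );
  return 0;
}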

Note that the Pud of lzip can only be roughly estimated because not even one false negative has ever been observed. (And, if the estimate is accurate, no false negative is expected to ever be found by testing files larger than about 100 kB).

Estimating lzip's Pud for single-bit errors on small files (a few kB compressed) is a pending task. Preliminary tests show that about 3 orders of magnitude more errors escape detection by the decoder because of the small size, but almost all of these errors seem to be of the kind that is guaranteed to be detected by the CRC.

5.2 Zeroed block errors

For the zeroed-block error model, the additional detection capability of a LZMA decoder is probably much larger. A LZMA stream is a check sequence in itself, and large errors seem less likely to escape detection than small ones. In fact, we do not try to estimate the Pud of lzip for zeroed blocks because it is too small. The lzip decoder detected the error in all the 85_683_963 trial decompressions run with a zeroed block.

6 Links to test results

These are the listings of the results files produced by lziprecover and unzcrash for all the files tested to date. The files produced by lziprecover are much smaller because they only contain the interesting cases, excluding the errors easily caught by the decoder. The files themselves are not published because they are many, some are large, and all are easily reproducible from the corresponding source file.

7 Conclusions

Lzip achieves optimal accuracy in the detection of errors; false positives are minimized, and false negatives are reduced to an insignificant level. This allows lzip to provide both high safety and high availability, which makes lzip a top choice for reliable transmission, distribution, and archiving of important data.

Lzip's accuracy in the detection of errors can't be significantly improved by using a larger check sequence because the probability of a false negative in lzip is already insignificant. Moreover, the capability of the lzip decoder to detect the approximate position of the error can't be replaced by an increase in the size of the check sequence.

After 12 years of testing, the MTBF of lzip can only be estimated because not even one false negative has ever been observed. If one were to continuously decompress corrupt lzip files of about one megabyte in size (10 decompressions per second), each of them containing the kind of corruption most difficult to detect (one random bit flip), then a false negative would be expected to happen every (1 / Pud) / 10 / 31_556_952 (s/year) = 476 million years. For large tarballs like gcc or linux, the estimated MTBF would be two orders of magnitude longer.

For the kind of files and errors tested here, the lzip format is safe enough to easily meet the safety requirements of safety-critical avionics systems. For decompression rates as high as (1 / Pud) / 1e9 / 3600 = 41_733 corrupt members per second, the probability of a false negative is lower than the catastrophic failure probability limit of 1e-9 per flight hour set by FAA AC 25.1309-1A. Additionally, the simplicity of the lzip format makes it easier for decompressor implementations to meet safety standards for software.
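
Both figures in these conclusions can be reproduced from the estimated Pud. The following C++ sketch (illustrative only) performs the two calculations:

#include <cstdio>

int main()
{
  const double pud = 6.656e-18;              // estimated Pud for a single bit flip
  const double seconds_per_year = 31556952;  // average Gregorian year

  // 10 corrupt files of about 1 MB decompressed per second, one bit flip each
  const double years = ( 1 / pud ) / 10 / seconds_per_year;
  std::printf( "expected years between false negatives = %.0f million\n", years / 1e6 );  // about 476

  // rate that keeps the false negative probability below 1e-9 per hour
  const double members_per_second = ( 1 / pud ) / 1e9 / 3600;
  std::printf( "maximum rate = %.0f corrupt members per second\n", members_per_second );  // about 41733
  return 0;
}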

8 References

9 Glossary


Copyright © 2020 Antonio Diaz Diaz.

You are free to copy and distribute this article without limitation, but you are not allowed to modify it.

First published: 2020-12-27

This page does not use JavaScript.