Safety of the lzip format

Abstract

There is little information about the safety of compressed file formats. Therefore the designers of new compressed formats tend to make random decisions respect to safety, from using overkill check sequences "just in case" to considering integrity checking optional. This article measures the safety of lzip's integrity checking and explains why lzip achieves a high accuracy in the detection of errors. The results show that the lzip format is safe enough to be used in critical safety avionics systems while keeping a high availability. With its simple design and an estimated MTBF of hundreds of millions of years, lzip is an excellent choice for reliable transmission, distribution, and archiving of important data.

Disclosure statement: The author is also author of the lzip format.

Acknowledgements: The author would like to thank Nissanka Gooneratne for the thousands of hours he has spent testing lzip and for his help in the improvement of lziprecover. Most of the source tarballs used to test lzip were downloaded from the Slackware repository and from the GNU FTP site.

This page is a work in progress. It will be updated from time to time with new data in order to refine the results. Please report any errors or inaccuracies to the author at the lzip mailing list lzip-bug@nongnu.org or in private at antonio@gnu.org.

1 Introduction

Exert yourself in your job as if on each detail you think, on each word you say, on each piece you place, on each hammer blow you give, it depended the salvation of the humankind. Because it depends, believe me.
-- Joan Maragall (Eulogy of living)

Lzip is an advanced compressed format offering useful features like efficient parallel compression/decompression, unlimited index size, or random access to any member in a file. The simplicity of lzip is the result of a careful design process, because using a simple and well designed format maximizes the probability of being able to access the data in the future.

The most important task of a decompressor is producing the correct decompressed data from valid compressed data. Other task as important as this one is warning the user when the correct decompressed data can't be produced (for example because the compressed data are corrupt). Failing to perform this second task causes undetected corruption, which may cause irreparable damage. Because of this, one of the goals of the lzip format is making of undetected corruption something almost impossible.

The recovery capabilities of lziprecover (the data recovery tool for the lzip format) are possible because of the highly accurate integrity checking of the lzip format. The lzip format improves the probability of recovering from corruption without any special measures beyond standard backup practices.

Unfortunately, integrity checking is sometimes neglected by designers of compressed formats in order to gain speed. For example, the format of the POSIX tool 'compress' does not include a check sequence, and integrity checking is optional in formats like xz or zstd. The use of such formats for short-term operations may be acceptable under some circumstances, but using them for long-term archiving would be foolish.

This article measures the safety of lzip's integrity checking and explains why lzip achieves such a high accuracy in the detection of errors. (False negatives are practically zero, and false positives are minimized).

Since 2008 I have tested lzip with unzcrash and have published some results, for example here. This article presents the latest results. Thousands of millions of test decompressions performed since 2015 on files compressed with lzip 1.16 or newer are examined, and the interaction between LZMA compression and CRC32 in the detection of errors is analyzed. The results show that the lzip format is safe enough to be used in critical safety avionics systems while keeping a high availability.

2 Significance of these tests

Almost all the packages from the GNU project and the Slackware distribution (plus some files from other sources) have been tested for all possible bit flips and all possible 512-byte sector read errors in the LZMA stream. Almost 1700 files in total. As far as I know, this is the largest controlled test ever run on a compressed format.

If we suppose the following conditions:

The files tested are single-member files with an average compressed size of 1 MiB and an average uncompressed size of 6 MB.
The files are read from a hard drive with a sector size of 4096 bytes and an uncorrectable error rate of one read error per 1e14 bits (1.25e13 bytes).
The decompression rate is of 60 MB/s of decompressed size.

then we get that the 15_476 million test decompressions done to date on files with simulated errors are equivalent to 584.6 million years of continuous normal use of lzip decompressing files read from a drive. Or one million servers decompressing lzip files from hard drives for 584.6 years.

Note that the MTBF measured in this article is for continuous decompression of corrupt lzip files. The MTBF in real use is much longer than the age of the universe because of the multiplying effect of the low error rate of the drive.

These numbers seem large, but a format adequate to become the standard compressed format for long-term archiving needs to be very safe and be extensively tested. 1000 million computers decompressing just 17 files per day, from drives about as reliable as the one described above, would find one corrupt file per minute. Corruption in downloaded files is much more frequent than that.

3 Accuracy of error detection is a tradeoff

"There can be safety tradeoffs with the addition of an error-detection scheme. As with almost all fault tolerance mechanisms, there is a tradeoff between availability and integrity. That is, techniques that increase integrity tend to reduce availability and vice versa. Employing error detection by adding a check sequence to a dataword increases integrity, but decreases availability. The decrease in availability happens through false-positive detections. These failures preclude the use of some data that otherwise would not have been rejected had it not been for the addition of error-detection coding". ([Koopman], p. 33).

As longer check sequences always increase the number of false positives, the optimal accuracy in the detection of errors is achieved by using the shortest check sequence that reduces the number of false negatives below the desired value. This value is frequently expressed as the mean time between false negatives (MTBF) for a given working rate. Once the MTBF reaches a large enough value (e.g., the 1e9 hours, about 114_079 years, set by FAA AC 25.1309-1A, page 15, for extremely improbable failure conditions), it does not make sense to continue increasing the size of the check sequence.

Lzip achieves optimal accuracy in the detection of errors by keeping the format small and simple, and this accuracy can't be improved by using a longer check sequence. The lzip format is formed by the compressed data, preceded by a header containing the parameters needed for decompression, and followed by a trailer containing integrity information. This design minimizes both overhead and false positives. The lzip format is free from any source of false positives beyond the header and trailer.

The probability of a bit flip in a lzip file resulting in a false positive is equal to 26 * number_of_members / file_size. One good thing about the false positives reported by lzip is that, because of its 3 factor integrity checking, in most cases they can be identified as false positives, and the data can be fully recovered without special tools.

In any case, it is important to always check the integrity of the decompressed data because decompression is not a trivial task and the check not only guards against corruption in the compressed file, but also against memory errors, undetected bugs in the decompressor, etc.

3.1 Why CRC32 instead of CRC32-C or CRC64?

Because a CRC32 offers by itself a MTBF longer than 1e9 hours for a decompression rate of up to 4 corrupt files per hour, which is perfectly adequate for most uses. The CRC32-C (Castagnoli) is slightly better for small uncompressed sizes than the CRC32 (IEEE 802.3 Ethernet) used by lzip, but for uncompressed sizes between 256 and 512 MiB, the CRC32 used by lzip is slightly better. Here "better" means "has a larger Hamming distance (HD)". (See [Koopman2]). In any case, the small differences in HD between CRC32 and CRC32-C are nullified by the error multiplication of decompression because the data errors not guaranteed to be detected by the CRC seem to contain a number of flipped bits much larger than the HD of both CRCs. A HD of 3 or 4 makes no difference when the error to be detected contains one hundred flipped bits.

The check sequence (for example a CRC) is not some pixie dust that magically detects corruption in the data. It is also data, and therefore it may also become corrupt, causing the unfortunate situation called "false positive", where the real data is intact but the decoder discards it as if it were corrupt. CRC64 is overkill for lzip because it doubles the probability of a false positive compared with CRC32 without significantly reducing the (already insignificant in the case of lzip) probability of a false negative.

4 Interaction between compression and check sequence

Verification of data integrity in compressed files is different from other cases (like Ethernet packets) because the data that can become corrupted are the compressed data, but the data that are verified (the dataword) are the decompressed data. Decompression can cause error multiplication; even a single-bit error in the compressed data may produce any random number of errors in the decompressed data, or even modify the size of the decompressed data.

Because of the error multiplication caused by decompression, the error model seen by the check sequence is one of unconstrained random data corruption. (Remember that the check sequence verifies the integrity of the decompressed data). This means that the choice of error-detection code (CRC or hash) is largely irrelevant, and that the probability of an error being undetected by the check sequence (Pudc) is 1 / (2^n) for a check sequence of n bits. (See [Koopman], p. 5). Note that if some errors do not produce error multiplication, a CRC is then preferable to a hash of the same size because of the burst error detection capabilities of the CRC.

Decompression algorithms are usually able to detect some errors in the compressed data (for example a backreference to a point before the beginning of the data). Therefore, if the errors not detected by the decoder are not "special" for the check sequence (neither guaranteed to be caught nor guaranteed to be missed), then the total probability of an undetected error (Pud) is the product of the probability of the error being undetected by the decoder (Pudd) and the probability of the error being undetected by the check sequence (Pudc):

Pud = Pudd * Pudc

It is also possible that a small error in the compressed data does not alter at all the decompressed data. Therefore, for maximum availability, only the decompressed data should be tested for errors. Testing the compressed data beyond what is needed to perform the decompression increases the number of false positives much more than it can reduce the number of undetected errors.

4.1 Interaction between LZMA compression and CRC32

The LZMA data stream provides embedded error detection. Any distance larger than the dictionary size acts as a forbidden symbol, allowing the decoder to detect the approximate position of errors, and leaving little work for the check sequence in the detection of errors.

The effects of the errors not detected by the LZMA decoder on the decompressed output can be classified in four types: the alteration of a single byte, a burst error spawning four bytes or less, the replacement of several occurrences of a concrete byte value by a different concrete byte value (for example the replacement of some 'a' characters by 'o' characters), and random corruption. The first two types of errors are guaranteed to be detected by the CRC. The third type of error is guaranteed to be detected by the CRC except when the number of bytes replaced is a multiple of 2^32 - 1, which is impossible in files smaller than 4 GiB - 1 byte uncompressed and very improbable in files larger than that. (The same value must be fed to a CRC32 2^32 - 1 times for the CRC value to repeat itself).

Because of the above, the real Pud of lzip is lower (better) than what a model of unconstrained random data corruption would predict.

5 Error models

Two error models have been used to measure the accuracy of lzip in the detection of errors. The first model consists of random bit flips in the compressed file. Just one bit flip per trial. The second model consists of zeroed 512-byte blocks, simulating a whole sector I/O error. Just one zeroed block per trial. The blocks are tried at every byte position because testing only blocks aligned to a 512-byte boundary would require a testing corpus too large. The first model is considered the most important because bit flips in the compressed data are the most difficult to detect by the LZMA decoder, and they happen even in the most expensive hardware [MSL].

Trial decompressions were performed using lziprecover and the 'unzcrash' tool included in the lziprecover package. In October of 2020 the option '-U, --unzcrash' was added to lziprecover, allowing it to run the tests much faster than the 'unzcrash' tool. The exact commands used for the two models are:

Single-bit errors:

lziprecover -U1 file.lz | lzip -9 -o file.lz.1nht.lz
unzcrash -b1 -p7 -s-20 'lzip -t' file.lz 2>&1 | lzip -9 -o file.lz.1nht.lz

Zeroed block errors:

lziprecover -UB512 file.lz | lzip -9 -o file.lz.512-00d.lz
unzcrash -B -d1 -p7 -s-24 'lzip -t' file.lz 2>&1 | lzip -9 -o file.lz.512-00d.lz

5.1 Estimated Pud of lzip for single-bit errors

13_777 million trial decompressions have been performed by running the commands shown in the previous section on real lzip files. (See section Links to test results for listings of the files tested). All those errors were detected by the decoder except 384, which were then detected by the CRC. Of those, 114 were of types guaranteed to be caught by the CRC (9 single-byte errors, 3 bursts, and 102 replacements), leaving 270 events of random data corruption not guaranteed to be caught by the CRC. (See Interaction between LZMA compression and CRC32).

For the single-bit error model, the Pudd of a LZMA decoder like the one in lzip is estimated from the data above (excluding the errors guaranteed to be caught by the CRC) to be about 1.9598e-8. (But it increases with file size, so it may be different for other files). This additional detection capability is supposed to reduce the Pud by the same factor respect to the Pudc of the CRC32 alone (about 2.3283e-10). Therefore, the estimated Pud for a bit flip in a lzip file, based on the data above, is of about 1.9598e-8 * 2.3283e-10 = 4.563e-18, which corresponds to a MTBF of 694 million years.

Note that the Pud of lzip can only be roughly estimated because not even one false negative has ever been observed. (And, if the estimate is accurate, it is not expected to ever find one in files larger than about 100 kB by testing).

Estimating lzip's Pud for single-bit errors on small files (a few kB compressed) is a pending task. Preliminary tests show that about 3 orders of magnitude more errors escape detection by the decoder because of the small size, but almost all of these errors seem to be of the kind that is guaranteed to be detected by the CRC.

5.2 Zeroed block errors

For the zeroed-block error model, the additional detection capability of a LZMA decoder is probably much larger than for a bit flip. A LZMA stream is a check sequence in itself, and large errors seem less probable to escape detection than small ones. In fact, we do not try to estimate the Pud of lzip for zeroed blocks because it is too small. The LZMA decoder detected the error in all the almost 1700 million trial decompressions run with a zeroed block, which suggests a lower bound for the MTBF longer than the age of the universe.

6 Links to test results

These are the listings of the results files produced by lziprecover and unzcrash for all the files tested to date. The files produced by lziprecover are much smaller because they only contain the interesting cases, excluding the errors easily caught by the decoder. The files themselves are not published because they are many, some are large, and all are easily reproducible from the corresponding source file.

7 Conclusions

Lzip achieves optimal accuracy in the detection of errors; false positives are minimized, and false negatives are reduced to an insignificant level. This allows lzip provide both high safety and high availability, which makes of lzip an excellent choice for reliable transmission, distribution, and archiving of important data.

After 14 years of testing, the MTBF of lzip can only be estimated because not even one false negative has ever been observed. If one were to continuously decompress corrupt lzip files of about one megabyte in size (10 decompressions per second), each of them containing the kind of corruption most difficult to detect (one random bit flip), then a false negative would be expected to happen every (1 / Pud) / 10 / 31_556_952(s/year) = 694 million years.

For the kind of files and errors tested here, the lzip format is safe enough as to easily meet the safety requirements of critical safety avionics systems. For decompression rates as high as (1 / Pud) / 1e9 / 3600 = 60_875 corrupt members per second, the probability of a false negative is lower than the catastrophic failure probability limit of 1e-9 per flight hour set by FAA AC 25.1309-1A. Additionally, the simplicity of the lzip format makes it easier for decompressor implementations to meet safety standards for software.

8 References

9 Glossary

^: Exponentiation symbol.
Accuracy: Freedom from mistakes. Is defined as 1 - inaccuracy.
Bit flip: Error that inverts a single bit value.
Burst Error: A set of bit errors contained within a span shorter than the length of the codeword. The burst length is the distance between the first and last bit errors. Zero or more of the intervening bits may be erroneous.
Check sequence: An error-detecting code value stored with associated data. A dataword plus a check sequence makes up a codeword.
Codeword: A dataword combined with a check sequence.
Cyclic Redundancy Check (CRC): A type of check sequence.
Dataword: The data being protected by a check sequence. A dataword can be any number of bits and is independent of the machine word size (e.g., a dataword could be 64 megabytes).
Failure rate: 1 False negatives per time unit (inverse of the MTBF). 2 False negatives per corrupt member decompressed (Pud).
False negative: Undetected error producing incorrect decompressed data.
False positive: Inability or refusal to decompress the data in a damaged compressed file even if the compressed data proper remain decompressible. (i.e., either the compressed data have not suffered any damage or the damage does not hinder decompression).
Inaccuracy: Ratio of error detection mistakes. Is defined as ( false_negatives + false_positives ) / total_cases.
Mean Time Between Failures (MTBF): Mean time between false negatives for a given decompression rate of corrupt members per time unit.
Overhead: Everything in the file except the compressed data proper. (Headers, check sequences, padding, etc).
Overkill: An unnecessary excess of whatever is needed to achieve a goal.
Pud: Probability of an undetected error (false negative). Pud is estimated as Pudd * Pudc.
Pudc: Probability of an error undetected by the check sequence.
Pudd: Probability of an error undetected by the decoder. Only errors not guaranteed to be caught by the check sequence (CRC) are counted when using Pudd to estimate Pud.
Year (Gregorian): 365.2425 days.

You are free to copy and distribute this article without limitation, but you are not allowed to modify it.

First published: 2020-12-27
Updated: 2022-07-31

This page does not use javascript.