Gfzip file format specification version 1.0

The file format specification was frozen and finalized on October 15, 2006.


Features

This short document describes version 1 of the gfzip file format. Gfzip is a file format for the compressed, yet non-sequentially accessible, storage of disk image data for computer forensic purposes. The gfzip file format combines the following features (in order of priority):
  1. User supplied meta-data is embedded in a meta-data partition within the file.
  2. Signed data and meta-data partitions using x509 certificates.
  3. Bound signatures. The segment signatures in the files are bound together, making it impossible to falsify meta-data by adding a meta-data section from another file.
  4. Uncompressed disk images can be used the same way dd images are, as gfzip uses a data first design.
  5. Multi level SHA256 digest based integrity guards.
  6. Compressed or uncompressed storage of disk image data.
  7. Read access to compressed disk image data using non sequential seek/read methods.
  8. Support for packed storage. Reduction of required storage by way of duplicate block annihilation.
  9. Flags that can be set for sections of disk image data, for example to mark sections as bad.
  10. Support for encryption.
  11. Support for storage of packed data in several archive files. This feature allows for the creation of archive files with compressed data that are referred to from multiple gfzip disk image files, thus reducing the amount of storage required for large archives of disk images.
  12. Support for the experimental data-reduction on acquire (ROA) packed storage. Here a digests only extract from an archive is used for data reduction on acquire.

Many of these features are also covered by the Advanced Forensic Format (AFF), and there is a high level of similarity between the two projects. In fact, the gfzip file-format has been completely revised to increase the similarities wherever possible. The two projects merely prioritize different features. The gfzip file-format tries to work toward a common future standard file-format by using, where possible, an AFF dialect inside partitions of the gfzip file. At this point in time (proper/efficient) signing, packing, out of band flags, and dd compatibility are not supported by AFF. For gfzip these are core features. At the same time gfzip discards other features that AFF offers. It is hoped that in time, through mutual effort, the two file formats can be combined into a single one that offers all the features currently provided by the two separate file formats.

Use of AFF style segments

The gfzip file-format makes use of AFF style segments wherever possible and logical. In many cases this will add a little data overhead, as in the data section, where record headers and footers are not used by gfzip. This overhead, however, is small enough to be acceptable in the light of potential compatibility and a possible merger of the gfzip file-format with the AFF file-format. The header and footer of an AFF segment have the following (pseudo) structure:
     struct affheader {
          u_int32_t    magic;       // 0x41464600 ("AFF\0")
          u_int32_t    namelen;     // = NAMELEN, the length of name in bytes
          u_int32_t    datalen;
          u_int32_t    argument;
          char         name[NAMELEN];
     };
     struct afffooter {
          u_int32_t    magic;       // 0x41545400 ("ATT\0")
          u_int32_t    seglen;
     };
Please note that this structure definition differs from the AFF definition in that AFF defines the format to use only big endian integer notation, while gfzip accepts both little and big endian notation. If a file is meant to be AFF compliant, then big endian should be used throughout the whole file.
In the rest of the document, "segments" refers to segments of a partition that have the header and footer defined accordingly. All segments defined in this document have an appropriate segment name. Most segment names in this document are not defined for AFF; those currently all have a "gfz." prefix to allow AFF implementations to quickly distinguish and ignore them if they don't also support gfzip partitioning and signing.
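As an illustration of the header layout and the two accepted byte orders, the following is a minimal sketch of a header reader. It uses the magic value to decide which byte order the remaining fields are in. The names seg_header, read_be32, read_le32 and parse_seg_header are hypothetical and not part of the specification, and the sketch uses uint32_t from stdint.h in place of u_int32_t.

```c
#include <stddef.h>
#include <stdint.h>

#define AFF_HEADER_MAGIC 0x41464600u  /* header magic from the structure above */

/* Hypothetical in-memory form of the fixed part of a segment header. */
struct seg_header {
    uint32_t namelen;
    uint32_t datalen;
    uint32_t argument;
};

/* Read a 32 bit big endian integer from a byte buffer. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Read a 32 bit little endian integer from a byte buffer. */
static uint32_t read_le32(const unsigned char *p)
{
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

/*
 * Parse the 16 fixed header bytes at buf.
 * Returns 1 if the header is big endian (AFF compliant), -1 if it is
 * little endian, and 0 if the buffer is too short or the magic is wrong.
 */
int parse_seg_header(const unsigned char *buf, size_t len,
                     struct seg_header *out)
{
    uint32_t (*rd)(const unsigned char *);
    int order;

    if (len < 16)
        return 0;
    if (read_be32(buf) == AFF_HEADER_MAGIC) {
        rd = read_be32;
        order = 1;
    } else if (read_le32(buf) == AFF_HEADER_MAGIC) {
        rd = read_le32;
        order = -1;
    } else {
        return 0;
    }
    out->namelen  = rd(buf + 4);
    out->datalen  = rd(buf + 8);
    out->argument = rd(buf + 12);
    return order;
}
```

The segment name follows directly after these 16 bytes and is namelen bytes long. Reading byte by byte, rather than casting the buffer to a struct, avoids any dependence on the byte order or alignment rules of the host.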

AFF compatibility.

The default non-packed mode for gfzip files is AFF compatibility mode. In this mode, gfzip files hold some redundant segments in order to increase compatibility with the AFF file format. The gfzip file format only specifies a subset of the AFF segment names; other segment types are considered to be user level. Gfzip defines a small set of natively AFF compatible segments, a second small set of redundantly AFF compatible segments, and a few incompatible segments that for this reason should be prohibited from being added as user supplied meta-data.

Natively compatible segments

  Section name    In partition
  pageNNN         data
  md5             digest
  sha1            digest

Redundantly compatible segments

  Section name             In partition    Redundant with
  pagesize                 data            Footer::bcount
  imagesize                meta.acquire    Footer::unc_size
  pageNNN_md5              digest          Digest::gfz.digesttable
  acquisition_technician   meta.acquire    meta.acquire::gfz.x509

Incompatible segments


File partitions and partition sections

The gfzip files are built up from a set of partitions. A partition may contain a whole raw disk image, while other partitions are built up out of sections. Gfzip tries to use an AFF dialect wherever reasonable for the sectioning of partitions. As a result, depending on its usage, gfzip ends up with different (in)compatibilities. Some partitions are made up of a single section while others contain multiple sections. The partitions that the gfzip format defines are:

There are different types of gfzip files, each with their own set of partitions. As stated, most partitions can have only a single instance in a single file, but the meta and crypto info partitions are important exceptions. In the rest of this page we will describe the content of the sections in detail.

The footer section ("gfz.footer")

Gfzip has no header but uses a footer instead, which has the actual function of a header. First a little note on why we chose not to use a header. Given that compression is only one of the many features that gfzip files offer, it is not unlikely that many users may want to use gfzip files for one or more of the other features. If we were to use a header instead of a footer and place this header in front of the uncompressed disk image data, then this would make the uncompressed disk image data unusable with the many tools that can take dd images as input. By taking the approach that the data precedes all the other partitions, we end up, for uncompressed data, with a valid dd image (followed by the gfzip partitions). Given that the header would normally be the only fixed-size partition in the file, we now put this header at the very end of the file so that we are able to locate it without trouble. Thus we end up with a header-like structure at the end of the file: the footer partition/section. The footer section has the following structure:
struct gfz_footer {
    u_int64_t      magic;
    u_int64_t      unc_size;
    u_int64_t      partitioningtable_offset;
    unsigned char  partition_count;
    unsigned char  version;
    unsigned char  compression;
    unsigned char  pltype;
    u_int32_t      bcount;
};
We shall describe the different fields in this footer:

magic

The 64 bit magic value has a twofold purpose. First, it validates that the file actually is a gfzip file; second, it makes endianness adjustments possible. The content of magic is the value 0x47465a7867667a58, which has the ASCII representation 'GFZxgfzX'. A gfzip implementation must use this value to check whether the file was produced on a system with the same endianness as the one it is now running on, and could do conversions if there is a mismatch. If AFF compatibility is required, then this magic value and all other integer values should be put in network order, even if the system the file is being created on is a little endian system.
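The endianness check described above can be sketched as follows. This is only an illustration under the assumption that the caller has already read the 8 raw magic bytes of the footer; the names GFZ_MAGIC, swap64 and gfz_check_magic are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define GFZ_MAGIC 0x47465a7867667a58ULL  /* 'GFZxgfzX' */

/* Reverse the byte order of a 64 bit value. */
static uint64_t swap64(uint64_t v)
{
    uint64_t r = 0;
    int i;

    for (i = 0; i < 8; i++) {
        r = (r << 8) | (v & 0xff);
        v >>= 8;
    }
    return r;
}

/*
 * Inspect the 8 raw magic bytes of a footer.
 * Returns  1 if the file uses the byte order of the running system,
 *         -1 if every integer in the file needs to be byte swapped,
 *          0 if this is not a gfzip footer at all.
 */
int gfz_check_magic(const unsigned char raw[8])
{
    uint64_t m;

    memcpy(&m, raw, sizeof m);  /* interpret the bytes in host order */
    if (m == GFZ_MAGIC)
        return 1;
    if (swap64(m) == GFZ_MAGIC)
        return -1;
    return 0;
}
```

A reader that gets -1 back would apply the same kind of byte swap to the other integer fields of the footer before using them.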

unc_size

This value represents the size (in bytes) the disk image would have in uncompressed form.

partitioningtable_offset

This value points at the location in the current file where the partition records partition starts.

partition_count

This value describes the number of partition records in the partitioningtable segment.

version

This value represents the version of the gfzip file format. The first version of gfzip will implement only version 1.

compression

This value describes the type of compression used for the data partition. There are currently three different valid values defined:

pltype

This value describes the type of payload this gfzip file carries. The following values are defined:
Type 1 is meant mostly to allow for uncompressed gfzip files that can be used with existing tools that require raw dd images. Thus type 1 aims to provide dd compatibility.
Type 0 is meant to be the most AFF compatible way in which single non packed acquired images are stored. Type 0 aims to provide AFF compatibility.
Although packing becomes more important for larger sets of images, type 2 provides the possibility to define single images in a packed format.
An important feature that gfzip tries to allow for is efficient storage of large sets of images. To accomplish this, gfzip allows the data and/or the lookup partition to be located in another file. From an archive point of view, there are the following possible configurations:

bcount

The bcount value is used to determine the number of blocks that are compressed at a time and are atomically addressable through an indextable/digesttable entry. It holds the number of 512 byte blocks that are compressed and addressed as a single block of data. Please note that only powers of two are valid values for bcount. It is important to note some issues with respect to the bcount chosen and its consequences. While a larger bcount value would yield higher performance and better compression for compression, encryption and digest calculation, (ROA) packing yields better data reduction results with lower values for bcount. It is thus essential to choose a suitable value for bcount carefully.
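The arithmetic implied by bcount can be written down in a few lines. The sketch below is illustrative only; the function names and the GFZ_BLOCK_SIZE constant are hypothetical, while the 512 byte block size and the power of two rule come from the text above.

```c
#include <stdint.h>

#define GFZ_BLOCK_SIZE 512u  /* bytes per block, as stated above */

/* Only non-zero powers of two are valid bcount values. */
int gfz_bcount_valid(uint32_t bcount)
{
    return bcount != 0 && (bcount & (bcount - 1)) == 0;
}

/* Uncompressed size in bytes of one addressable chunk. */
uint64_t gfz_chunk_bytes(uint32_t bcount)
{
    return (uint64_t)bcount * GFZ_BLOCK_SIZE;
}

/*
 * Number of chunks needed to cover unc_size bytes of uncompressed data.
 * The last chunk may be shorter than the others. bcount must be valid.
 */
uint64_t gfz_chunk_count(uint64_t unc_size, uint32_t bcount)
{
    uint64_t chunk = gfz_chunk_bytes(bcount);

    return unc_size / chunk + (unc_size % chunk != 0);
}
```

For example, a bcount of 64 gives chunks of 32768 bytes, so an image of 100000 bytes is covered by four chunks, the last of which holds only 1696 bytes.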

The partition records partition

The partition records partition consists of a number of partition segments. Each partition segment denotes the type of the partition and the location and size of the partition within the file.

The partition segments ("gfz.part.XXXX")

The signed partition segments refer to specific partitions in the gfzip file, their signatures and the x509 certificate that was used for signing. A "gfz.part" has the following structure:
struct partition_ref {
    u_int64_t      partition_start;
    u_int64_t      partition_size;
    u_int16_t      certificate_meta_partition;
    u_int8_t       encrypted;
    u_int8_t       signing_type;
    u_int16_t      parent_partition;
    u_int16_t      coparent_partition;
    char           signature[];
};
The partition_start and partition_size fields indicate the location and size of the partition. The certificate_meta_partition defines what number (order in the partition records partition) partition contains the x509 certificate used to sign the partition.
The gfzip file format has as one of its most crucial features the recording and guarding of the chain of custody. To accomplish this, the "gfz.part" section uses a simple signing method that is targeted at this goal. The signing_type field is used to indicate what kind of position the partition has in the chain of custody. Depending on the value of signing_type, the fields parent_partition and coparent_partition are filled with the number (order in the partition records partition) of the two partitions chained to this partition. Next to the chains thus defined, any partition created at the time of, or after the time of, acquisition will also link to the digest of the digest partition. This is an additional guard to ensure that parts of the chain of custody can be recovered in case of a corrupted partition in the base chain (either human or technical).
The signature field is essential to the whole integrity guarding process of gfzip files. It is created from a set of SHA256 digests. These digests are concatenated together and are encrypted using the private key of the signer. The digests can thus be decrypted with the public key stored in the indicated x509 certificate, and can be used to validate both the integrity of the partition and the chain of custody as recorded by the gfzip file.
The naming of partitions of this type is done as follows:

Crypto info partitions.

Gfzip allows the content of sections to be encrypted. This encryption is done per section but on a partition basis. That means that, for any given partition, either all sections in that partition have encrypted payload or none do. Each partition that has encrypted payload will have its own crypto info partition that holds information on the encryption of the partition it pertains to. We define that each encrypted partition is directly preceded by an appropriate crypto info partition.

Crypted sections in the preceding partition

If a partition is marked as being encrypted, the regular name of each section gets prefixed with "cr.", and the normal argument moves in order to allow encryption specific information to be added as argument. The argument of the section is an index into the initialization vector table of the crypto info partition. The payload of an encrypted section consists of the following: The AFF argument refers to an entry in an initialization vector table that is defined in the crypto info partition.

Certificate section ("gfz.x509")

This section simply holds an x509 certificate that was used to sign this and/or some other partition in the gfzip file. This section is optional within a single meta-data section, but at least one such section should exist within the whole file. In unencrypted gfzip files this section will be stored in the meta data partition.

Crypto target sections ("gfz.ctarget")

Each target that should be able to read the payload has its x509 certificate info incorporated in the crypto info partition by means of a ctarget section. A ctarget section holds the x509 certificate of a target. Each distinct value of the section argument is defined as the id of this target and is used to locate the proper aeskey sections.

Crypto key sections ("gfz.aeskey")

This section holds a public key encrypted version of the AES128 key used to encrypt both the sections of the following data section, and the ivtables in this crypto info partition. The section argument refers to the ctarget section that had its public key used for the encryption of this section. Next to the AES128 key, the section holds a SHA256 digest of this key, used for validation purposes.
  struct ivtable {
     char          aeskey[16];   // the AES128 key
     char          digest[32];   // SHA256 digest of aeskey
  };

Crypto IV-table sections ("gfz.ivtable")

Given that we want to keep the random access property of gfzip files intact even with crypto in place, we need to encrypt each section separately, without dependence on previously encrypted sections. Given that initializing public key encryption is very time consuming, doing fully separate public key encryption on each section would be too expensive. The alternative of just encrypting a symmetric key using public key encryption, and using this key over and over again, has the disadvantage that it significantly weakens the security. Thus we need to look for a compromise solution. Block based algorithms like AES are often used with Cipher Block Chaining (CBC) in order to handle data that is multiple blocks in size. When using AES-CBC, the result of each block processed is used as an input vector for enciphering the next block. The first block uses an initialization vector, as there exists no previous block. If we use the same key, but a different random initialization vector for each section, we should be able to approach the same security level as reached with continuous encryption. This means we will need to store a copy of the initialization vector table thus created. The "gfz.ivtable" sections hold an AES128 encrypted version of a big array of 16 byte long initialization vectors. An index into this array is used in the argument of the encrypted sections to point to the proper entry used to initialize encryption for that section.

The data partition

Depending on the compression, pltype and bcount values as defined in the footer section, the data partition will hold chunks (sections) of (compressed or uncompressed) data that are concatenated together to form one big data section. When the footer defines pltype to be AFF, the data partition will actually be formed out of AFF sections with an AFF header and footer, and will be prefixed with a segsize and imagesize AFF segment purely for the purpose of AFF compatibility. Each chunk of (compressed) data represents a chunk of uncompressed data with a size that is determined by the bcount value in the footer, and is compressed using the method defined by the compression value in the footer. Please note that the final chunk of compressed data could represent a smaller chunk of data than that defined by bcount, given that the complete size of the uncompressed data may not be exactly N times the defined chunk size. Each chunk of compressed data is referenced both by an entry in the "gfz.indextable" and "gfz.digesttable" sections, and when present also the "gfz.sizetable" section.
It is crucial to understand how the different storage methods are represented in this partition. For pltype 1 (raw) this partition is an exact copy of the raw dd image or, if compression is defined, an ordered concatenation of all compressed data chunks without any padding, header or footer. If pltype is defined as 0, this partition will essentially consist of ordered AFF data segments with header and footer. Further, for type 0 the data partition will be prefixed with an AFF file header. If pltype is defined as packed (2,6,7,9), the data is stored as concatenated chunks of compressed data as is used for type 1, but the data chunks may be stored out of order; that is, data chunks will be stored without duplicates and any second occurrence will refer to the storage of the first occurrence.
Packed storage is currently incompatible with the AFF format, but it is hoped that future versions of AFF will allow for packed storage also. When the experimental ROA packing is used, the same techniques are used as for regular packed images, however data chunks with known (and thus archived) digests will have no representation stored in the data partition.

The digest partition

Image digest sections

The digest partition requires a set of image wide digests to be defined. In line with the AFF file-format these digests are AFF records with the following names: All three of these sections are mandatory within the digest partition.

The data digest table section "gfz.digesttable"

The data digest table section is a large array of 32 byte long SHA256 digests. Each value represents the SHA256 digest of the uncompressed version of a data chunk as found on the original medium. This means that after decompression of a compressed chunk of data, the integrity can be checked using the digest stored in this section.
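The integrity check amounts to comparing a freshly computed digest against the matching 32 byte entry of the table. The sketch below shows only that comparison. It assumes that the digest of the decompressed chunk has already been computed with a SHA256 implementation of the reader's choice, and that the table payload is available in memory; the name gfz_digest_matches is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GFZ_DIGEST_SIZE 32  /* bytes in one SHA256 digest */

/*
 * Compare a computed digest with entry number chunk of the digest table.
 * table points at the payload of the digest table section, table_len is
 * its length in bytes, computed points at GFZ_DIGEST_SIZE bytes.
 * Returns 1 on a match, 0 on a mismatch, -1 if chunk is out of range.
 */
int gfz_digest_matches(const unsigned char *table, size_t table_len,
                       uint64_t chunk, const unsigned char *computed)
{
    if (chunk >= table_len / GFZ_DIGEST_SIZE)
        return -1;
    return memcmp(table + (size_t)chunk * GFZ_DIGEST_SIZE,
                  computed, GFZ_DIGEST_SIZE) == 0;
}
```

Because the digest covers the uncompressed data, the same table entry stays valid regardless of which compression method was used to store the chunk.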

The lookup partition

The lookup partition allows the location of the appropriate section of compressed or uncompressed data in either the image file or in the archive that the image file is linked to.

The index table partition/section ("gfz.indextable")

The index table section is a large array of 64 bit native order u_int64_t values. Each value points to the beginning of a chunk of compressed data in the data partition, of which the uncompressed size is determined by the bcount value in the footer section. In this way, the indextable can be used to look up where the compressed version of data at a specific offset in the uncompressed data is to be found. Using the indextable it becomes possible to use non-sequential access to the compressed data, given that each chunk of data is compressed separately and is addressable using the indextable section. Unless a "gfz.sizetable" is defined, the index table section will hold one final u_int64_t value in order to be able to determine the size of the final chunk of compressed data. That is, without the gfz.sizetable, the size of a compressed chunk is taken to be the offset of the next compressed chunk minus the offset of the current one.
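A seek into the compressed data can be sketched as follows, for the case where no size table is present and the index table therefore carries one extra terminating entry. The struct gfz_extent and the function gfz_locate are hypothetical names; the index values are assumed to have been loaded into memory and converted to host byte order already.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: where one compressed chunk is stored. */
struct gfz_extent {
    uint64_t offset;  /* start of the chunk, as stored in the index table */
    uint64_t size;    /* compressed size in bytes */
};

/*
 * Find the compressed chunk that holds uncompressed byte position pos.
 * index has `entries` values, the last one being the extra terminating
 * entry, so the table describes entries - 1 chunks.
 * Returns 0 on success, -1 if the arguments are invalid or pos lies
 * beyond the last chunk.
 */
int gfz_locate(const uint64_t *index, size_t entries, uint32_t bcount,
               uint64_t pos, struct gfz_extent *out)
{
    uint64_t chunk;

    if (bcount == 0 || entries < 2)
        return -1;
    chunk = pos / ((uint64_t)bcount * 512);  /* 512 byte blocks */
    if (chunk >= entries - 1)
        return -1;
    out->offset = index[chunk];
    out->size   = index[chunk + 1] - index[chunk];
    return 0;
}
```

After reading and decompressing the located chunk, the wanted byte sits at position pos modulo the chunk size within the decompressed data.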

The size table section ("gfz.sizetable")

If an image file is packed, it will not be possible to know the size of a chunk of compressed data by looking at the next offset in the index table, given that packed files won't always respect the order of the data. Using AFF headers/footers might help, but AFF has a sequentiality that currently will be broken by packed files; thus packed files cannot have an AFF data partition either. To overcome the problem with packing, the "gfz.sizetable" section is defined. This section holds a big array of 32 bit native order unsigned numbers representing the compressed sizes of the chunks referred to by the index table.
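For packed files the lookup takes the size from the size table instead of from the next index entry. A minimal sketch, assuming both tables are in memory in host byte order and have one entry per chunk; gfz_locate_packed is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Look up the stored location and compressed size of chunk number chunk
 * in a packed image. index and sizes both have `chunks` entries.
 * Returns 0 on success, -1 if chunk is out of range.
 */
int gfz_locate_packed(const uint64_t *index, const uint32_t *sizes,
                      size_t chunks, uint64_t chunk,
                      uint64_t *offset, uint32_t *size)
{
    if (chunk >= chunks)
        return -1;
    *offset = index[chunk];
    *size   = sizes[chunk];
    return 0;
}
```

With an index of {0, 900, 0} and sizes of {900, 700, 900}, the third chunk is a duplicate of the first: both resolve to offset 0 with a size of 900, something that subtracting neighbouring index entries could not express.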

Archive lookup section ("gfz.lookup")

As an alternative to embedded "gfz.sizetable" and "gfz.indextable" sections, these sections may be stored in the lookup partition of a separate lookup file. By doing this, the image files will remain unchanged if an archive gets updated, and only the lookup files will get updated on merges of archives. The payload of this section is an identifier for the lookup file. Please note that usage of references is not defined by the file format, this detail is left as a degree of freedom to archive implementations.

Multi file archive reference sections ("gfz.archive")

With big archives, but also for use with acquirement procedures that mandate small image files, the data partition may not actually be in a single file, but may be stored in multiple files that each handle a separate part of the available data. The gfzip file-format expects the data in packed multi file archives to be distributed over the multiple files by dividing up the SHA256 address space. The gfz.archive section holds as its argument the number of archive sections the archive consists of. This value is only allowed to be a power of two not exceeding 512 (the reason for this is the maximum number of open file handles, which is often 1024). Its content is an identifier for the archive. Please note that usage of references is not defined by the file format; this detail is left as a degree of freedom to archive implementations.
If "gfz.archive" is defined, then the compressed and packed data sections are located as follows:

The archive content partition

If an archive gets updated, any file holding a lookup partition that refers to it should be updated with it. For this reason, the gfzip file-format specifies a partition that holds a list of references to member images that refer to the archive and should get updated whenever the archive gets updated.

The archive member sections ("gfz.member")

The content of a "gfz.member" section is a reference to an image that itself contains a lookup-table that pertains to the archive. Please note that usage of references is not defined by the file format; this detail is left as a degree of freedom to archive implementations.

The meta partition

Certificate section ("gfz.x509")

This section simply holds an x509 certificate that was used to sign this and/or some other partition in the gfzip file. This section is optional within a single meta-data section, but at least one such section should exist within the whole file. In encrypted gfzip files this section will be stored in the crypto info partition.

Flags sections ("gfz.flags")

Both the AFF file-format and the earlier draft of the gfzip file format define specific bad block data sections. The use of these sections is not allowed in the gfzip file format, as it is believed to be forensically incorrect. The flags section as defined here is an alternative to the bad-data approach. The flags section defines a 32 bit bitmap overlay for the data and is defined separately from the data and its integrity guards. Given that we use a 64 bit unsigned u_int64_t for addressing uncompressed data, and given that sections of data that have a flag set for them can differ greatly in size, we use a simple tree structure for our flags section. We subdivide the 64 bit address into 8 parts of 8 bits each. The flags section starts off with a set of 256 records, needed to address the most significant byte in the 8 byte address space. A flags record has the following structure:
struct ftable_rec {
   u_int32_t  subtable;
   u_int32_t  orflags;
   u_int32_t  andflags;
};
struct ftable_node {
   struct ftable_rec subrec[256];
};
subtable
This value points to the record-set number describing the next byte, in order of significance, of the 64 bit address space. If it has the value 0, then the highest level of precision has been found, and orflags and andflags will both hold exactly the same value.
orflags
This value describes which of the 32 flags are set anywhere in the address space addressed at this level of significance.
andflags
This value describes which of the 32 flags are set everywhere in the address space addressed at this level of significance.
Currently the following flags are defined for gfzip:
The flags section has an AFF style argument defined. If this argument has a value of 0, then the flags must be interpreted as above. If it has any other value, then the interpretation of the flags is application defined, and is identified by both the argument value and a flagnamespace section with the same AFF argument value. By doing this, the flags section becomes an extensible mechanism for localized metadata.
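Looking up the flags for a single address means walking the tree one address byte at a time, from most to least significant, until a record without a subtable is reached. The following is a minimal sketch under the assumption that the whole flags section has been loaded as an array of record sets, with record set 0 describing the most significant byte. The function name gfz_flags_at is hypothetical, and uint32_t is used in place of u_int32_t.

```c
#include <stddef.h>
#include <stdint.h>

struct ftable_rec {
    uint32_t subtable;
    uint32_t orflags;
    uint32_t andflags;
};

struct ftable_node {
    struct ftable_rec subrec[256];
};

/*
 * Find the flags that apply to uncompressed address addr.
 * nodes[0] is the record set for the most significant address byte.
 * Returns 0 on success and stores the flags in *flags; returns -1 if
 * the table is empty, refers to a record set that does not exist, or
 * never reaches a record with subtable 0.
 */
int gfz_flags_at(const struct ftable_node *nodes, size_t node_count,
                 uint64_t addr, uint32_t *flags)
{
    size_t node = 0;
    int shift;

    if (node_count == 0)
        return -1;
    for (shift = 56; shift >= 0; shift -= 8) {
        const struct ftable_rec *rec =
            &nodes[node].subrec[(addr >> shift) & 0xff];

        if (rec->subtable == 0) {
            /* Highest precision reached: orflags equals andflags. */
            *flags = rec->andflags;
            return 0;
        }
        if (rec->subtable >= node_count)
            return -1;
        node = rec->subtable;
    }
    return -1;  /* ran out of address bytes without reaching a leaf */
}
```

A reader that only wants to know whether a flag occurs anywhere in a large range can stop early and test orflags at a higher level, which is what makes the tree cheap for mostly unflagged images.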

Flag namespace sections ("gfz.flagnamespace")

As a way to extend the localized metadata facilities offered by the flags mechanism, this section offers a facility to extend the standard (defined or reserved) flags that are available in a standard flags section. A flags section can, by its argument, identify an alternative namespace by number. The flag namespace section, if it has an argument that is equal to that of an accompanying flags section, is used to define the name of the namespace. By doing this, the flags section becomes an extensible mechanism for localized metadata. The content of this section is a UTF8 string defining the namespace name.

Meta data sections

A meta data partition can contain many key/value pairs that contain meta information about the image or the procedural chain of custody the image went through from the moment of acquisition until the present. Meta data partitions can even exist that contain pre-acquisition information that is crucial to the chain of custody.

Mandatory metadata

Most metadata is optional; there is, however, a small set of metadata keys that form mandatory sections in each or in specific metadata partitions.

Mandatory for each partition

Next to these few mandatory metadata sections, the mandatory x509 signing provides additional mandatory metadata.

Metadata attribute and typing conventions

Although usage of the AFF argument for user level metadata is considered to fall outside of the actual gfzip file-format specification, it is suggested to use any not explicitly used argument for human parsable metadata as a hint to metadata typing. The following types and attribute values are defined:
Please note that this typing is not mandatory; it is mainly suggested in order to accommodate connectivity to the Open Computer Forensics Architecture.

Suggestions for metadata and metadata partition naming

The rest of this document defines a minimal chain of custody schema that is suggested for chain of custody purposes. It is in no way mandated by the file specification, and it may soon become outdated by work being done by the CDESF working group. It is not certain, however, whether this working group will incorporate appropriate COC guards into the generic schema. If not, this recommendation will get updated according to user input. Please note that this is not actually part of the file format specification; it is listed here for convenience only.

Suggestions for meta data partitions.

Although gfzip has an open and extensible naming convention for meta-data partitions, it is considered wise to adhere to the following naming conventions for these sections wherever appropriate. Meta-data partitions will, in their partition record and signing, always declare the parent partition as being the last preceding meta partition. A "gfz.part.meta.acquire" may declare a "gfz.part.meta.warrant" partition as parent, given that there would be no other existing meta partitions yet. A "gfz.part.meta.orgtransfer" partition would in general declare a "gfz.part.meta.case" as coparent. By the use of parent and coparent chaining in the partitions, and their signing scheme with x509 certificates embedded in the meta data partitions, the complete chain of custody is recorded and is cryptographically guarded.

typical chain of custody for an acquired image

chain of custody after packing

chain of custody after archiving

Suggestions for metadata usage

Partition                                 Section                        Description
case/warrant                              case_num                       An ID for the case as provided by the prosecutor
warrant                                   gfz.warrant_num                An ID for the warrant as provided by the prosecutor
warrant                                   gfz.warant_not_before          Start of valid acquirement window
warrant                                   gfz.warant_not_after           End of valid acquirement window
warrant/acquire,phys_acquire,orgtransfer  gfz.locality_country
warrant/acquire,phys_acquire,orgtransfer  gfz.locality_sop               State or province
warrant/acquire,phys_acquire,orgtransfer  gfz.locality_city
warrant/acquire,phys_acquire,orgtransfer  gfz.locality_street
warrant/acquire,phys_acquire,orgtransfer  gfz.locality_building
warrant/acquire,phys_acquire,orgtransfer  gfz.locality_appartment
acquire,phys_acquire,orgtransfer          gfz.locality_room
acquire,phys_acquire                      gfz.locality_esource_location  Location in the room where the evidence source was found
acquire                                   gfz.locality_media             Location in the evidence source where the media was found
acquire,phys_acquire                      gfz.esource_id                 Unique id (within case) of the evidence source
acquire                                   gfz.media_id                   Unique id (within evidence source) of the media
orgtransfer                               gfz.organisation
orgtransfer                               gfz.ou                         Organisational unit
orgtransfer                               gfz.dep                        Department
orgtransfer                               gfz.representative
reduction                                 gfz.reductionset_id
repair                                    gfz.damagereport
declare                                   gfz.declaration_reason         Why the data was declared inadmissible as evidence


Verbatim copying and distribution of this entire article are permitted worldwide, without royalty, in any medium, provided this notice, and the copyright notice, are preserved.