## 5.3. Internal file formats

### 5.3.1. Preface

Please let us know if you find any other internal files, figure out formats of any internal files or find out what unknown parts of the files below do. Any and all contributions will be fully attributed and, if appropriate, co-copyright given.

In this section, where the description of a file says that an item is an offset into another file, that file may be located in the same CHM, or it may be located in an accompanying CHI file.

The different types of ITSF files contain different internal files. The list below indicates which file types contain which internal files:

CHI
CHM

/#ITBITS, /#SYSTEM, /#IDXHDR, /#STRINGS, /#TOCIDX, /#TOPICS, /#URLSTR, /#URLTBL, /#IVB, /#SUBSETS, /#WINDOWS, /$FIftiMain, /$OBJINST, /$WWAssociativeLinks/BTree, /$WWAssociativeLinks/Data, /$WWAssociativeLinks/Map, /$WWAssociativeLinks/Property, /$WWKeywordLinks/BTree, /$WWKeywordLinks/Data, /$WWKeywordLinks/Map, /$WWKeywordLinks/Property

CHQ

/$FIftiMain, /$OBJINST, /$TitleMap CHW hh.dat KPD Seen in HHA.dll or on the internet, but not seen in any ITSF files #GRPINF (see helpdeco docs by Manfred Winterhoff for a possible function), #INFOTYPES (probably will be output when MS implements information types), #URLS (probably a previous incarnation of the #URLTBL + #URLSTR combination), #BSSC (8 bytes, based on something I saw in KeyTools.exe it might contain the version of RoboHelp used to create the CHM. If you have RoboHelp please check this out & be sure to send in your chm.) ### 5.3.2. #ITBITS The files I have seen so far have been empty or filled with zero BYTEs so who knows. My guess is that it has something to do with information types. The file where it had a non-zero size (12 zero BYTEs in VOICESDK.CHI from the MSDN) also had a non-zero #SYSTEM code 15 (Information type checksum) entry of 0xFFFFFFFF. ### 5.3.3. #SYSTEM The #SYSTEM file begins with a DWORD, which is a version number. It is 2 in files compiled with Compatibility set to "1.0" or 3 in files compiled with Compatibility set to 1.1 or later. Other values have not been found. It is followed by #SYSTEM entries to the EOF, which have the following format: Table 5.1. The format of #SYSTEM entries. Offset Type Comment/Value 0 WORD code - see below for values & meanings 2 WORD length of data 4 BYTEs data In the below list of the different codes the order of the codes in the #SYSTEM file is 10, 9, 4, 2, 3, 16, 6, (5,0,1 or 0,1,5 - haven't been able to make files with all three), 7, 11, 12, 13, 14, 8 and lastly 15. Table 5.2. An explanation for each of the #SYSTEM codes. Code[a] Explanation 0 Value of Contents file in the [OPTIONS] section of the HHP file. NT 1 Value of Index file in the [OPTIONS] section of the HHP file. NT 2 Value of Default topic in the [OPTIONS] section of the HHP file. NT 3 Value of Title in the [OPTIONS] section of the HHP file. NT 4 28 (HHA Version 4.72.7294 and earlier) or 36 (HHA Version 4.72.8086 and later) byte structure: Table 5.3. The format of the code 4 #SYSTEM entry. Offset Type Comment/Value 0 DWORD LCID from the HHP file. 4 DWORD One if DBCS is in use. 8 DWORD One if full-text search is on. 0xC DWORD Non-zero if the file has KLinks. 0x10 DWORD Non-zero if the file has ALinks. 0x14 QWORD timestamp. Derived from GetSystemTimeAsFileTime, as called by the compiler program. No conversion, just stored as two little endian DWORDs, dwLowDateTime first, then dwHighDateTime. 0x1C DWORD 0/1 (unknown) Only dsmsdn.chi from the MSDN has 1 here. Perhaps 1 means it is the root chm of a collection? 0x20 DWORD 0 (unknown) 5 Value of Default Window in the [OPTIONS] section of the HHP file. NT 6 Value of Compiled file in the [OPTIONS] section of the HHP file. This is the lowercase of the stem of the CHM file name. If the name of the CHM is ..\bar\foo\ FOO-Bar . chm jimmy is a poo-bum then this will be foo-bar . NT 7 [a]DWORD present in files with Binary Index turned on. The entry in the #URLTBL file that points to the sitemap index had the same first DWORD. 8 Rare. VOICESDK.CHM & CHI and WOSA.CHI from the MSDN have one. The abbreviations and explanations seem to be the same in WOSA.CHI & VOICESDK.CHM, except for 2 mistakes (one in VOICESDK.CHM & one in WOSA.CHI) that seem to be created by bugs in the compiler. Both were compiled by the same version of HHA (4.72.8086), so perhaps this version has some weird bug. Each entry is 16 BYTEs: Table 5.4. The format of the code 8 #SYSTEM entry. Offset Type Comment/Value 0 DWORD 0, 4 in some (unknown) 4 DWORD Offset in #STRINGS file. An abbreviation. 8 DWORD 3 where 1st DWORD is 0, 5 where it is 4 (unknown) 0xC DWORD Offset in #STRINGS file. An explanation of the abbreviation. 9 The version/program that the CHM was compiled by - shown in the version dialog as Compiled with %s where %s is what is in this entry of the #SYSTEM file. If compiled with the MS HTML Help Author dll then it will be something like HHA Version 4.74.8702. It comes directly from the resource strings of HHA.dll (I saw it there in Unicode and successfully altered it). Beware that the text control in the version dialog that displays it is only so big and in some cases the string won't be displayed, & in other cases only part, depending apon the effect of wrapping, so if you write a compiler, be sure to test it and use a short name and version. Usually NT, but HH won't crash if it isn't. 10 time_t timestamp (DWORD). Derived from GetLocalTime. Not sure of the conversion yet. 11 [a]DWORD present in files with Binary TOC turned on. The entry in the #URLTBL file that points to the sitemap contents has the same first DWORD. 12 [a]DWORD. Number of information types. 13 [a]The #IDXHDR file contains exactly the same bytes. See below for more info 14 Rare. The ones I saw were from MS Word 2000. My guess is that it is an MSOffice extension (or maybe not) that overrides the names & window types of the navigation tabs. DWORD number of windows to override, 2 ANSI/UTF-8 NT strings for each window. The first is the text for the tab & the second is probably the name of the window type to use. (eg 2, &Answer Wizard\0MsoHelpAWDlg\0&Index\0MsoHelpKeyDlg\0) These are from the Custom tab variables of the [OPTIONS] section of the HHP file. The resources from MSOHELP.EXE have a weird .reg file that gives the CLSIDs involved in the provision of these dialogs. 15 [a]DWORD. Information type checksum. Unknown algorithm & data source. 16 Value of Default Font in [OPTIONS] section of the HHP file. NT [a] Code 17 to code 65535 have not yet been seen. Please let us know if you see these. [a] Not present in files compiled with Compatibility set to 1.0. ### 5.3.4. #IDXHDR This has exactly the same bytes as the code 13 entry in the #SYSTEM file and is 4096 bytes long. Table 5.5. The format of the #IDXHDR file. Offset Type Comment/Value 0 char[4] T#SM 4 DWORD Unknown timestamp/checksum 8 DWORD 1 (unknown) 0xC DWORD Number of topic nodes including the contents & index files 0x10 DWORD 0 (unknown) 0x14 DWORD Offset in the #STRINGS file of the ImageList param of the "text/site properties" object of the sitemap contents (0/-1 = none) 0x18 DWORD 0 (unknown) 0x1C DWORD 1 if the value of the ImageType param of the "text/site properties" object of the sitemap contents is Folder. 0 otherwise. 0x20 DWORD The value of the Background param of the "text/site properties" object of the sitemap contents 0x24 DWORD The value of the Foreground param of the "text/site properties" object of the sitemap contents 0x28 DWORD Offset in the #STRINGS file of the Font param of the "text/site properties" object of the sitemap contents (0/-1 = none) 0x2C DWORD The value of the Window Styles param of the "text/site properties" object of the sitemap contents 0x30 DWORD The value of the ExWindow Styles param of the "text/site properties" object of the sitemap contents 0x34 DWORD Unknown. Often -1. Sometimes 0. 0x38 DWORD Offset in the #STRINGS file of the FrameName param of the "text/site properties" object of the sitemap contents (0/-1 = none) 0x3C DWORD Offset in the #STRINGS file of the WindowName param of the "text/site properties" object of the sitemap contents (0/-1 = none) 0x40 DWORD Number of information types. 0x44 DWORD Unknown. Often 1. Also 0, 3. 0x48 DWORD Number of files in the [MERGE FILES] list. 0x4C DWORD Unknown. Often 0. Non-zero mostly in files with some files in the merge files list. 0x50 DWORD[1004] List of offsets in the #STRINGS file that are the [MERGE FILES] list. Zero terminated, but don't count on it. ### 5.3.5. #WINDOWS This file contains information on the window types in the CHM. It has the following format: Table 5.6. The format of the #WINDOWS header. Offset Type Comment/Value 0 DWORD Number of entries in the file 4 DWORD Size of each of the entries in the file (188 or 196) 8 #WINDOWS entries to the EOF #WINDOWS entries are basically HH_WINTYPE structures as specified in htmlhelp.h. Note the first DWORD can be used to specify different versions of this structure. Also note that the HHW docs show a different structure to htmlhelp.h. Therefore many CHM files need to be surveyed to find structures with sizes other than 188 or 196. In the description of #WINDOWS entries below, Arg n means that that item is argument n of the window definition in the HHP file, either converted to a DWORD or to an offset in the indicated file: Table 5.7. The format of each #WINDOWS entry. Offset Type Comment/Value 0 DWORD Size of the entry (188 in CHMs compiled with Compatibility set to 1.0, 196 in CHMs compiled with Compatibility set to 1.1 or later) 4 DWORD 0 (unknown) - but htmlhelp.h indicates that this is "BOOL fUniCodeStrings; // IN/OUT: TRUE if all strings are in UNICODE" 8 DWORD Arg 0. Offset in #STRINGS file. 0xC DWORD Which window properties are valid & are to be used for this window. See the table below. 0x10 DWORD Arg 10. 0x14 DWORD Arg 1. Offset in #STRINGS file. 0x18 DWORD Arg 14. 0x1C DWORD Arg 15. 0x20 RECT Arg 13. Order left, top, right & bottom. 0x30 DWORD Arg 16. 0x34 DWORD 0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndHelp; // OUT: window handle" 0x38 DWORD 0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndCaller; // OUT: who called this window" 0x3C DWORD 0 (unknown) - but htmlhelp.h indicates that this is "HH_INFOTYPE* paInfoTypes; // IN: Pointer to an array of Information Types" 0x40 DWORD 0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndToolBar; // OUT: toolbar window in tri-pane window" 0x44 DWORD 0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndNavigation; // OUT: navigation window in tri-pane window" 0x48 DWORD 0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndHTML; // OUT: window displaying HTML in tri-pane window" 0x4C DWORD Arg 11. 0x50 BYTE[16] 0 (unknown) - but htmlhelp.h indicates that this is a RECT that is "RECT rcHTML; // OUT: HTML window coordinates" & the HHW docs say "Specifies the coordinates of the Topic pane." 0x60 DWORD Arg 2. Offset in #STRINGS file. 0x64 DWORD Arg 3. Offset in #STRINGS file. 0x68 DWORD Arg 4. Offset in #STRINGS file. 0x6C DWORD Arg 5. Offset in #STRINGS file. 0x70 DWORD Arg 12. 0x74 DWORD Arg 17. 0x78 DWORD Arg 18. 0x7C DWORD Arg 19. 0x80 DWORD Arg 20. 0x84 BYTE[20] 0 (unknown) - but htmlhelp.h indicates that this is "BYTE tabOrder[HH_MAX_TABS + 1]; // IN/OUT: tab order: Contents, Index, Search, History, Favorites, Reserved 1-5, Custom tabs" 0x98 DWORD 0 (unknown) - but htmlhelp.h indicates that this is "int cHistory; // IN/OUT: number of history items to keep (default is 30)" 0x9C DWORD Arg 7. Offset in #STRINGS file. 0xA0 DWORD Arg 9. Offset in #STRINGS file. 0xA4 DWORD Arg 6. Offset in #STRINGS file. 0xA8 DWORD Arg 8. Offset in #STRINGS file. 0xAC BYTE[16] 0 (unknown) - but htmlhelp.h indicates that this is a RECT that is "RECT rcMinSize; // Minimum size for window (ignored in version 1)" Everything after here is only present in CHMs compiled with Compatibility set to 1.1 or later. 0xBC DWORD 0 (unknown) - but htmlhelp.h indicates that this is "int cbInfoTypes; // size of paInfoTypes;" 0xC0 DWORD 0 (unknown) - but htmlhelp.h indicates that this is "LPCTSTR pszCustomTabs; // multiple zero-terminated strings" Table 5.8. Flags used to specify which values are valid. Value[a] Valid property 0x00000002 Navigation pane style. 0x00000004 Window style flags. 0x00000008 Window extended style flags. 0x00000010 Initial window position. 0x00000020 Navigation pane width. 0x00000040 Window show state. 0x00000080 Info types. 0x00000100 Buttons. 0x00000200 Navigation Pane initially closed state. 0x00000400 Tab position. 0x00000800 Tab order. 0x00001000 History count. 0x00002000 Default Pane. [a] The rest of the values either do nothing or are unknown. Please let us know if you find out what they are. ### 5.3.6. #STRINGS This file is a list of ANSI/UTF-8 NT strings. The first is just a NIL character so that offsets to this file can specify zero & get a valid string. The strings are sliced up into blocks that are 4096 bytes in length. If a string crosses the end of a block then it will be cut off without a NT and repeated in full, with a NT, at the start of the next block. For eg "To customize the appearance of a contents file" might become "To customize the <block ending>To customize the appearance of a contents file" when there are 17 bytes left at the end of the block. The last block will be smaller than 4096 bytes if the strings do not fill it up. The strings are in this order; "", [WINDOWS] (Arg 0, Arg 1, Arg 7, Arg 9, Arg 2, Arg 3, Arg 4, Arg 5, Arg 6, Arg 8) no. n..., Contents_0_Entry_title, Index_0_Keyword, Contents_Image_file, Contents_Font, Contents_Default_frame, Contents_Default_window, [MERGE FILES] no. n... ### 5.3.7. #TOCIDX Present in files with a non-empty contents file, when Binary TOC is on and Compatibility is set to 1.1 or later. This file is made up of 0x1000 byte blocks, but this is only apparent because of extra bytes interrupting what would otherwise be a stream of 20/28 byte structs. If the other parts (DWORDs & 16 byte structs) didn't fit into these blocks then presumably this would show up in the other parts too. The first block is the header: Table 5.9. The format of the #TOCIDX header. Offset Type Comment/Value 0 DWORD 4096/header length/offset of no. 1 below 4 DWORD offset of no. 3 below 8 DWORD number of no. 3 below 0xC DWORD offset of no. 2 below 0x10 BYTE[4080] 0 (unknown) The header is followed by the following different types of structs in the specified order: 1. 20/28 byte structs (pages/books) 2. list of DWORDs pointing into the #TOPICS file 3. 16 byte structs - links above stuff First all the top level books/pages, then the next level, then the next & so on Table 5.10. 20/28 byte structures Offset Type Comment/Value 0 WORD 0 (unknown) 2 WORD Unknown 4 DWORD Seems to be a bit field: 0x2 is whether or not New is turned on, 0x4 is set when the entry is a book/has children and 0x8 is set when the entry has a Local value. The other bits are unknown (0x1, 0x40, 0x100 are sometimes set on books). 8 DWORD Unknown. In some cases it is an index into the #TOPICS file of the entry containing offsets to the title & filename. 0xC DWORD Offset to the parent book. 0x10 DWORD Offset to the next book/page in the current book/page. The next two DWORDs are only present in books (28 byte structs) 0x14 DWORD Offset to the first child of the book. 0x18 DWORD 0 (unknown) Table 5.11. 16 byte structures Offset Type Comment/Value 0 DWORD Offset into no. 1 above. 4 DWORD Some kind of sequence number that is incremented by one and starts at 666. I swear :) 8 DWORD Offset into no. 2 above. Can contain RAM litter. 0xC DWORD Index in #TOPICS file of the entry containing offsets to the title & filename. Can contain RAM litter. ### 5.3.8. #TOPICS Note Note that an index into this file should be multiplied by 16 to get the offset of the entry. An index into this file can be converted to an offset in the #URLTBL file, without reading this file using the following formula: offset = (index%341)*12 + index/341*4096 This file contains information on the topics present. Each entry has the following format. Table 5.12. The format of the #TOPICS file entries. Offset Type Comment/Value 0 DWORD Offset into the tree in the #TOCIDX file. 4 DWORD Offset in #STRINGS file of the contents of the title tag or the Name param of the file in question. -1 = no title. 8 DWORD Offset in #URLTBL of entry containing offset to #URLSTR entry containing the URL. 0xC WORD 2 indicates not in contents, 6 indicates that it is in the contents, 0/4 something else (unknown) 0xE WORD 0, 2, 4, 8, 10, 12, 16, 32 (unknown) ### 5.3.9. #URLSTR This file is made up of 0x4000 byte blocks. If the last block is not filled then it will be smaller than 0x4000 bytes. The free space at the end of the blocks is filled with NUL bytes. The blocks contain the following elements one after another: An unknown BYTE. So far this has been 0, 0x42 and in spechsdk.chi it was 0x49. Does not indicate presence/absence of URL/FrameName strings. This is followed by pairs of URL, FrameName strings (both NT) from the HHC. Then come all the Local strings from the HHC: Table 5.13. The format of the #URLSTR entries. Offset Type Comment/Value 0 DWORD Offset of the URL for this topic. 4 DWORD Offset of the FrameName for this topic. 8 ANSI/UTF-8 NT string that is the Local for this topic. There is one way to tell where the end of the URL/FrameName pairs occurs: Repeat the following: read 2 DWORDs and if both are less than the current offset then this is the start of the Local strings else skip two NT strings. ### 5.3.10. #URLTBL An offset in this file can be converted to an index into the #TOPICS file, without reading this file using the following formula: index = ((offset%4096)+((offset/4096)*4096-4))/12 Each 0x1000 byte block has the following format. Table 5.14. The format of the #URLTBL blocks. Offset Type Comment/Value 0 DWORD[3][341] 341 entries. 12 bytes each. 0xFFC DWORD 4096 (unknown) Possibly the length of the block? That MS would pull this kind of shit is really annoying; they should have just put all the entries one after another, not stuffed in an arbitrary DWORD after every 4092 bytes. From this and other blockness I guess they are optimizing for the Wintel platform. Each entry has the following format. Table 5.15. The format of the #URLTBL block entries. Offset Type Comment/Value 0 DWORD Unknown. I suspect that this is either some kind of unique ID or two WORDs. 4 DWORD Index of entry in #TOPICS file. 8 DWORD Offset in #URLSTR file of entry containing filename. ### 5.3.11. #IVB This is basically the [ALIAS] section of the HHP file. The #IVB file begins with a DWORD indicating the total size of all the #IVB entries, which then follow that DWORD. #IVB entries have the following format. Table 5.16. The format of the #IVB entries. Offset Type Comment/Value 0 DWORD The value of the alias 4 DWORD Offset in #STRINGS file of the file to show ### 5.3.12. #SUBSETS This file is present when the [SUBSETS] section is present in the HHP file. Table 5.17. The format of the #SUBSETS header. Offset Type Comment/Value 0 WORD 0 (unknown) 2 WORD Number of bytes taken up by the subset entries. The subset entries currently seem to be garbage left over from previous usage of the same memory locations. Based on the number of bytes per non-whitespace line in the [SUBSETS] section each subset entry is 12 BYTEs in length. ### 5.3.13.$FIftiMain

#### 5.3.13.1. Preface

I would like to acknowledge Jed Wing for reverse-engineering this file and Razvan Cojocaru for pointing out clarity issues in this section.

This file stores information for the full-text search function of HTML Help. It prevents the need for viewers of CHM files to search each and every topic in the CHM to find words or phrases. This speeds up searching considerably, since the index that this file stores, contains data on which words occur in which files and at which locations.

This file will be empty when Full-text search has been turned off or when no files have been indexed. The CHM compiler will only index those files that contain .h in their names. All word sorting, processing and storage is done case-insensitively and is not case-preserving. If you have a word longer than 99 characters in a HTML file then it seems the indexing routines will die during indexing of that file and then skip on to the next one.

The full-text search data is structured as a tree that is more similar to the ITSP directory than the BTree file. The file begins with a header. This is followed by index nodes, leaf nodes and word location codes (WLCs). The index and leaf nodes are a fixed size (specified in the header) and the WLCs are variably sized (specified in the leaf nodes). There is a WLC block just before each leaf node and the pairs of WLC blocks and leaf nodes are interspersed with index nodes. Index nodes only point to nodes (index or leaf) that are further down the tree and at the same time located colser to the start of the file. The index nodes are directly after the last of their child nodes in the file. The structure of this format is best explained by the following diagram of a ficticious $FIftiMain file. For a more extensive set of examples see the example section This file makes use of two different ways of encoding integers in variable length fields; the so called scale and root method and a variant of the ENCINT method used in the PMGL/PMGI directory chunks from the ITSF format. For the ENCINTs in this file the bytes are stored least significant first (little endian), whereas in the PMGL/PMGI chunks they are stored most significant first (big endian). For CHM viewers that want to quickly find out which files a word is in, the procedure is as follows; read the header, seek to the root index node, search the root index node for a word that is less than or equal to the desired, descend to the next index level, repeat the previous two steps as many times as the tree is deep, then search the resulting leaf node until the desired word is found, read the correct part of the WLCs for that leaf node and extract the topic numbers for that word. Once all the topic numbers have been found for the desired words, make sure there are no duplicates and extract the titles and urls from the #TOPICS, #URLTBL, #URLSTR and #STRINGS files. For sample code for decoding$FIftiMain files, see chmdeco, xCHM or dump_fiftimain.c from the old html versions of this specification.

#### 5.3.13.2. Structures

The file begins with a header that is 0x400 bytes in length:

Table 5.18. The format of the $FIftiMain header. Offset Type Comment/Value 0 BYTE[4] 0x00 0x00 0x28 0x00 (unknown) 4 DWORD Number of HTML files indexed after any automatic splitting. 8 DWORD Offset to the single leaf node (if there are no index nodes), or the root index node of the tree (if there are any index nodes). This will be 4096 less than the file length. 0xC DWORD 0 (unknown) 0x10 DWORD The number of leaf nodes in the file. 0x14 DWORD Same as the value at offset 8. 0x18 WORD How many nodes deep the tree is: e.g. 1 if only a single leaf node, 2 if there is a single index node to index among the leaf nodes, 3 if there are 2 levels of index nodes. 0x1A DWORD 7 (unknown) 0x1E BYTE Scale for encoding of the document index in the WLCs 0x1F BYTE Root size for encoding of the document index in the WLCs 0x20 BYTE Scale for encoding of the code count in the WLCs 0x21 BYTE Root size for encoding of the code count in the WLCs 0x22 BYTE Scale for encoding of the location codes in the WLCs 0x23 BYTE Root size for encoding of the location codes in the WLCs 0x24 BYTE[10] 0 (unknown) 0x2E DWORD Size in bytes of each of the leaf and index nodes (4096). 0x32 DWORD 0/1 (unknown) 0x36 DWORD Word index of the last duplicate. 0x3A DWORD Character index of the last duplicate. From the first character of the first word. The whitespace after tags is not included. &amp; type things are counted as one character. Line endings are not counted in this. 0x3E DWORD Length of the longest word in the list not including NT (maximum of 99). 0x42 DWORD Number of words including duplicates. 0x46 DWORD Number of words not including duplicates. 0x4A DWORD The total length of all the words including duplicates is this DWORD plus the next one. It is unknown how the split is performed. 0x4E DWORD This one is usually smaller than the previous one. 0x52 DWORD Total length of all the words not including duplicates. 0x56 DWORD Length of unused/null bytes at the end of the word block (if only 1 block, more than total if > 1 block - possibly some free space in WLC blocks). 0x5A DWORD 0 (unknown) 0x5E DWORD One less than the number of HTML files indexed (not entirely sure). 0x62 BYTE[24] 0 (unknown) 0x7A DWORD Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI)) 0x7E DWORD LCID from the HHP file. 0x82 BYTE[894] 0 (unknown) The index nodes begin with a WORD indicating the length of free space at the end of the node. This is followed by the entries, which fill up as much of the index node as possible. Table 5.19. Index node entries Offset Type Comment/Value 0 BYTE One more than the length of the word/partial word in this entry. 1 BYTE Position in the word where characters are placed. 2 BYTEs Length-1 bytes make up the word or part of the word. Not NT +0 DWORD Offset of the leaf node whose last entry is this word. +4 WORD 0 (unknown) The leaf nodes begin with a short header, which is followed by entries: Table 5.20. Leaf node header Offset Type Comment/Value 0 DWORD Offset to the next leaf node. 0 if this is the last leaf node. 4 WORD 0 (unknown) 6 WORD Length of free space at the end of the current leaf node. Table 5.21. Leaf node entries Offset Type Comment/Value 0 BYTE Length of the word/partial word in this entry plus one. Maximum of 100. 1 BYTE Position in the word where characters are placed. 2 BYTEs Length-1 bytes make up the word or part of the word. Maximum of 99 BYTEs. Not NT +0 BYTE Context (0 for body tag, 1 for title tag, other values are unknown) +1 ENCINT Number of WLC entries. +0 DWORD Offset in this file of the WLC entries for this word. +4 WORD 0 (unknown) +6 ENCINT Number of bytes used by the WLC entries for this word. Each WLC block is made up of bits packed as tightly as possible and is right-padded with 0s to a full byte. The fields are encoded as the scale and root variable-length integer format described below, with the parameters taken from the initial header. "Delta coding" is also used in a couple of places to reduce the size of the codes -- that is, the first value is stored verbatim, and subsequent values are stored as a delta or difference from the previous value. Table 5.22. WLC entries Encoding(s) Name/Comment s/r and delta (across the various entries for a single word) Document index. Indicates in which document this entry is for by way of an index into the #TOPICS file. s/r Code count. The number of times that the word this entry specifies is used in the specified topic. s/r and delta (within each WLC entry) Location codes. A list of word indices (zero based) where this word occurs in the specified topic. #### 5.3.13.3. Scale and root endcoding The scale and root method of integer encoding needs two parameters, which I'll call s (scale) and r (root size). In the context of$FIftiMain files, s always appears to be '2', but any other power of 2 could also work (and might be used in some rare cases).

The integer is encoded as two parts, p (prefix) and q (actual bits). p determines how many bits are stored, as well as implicitly determining the high-order bit of the integer. To encode an integer, p starts out as a single 0. If the integer fits in r bits, you're done. If the integer fits in r+1 bits (i.e. r-th bit is set, counting from 0), prepend a 1 to the p and store the low r bits of the integer in q. Otherwise, while the integer does not fit in the allotted space, prepend a bit to p, and increase the size of q by one bit. It's hard to see from the description, but an example will make it more clear. Using s=2, r=3:

Table 5.23. Example of scale and root encoding

Value p q
0 0 000
1 0 001
2 0 010
7 0 111
8 10 000
9 10 001
10 10 010
15 10 111
16 110 0000
17 110 0001
18 110 0010
30 110 1110
31 110 1111
32 1110 00000
33 1110 00001
34 1110 00010
62 1110 11110
63 1110 11111
64 11110 000000

A scale other than 2 has never been seen, so it is hard to say how s/r encoding works when s=4, etc. The following is how it might work using s=4, r=2 i.e. a base-4 digit is added each time, meaning two bits added each time.:

Table 5.24. Example of scale and root encoding where scale is greater than 2.

Value p q (base 4) q (binary)
0 0 00 0000
1 0 01 0001
14 0 32 1110
15 0 33 1111
16 10 00 0000
17 10 01 0001
30 10 32 1110
31 10 33 1111
32 110 000 000000
33 110 001 000001

Of course, this is all wild speculation, since examples with s other than 2 haven't been seen... But the codes do work this way (i.e. prepending a 1 to the prefix multiplies the additive value 'b' by s and adds another log2(s) bits.)

#### 5.3.13.4. Examples

This is all fairly complex, so some examples will be extremely useful here. First a short ficticious example, then an example from the real world.

The following table is a ficticious example of how the words appear in leaf nodes. The storage of words in index nodes is the same as in leaf nodes, except some fields are gone.

Table 5.25. Example leaf node entries

Stored word Whole word Length+1 to add Position to add at Context
john john 5 0 0
sh josh 3 2 0
ington joshington 7 4 0
joshington 1 10 1 (title)

This example is taken from a copy of windows.chm, the system documentation apparently distributed with some version of Windows 98:

Hex dump of two leaf node entries:

000223d:                                        02 00 31  ...0...........1
0002240: 00 0a 03 04 00 00 00 00 1d 01 01 01 01 20 04 00  ............. ..
0002250: 00 00 00 03


The fields of these two entries are as follows:

Table 5.26. Example leaf node entries

Field 1st entry 2nd entry
New length 2 1
Old length 0 1
Word "1"
Context 0 1
Num WLC ents 0xA 1
Offset 0x403 0x420
Unknown 0 0
WLCs length 0x1D 3

Scanning over to offset 0x403 in the file, we see:

0000403:          f9 f4 60 86 b8 ea 6a 00 ed 78 00 2d c0
0000410: f8 d7 28 2c f0 f6 dc c8 ce 66 61 80 87 02 00 00
0000420: f9 f4 40


Broken out, these WLC entries are:

1          <10, 1027, 29>:  f9 f4 60 86 b8 ea 6a 00 ed 78 00 2d c0 f8 d7 28
2c f0 f6 dc c8 ce 66 61 80 87 02 00 00
1 (TITLE)  <1, 1056, 3>:    f9 f4 40


Now, the parameters for the WLC in this file are 2/2, 2/1, 2/5. Here is a quick reference table for the codes:

p       value       q (bits)
2/1:
0:      0-1         1
10:     2-3         1
110:    4-7         2
1110:   8-15        3
11110:  16-31       4
111110: 32-63       5
2/2:
0:      0-3         2
10:     4-7         2
110:    8-15        3
1110:   16-31       4
11110:  32-63       5
111110: 64-127      6
2/5:
0:      0-31        5
10:     32-63       5
110:    64-127      6
1110:   128-255     7
11110:  256-511     8
111110: 512-1023    9

f9 f4 40 => 1111 1001 1111 0100 0100 0000
2/2 Document index: 111110 011111 => 64 + 31 => Document no. 95
2/1 Code count:     0 1           => 1
2/5 Location codes: 0 00100       => 4       => Word no. 4


So, in document no. 95, word no. 4 is a '1' which is in the title. Now, the ordering of the documents is provided by the #URLTBL and #URLSTR files. Looking up document no. 95 in there (0-based indexing!), we see the file is internet_account.htm, in which, the first non-markup text is:

<title>Dial-Up Networking: Step 1</title>
0: dial
1: up
2: networking
3: step
4: 1


Now, the next one is a little more complicated. I won't go over it in as much detail, but I'll just break it out quickly. It contains 10 entries:

(111110 011111) ( 0 1) (   0 00110  )             0000
(    10 00    ) ( 0 1) (  10 10111  )             000
(  1110 1010  ) ( 0 1) (  10 10100  )             0000000
(  1110 1101  ) ( 0 1) (1110 0000000)             000
(     0 01    ) ( 0 1) (  10 11100  )             0000
(111110 001101) ( 0 1) ( 110 010100 )             0
(     0 01    ) ( 0 1) (  10 01111  )             0000
( 11110 11011 ) ( 0 1) ( 110 011001 )             000
(   110 011   ) (10 0) ( 110 011001 ) ( 10 00011) 0000000
(    10 00    ) ( 0 1) ( 110 000001 )             0
00000000 00000000


Parsing those entries, we get:

Table 5.27. Example WLC entries.

Document number Word numbers
95 6
95+4 = 99 55
99+26 = 125 52
125+29 = 154 128
154+1 = 155 60
155+77 = 232 84
232+1 = 233 47
233+59 = 292 89
292+11 = 303 89 and 89+35 = 124
303+4 = 307 65

Picking one at random, say, Document no. 303 with 2 hits, we open up windows_netsetup_netwin.htm, from which I've generated a wordlist containing all of the words in order:

  0: To(TITLE)
1: set(TITLE)
2: up(TITLE)
..
..
86: client
87: follow
88: steps
89: 1
90: 3
..
122: follow
123: steps
124: 1
125: 3
126: and
..


And we can see the word '1' shows up in precisely the 89th and 124th spots.

The CHM compiler extracts words from all the input files (those with .h in their names), converts them to lowercase and stores them in the $FIftiMain file. Words in the leaf and index nodes are made up of the following characters stored as is: 0x01 (buggy), 0-9, a-z, _, 0xDE, 0xFE. Bytes from the input files are converted and stored as per the table below. Character entity references of the form &#9660; are truncated to BYTEs and translated as per the table below. Character entity references of the form &amp; are treated as whitespace, except for the the latin characters, which are converted as per the table below. Table 5.28. Conversions Before After A-Z a-z 0x8A, 0x9A s 0x8C, 0x9C oe 0x9F, 0xDD, 0xFD, 0xFF, &Yacute;, &yacute;, &yuml; y 0xC0-0xC5, 0xE0-0xE5, &Agrave;, &Agrave;, &Aacute;, &Acirc;, &Atilde;, &Auml;, &Aring;, &agrave;, &aacute;, &acirc;, &atilde;, &auml;, &aring;. a 0xC6, 0xE6, &AElig;, &aelig; ae 0xC7, 0xE7, &Ccedil;, &ccedil; c 0xC8-0xCB, 0xE8-0xEB, &Egrave;, &Eacute;, &Ecirc;, &Euml;, &egrave;, &eacute;, &ecirc;, &euml; e 0xCC-0xCF, 0xEC-0xEF, &Igrave;, &Iacute;, &Icirc;, &Iuml;, &igrave;, &iacute;, &icirc;, &iuml; i 0xD0, &ETH; d 0xD1, 0xF1, &ntilde;, &Ntilde; n 0xD2-0xD8, 0xF0, 0xF2-0xF8, &eth;, &Ograve;, &Oacute;, &Ocirc;, &Otilde;, &Ouml;, &Oslash;, &ograve;, &oacute;, &ocirc;, &otilde;, &ouml;, &oslash; o 0xD9-0xDC, 0xF9-0xFC, &Ugrave;, &Uacute;, &Ucirc;, &Uuml;, &ugrave;, &uacute;, &ucirc;, &uuml; u 0xDF, &szlig; ss &thorn; 0xFE &THORN; 0xDE These conversons may depend on the system codepage, character set, font and language set in the HHP file, but I'm just guessing here. There are a few known bugs that are described in the following few paragraphs. An 0x01 in a word causes the first whitespace character at the end of the word to be included in the word and if the next character is non-whitespace the word is joined to the next word. If the word begins with 0-9 then the word is terminated before the 0x01 and a new word begins at the 0x01. This bug affects the fields in the initial header. For example: abcd0x1efghi-foobar is converted to abcd0x1efghi-foobar. abcd0x1efghi- foobar is converted to abcd0x1efghi- and foobar. 0bcd0x1efghi-foobar is converted to 0bcd and 0x1efghi-foobar. 0bcd0x1efghi- foobar is converted to 0bcd, 0x1efghi- and foobar. Weird bug where if the word is 16 characters in length then the word is doubled plus the first 7 chars in length. There is a weird feature that if a word starts with 0-9 then it may contain multiple periods (0x2E = '.') or commas (0x2C = ',') embedded in the word before the non-period, non-comma word terminating character. I think this feature is so that the user can search for version numbers or numbers with a decimal point or thousands separator in them. Note that commas are removed from the word, while periods are not. For example v1.1.234.5......,6 will become v1 and 1.234.5......6. There is a weird bug involving words ending in single quote (') being forgotten when the same word is also normal and also ending in a period (.). There are probably many more hidden bugs and features in the word converter (I think its the the ITIR.StdWordBreaker class in ITIRCL.DLL). ### 5.3.14.$OBJINST

From the name and the number of GUIDs present I guess it has something to do with ActiveX objects. Seems like it can be deleted without major consequence.

Table 5.29. The format of the $OBJINST header. Offset Type Comment/Value 0 DWORD 0x04000000 (unknown) 4 DWORD Number of entries This is followed by an listing, and each listing entry is as follows Table 5.30. The format of the$OBJINST listing entries.

Offset Type Comment/Value
0 DWORD Offset of the entry in this file
4 DWORD Length of the entry

The listing is followed by the entries one after another at offsets specified in the listing.

There are 2 known types of entries. The first seems to be made up of up to 3 different sub entries. The second is a 36 BYTE structure.

Table 5.31. The first entry

Offset Type Comment/Value
0 GUID {4662DAAF-D393-11D0-9A56-00C04FB68BF7}
0x10 DWORD 0x04000000 (unknown) Possibly a big-endian version number of the class that the GUID refers to.
0x14 DWORD Unknown. Methinks bitflags that somehow affect the size of entries that have the 0x04000000 DWORD, like each bit specifies the presence/absence of a specific subentry.
0x18 DWORD Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI))
0x1C DWORD LCID from the HHP file.
0x20 BYTEs Unknown
+0 Entries

I haven't been able to find any files without the data for bits 0 & 1 so I can't really say exactly how big the header is and which bytes are part of the bit 0 block and which are part of the bit 1 block. Together, though, bits 0 & 1 account for a large bulk of repeatedly increasing byte blocks of 10 bytes each, plus something else at the end. I suspect that the repeats are for bit 0 and the stuff at the end is bit 1. As to the function of these two bits blocks, well there are no GUIDs and no other clues, so who knows.

Table 5.32. bit 2. Only present when "Full text search stop list file" has been specified in the HHP.

Offset Type Comment/Value
0 char[4] ""(\0
4 DWORD Length in bytes of the entries not including the last zero word.
8 BYTE[32] 0 (unknown)
0x28 Entries. The last entry has a zero length word.

Table 5.33. bit 2 entries

Offset Type Comment/Value
0 WORD Length of the word
2 char[length] ANSI/UTF-8 string from the stop list file, may be uniqified & sorted reverse alphabetically. Not NT.

Table 5.34. bit 3

Offset Type Comment/Value
0 GUID {8FA0D5A8-DEDF-11D0-9A61-00C04FB68BF7}
0x10 DWORD 0x04000000 (unknown) Possibly a big-endian version number of the class that the GUID refers to.
0x14 DWORD 1 (unknown)
0x18 DWORD Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI))
0x1C DWORD LCID from the HHP file.
0x20 DWORD 0 (unknown)

Table 5.35. The second entry

Offset Type Comment/Value
0 GUID {4662DAB0-D393-11D0-9A56-00C04FB68B66}
0x10 DWORD 666 (May represent the version of the class that the GUID refers to)
0x14 DWORD Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI))
0x18 DWORD LCID from the HHP file.
0x1C DWORD Unknown. Almost always 10031. Also 66631 (accessib.chm from the MSDN).
0x20 DWORD 0 (unknown)

### 5.3.15. $WWAssociativeLinks &$WWKeywordLinks

#### 5.3.15.1. Preface

The files in the $WWAssociativeLinks and$WWKeywordLinks directories have the same formats. The maximum total length (including parents) of an entry in one of these files is 488 characters (including NT). HHW complains about and refuses to output any that are greater than this length.

The $WWKeywordLinks dir specifies the contents of the Index navigation pane & the$WWAssociativeLinks dir specifies the Alinks.

#### 5.3.15.2. BTree

In CHW files this is named BTREE and in CHI/CHM files it is named BTree.

This file has a 76 byte header, then 2048 byte blocks. First come all the listing blocks, then all the index blocks. This file is similar to the directory entries in the ITSF format, except that the index blocks are at the end instead of interspersed with the listing blocks. All block indices below are zero based. This file forms a tree, with the last (index mostly) block being the root of the tree. If there is more than one level of index blocks then the root block will have two children; the first in the block header and the second in the entry. WARNING: just as in the ITSF directory there can be garbage in the free space, so respect that first WORD and use it. I'm not yet sure how the listing blocks are split up, though it is probably the same as the ITSF directory (space filling).

Table 5.36. The format of the BTree header.

Offset Type Comment/Value
0 char[2] ;) (0x3B 0x29) (signature)[a]
2 WORD [a]Flags. Bit 0x2 always 1. Bit 0x0400 1 if directory?? (this is always on)
4 WORD Size of the blocks (2048)
6 BYTE[16] Always X44. [a]It is likely to be a string describing format of data
'L' = DWORD (indexed)
'F' = NUL-terminated string (indexed)
'i' = NUL-terminated string (indexed)
'2' = WORD
'4' = DWORD
'z' = NUL-terminated string
'!' = DWORD count value, count/8 * record
DWORD filenumber
DWORD TopicOffset

0x16 DWORD 0 (unknown)
0x1A DWORD Index of the last listing block in the file.
0x1E DWORD Index of the root block in the file.
0x22 DWORD -1 (unknown)
0x26 DWORD Number of blocks
0x2A WORD The depth of the tree of blocks (1 if no index blocks, 2 one level of index blocks, ...)
0x2C DWORD Number of keywords in the file.
0x30 DWORD Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI))
0x34 DWORD LCID from the HHP file.
0x38 DWORD 0 if this a BTREE and is part of a CHW file, 1 if it is a BTree and is part of a CHI or CHM file
0x3C DWORD Unknown. Almost always 10031. Also 66631 (accessib.chm, ieeula.chm, iesupp.chm, iexplore.chm, msoe.chm, mstask.chm, ratings.chm, wab.chm).
0x40 DWORD 0 (unknown)
0x44 DWORD 0 (unknown)
0x48 DWORD 0 (unknown)

[a] These were guessed from the documentation provided with helpdeco by Manfred Winterhoff.

Table 5.37. The format of the BTree listing blocks header.

Offset Type Comment/Value
0 WORD Length of free space at the end of the block.
2 WORD Number of entries in the block.
4 DWORD Index of the previous block. -1 if this is the first listing block.
8 DWORD Index of the next block. -1 if this is the last listing block.

Table 5.38. The format of the BTree listing block entries.

Offset Type Comment/Value
0 WCHARs Value of the first Name entry from the HHK UTF-16/UCS-2. If this is a sub-keyword, then this will be all the parent keywords, including this one, separated by ", ". UTF-16/UCS-2 NT.
+0 WORD 2 if this keyword is a See Also keyword, 0 if it is not.
+2 WORD Depth of this entry into the tree.
+4 DWORD Character index of the last keyword in the ", " separated list.
+8 DWORD 0 (unknown)
+0xC DWORD Number of Name, Local pairs
+0x10 DWORDs or WCHARs DWORDs:Index into the #TOPICS file. UTF-16/UCS-2 NT string: The value of the See Also string.
+0 DWORD Mostly 1 (unknown)
+4 DWORD Zero based index of this entry in the file (not block). Increments by 13 (each entry is 13 more than the last).

Table 5.39. The format of the BTree index blocks header.

Offset Type Comment/Value
0 WORD Length of free space at the end of the block.
2 WORD Number of entries in the block.
4 DWORD Index of a child block.

Table 5.40. The format of the BTree index block entries.

Offset Type Comment/Value
0 WCHARs Value of the first Name entry from the HHK. If this is a sub-keyword, then this will be all the parent keywords, including this one, separated by ", ". UTF-16/UCS-2 NT.
+0 WORD 2 if this keyword is a See Also keyword, 0 if it is not.
+2 WORD Depth of this entry into the tree.
+4 DWORD Character index of the last keyword in the ", " separated list.
+8 DWORD 0 (unknown)
+0xC DWORD Number of Name, Local pairs
+0x10 DWORDs or WCHARs DWORDs:Index into the #TOPICS file. UTF-16/UCS-2 NT string: The value of the See Also string.
+0 DWORD Index of a child block. If it is a listing block then it is the one that starts with the keyword at the start of this entry

#### 5.3.15.3. Data

In CHW files this is named DATA and in CHI/CHM files it is named Data.

This file contains entries that are 13 bytes in length. All known entries have thus far contained the following bytes: 00000000 05000000 80000000 00. AFAICS this file is useless.

#### 5.3.15.4. Map

In CHW files this is named MAP and in CHI/CHM files it is named Map.

Begins with a WORD indicating the number of entries in the file (also the number of listing blocks in the BTree file). Each entry is 2 DWORDs. The first is a cumulative sum of the number of keywords in the BTree listing blocks & the second is a consecutively increasing index number. Both start at zero.

#### 5.3.15.5. Property

In CHW files this is named PROPERTY and in CHI/CHM files it is named Property.

If there are no links of this type in the CHM then this will be a zero DWORD. Othewise it contains the following DWORDs: 0, 0, 0, 0xC, 1, 1, 0, 0. AFAICS this file is pretty much useless.

### 5.3.16. $HHTitleMap The file begins with a WORD indicating the number of entries. Each entry has the following format: Table 5.41. The format of the$HHTitleMap entries.

Offset Type Comment/Value
0 WORD Length of the file stem.
2 BYTEs File stem. ANSI/UTF-8 string. Not NT.
+0 DWORD Unknown.
+4 DWORD Unknown. Same value as previous DWORD.
+8 DWORD LCID of the specified file.

### 5.3.17. $TitleMap The file begins with a WORD indicating the number of entries. Each entry is 68 BYTEs in length and has the following format: Table 5.42. The format of the$TitleMap entries.

Offset Type Comment/Value
0 BYTE[25] File stem. ANSI/UTF-8 NT fixed length string.
0x19 BYTE[25] Unknown. Seems to be RAM litter, but contains paths, file names, zero bytes, DWORDs and mixtures.
0x32 WORD An index number that begins at 1 and is incremented by 1 for each entry.
0x34 DWORD Unknown.
0x38 DWORD Unknown. Same value as previous DWORD.
0x3C DWORD LCID of the specified file.
0x40 DWORD Number of topic nodes including the contents & index files in the specified file.

### 5.3.18. /Path/file.chm/windowtype

It is a cache of user customized bits of the windowtype entry from the #WINDOWS file of the \Path\file.chm CHM file.

Table 5.43. The format of the windowtype entries.

Offset Type Comment/Value
0 DWORD Size of the file in bytes (44)
4 Signed DWORD Position of the left edge of the window.
8 Signed DWORD Position of the top edge of the window.
0xC Signed DWORD Position of the right edge of the window.
0x10 Signed DWORD Position of the bottom edge of the window.
0x14 DWORD Width of the navigation pane in pixels.
0x18 DWORD Non-zero if search highlight is on.
0x1C DWORD Unknown. Not font size, printing options or show state.
0x20 DWORD Non-zero if there is no text of the toolbar buttons.
0x24 DWORD Non-zero if the navigation pane is initially closed.
0x28 DWORD Which navigation tab is currently open.

UTF-16/UCS-2 NT string. Each search item is separated by a UTF-16/UCS-2 Line Feed character. The string is followed by an unknown WORD.

DWORD. Only the lowest 3 bits are used, the other bits are unknown. is controlled by bit 0. is controlled by bit 1. is controlled by bit 2. Note that since previous search results are not stored anywhere as yet HH will uncheck the checkbox even if its bit is on. IMHO this is a bug: HH should automatically search the whole file if there are no previous results and the checkbox is checked.

### 5.3.21. /Path/file.chm/Bookmarks/v1/Count

A DWORD indicating the number of favourites stored for the \Path\file.chm CHM file.

### 5.3.22. /Path/file.chm/Bookmarks/v1/n/Topic

An NT UTF-16/UCS-2 string showing the topic name of bookmark number n (n is zero based).

### 5.3.23. /Path/file.chm/Bookmarks/v1/n/Url

An NT UTF-16/UCS-2 string showing the URL of bookmark number n (n is zero based). It is a fully qualified path into the \Path\file.chm CHM file.

### 5.3.24. #KEY_DELETED

A set of ANSI/UTF-8 NT strings indicating which internal files have been deleted in the new file. Names use backslash (\) instead of forward slash (/) & don't have an initial slash.

### 5.3.25. #KEY_DATA

Table 5.44. The format of the #KEY_DATA file.

Offset Type Comment/Value
0 DWORD[8] Unknown.
0x20 DWORD Length of the name of the old chm.
0x24 BYTEs Name of the old chm. ANSI/UTF-8 NT.