*********************************************************************
****                                                             ****
****                                                             ****
****                   User Manual for XMill                     ****
****                                                             ****
****                                                             ****
*********************************************************************

Content

1. Introduction
   1.1. Underlying Techniques
2. Options
   2.1. Basic Options
   2.2. Options for White Space and Special String Handling
        2.2.1. Options for White Space Handling
        2.2.2. Options for Special String Handling
3. Path Expressions
   3.1. The '#' symbol
   3.2. Order of Path Expressions
   3.3. Default Grouping
4. User Compressors
5. Options of the Decompressor XDemill
6. The Compression Verbose Mode

*********************************************************************

1. Introduction

XMill is a special-purpose compressor for XML documents that
typically achieves substantially better compression rates than
general-purpose compressors. For large files, we achieved compression
rates twice as good as gzip's compression rate.

Similar to gzip, XMill is a command-line tool that works on a
file-by-file basis. A given file with extension '.xml' is compressed
into a file with extension '.xmi'. Any file without extension '.xml'
is compressed into a file whose name is formed by appending extension
'.xm'. Conversely, the original file is obtained by replacing
extension '.xmi' with extension '.xml' or by removing extension '.xm'.
Alternatively, the user can write the output to the standard output
and optionally read from the standard input.

1.1. Underlying Techniques

XMill is based on a grouping strategy that groups and compresses text
items together based on their semantics. For example, a sequence of
person elements (each with name, age, shoe size, ... subelements) in
an XML document could be rearranged by grouping all names, all ages,
and all shoe sizes together. This will typically lead to a higher
compression ratio, since each group will contain text items with high
similarities. The default grouping strategy considers the parent label
of the text item.
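As a rough illustration of the default grouping by parent label, the
following Python sketch collects text items into one container per
parent label. It is only a sketch under stated assumptions and not
XMill's implementation: the element names People, Person, Name, and
Age are made up for the example, and attributes and text that follows
a nested element are ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_by_parent_label(xml_text):
    """Put each non-empty leading text item into a container named after its parent label."""
    containers = defaultdict(list)
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # elem.text is the text directly after the start tag of 'elem'.
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
    return dict(containers)


# Hypothetical sample document (names are assumptions for illustration).
doc = (
    "<People>"
    "<Person><Name>Ann</Name><Age>34</Age></Person>"
    "<Person><Name>Bob</Name><Age>41</Age></Person>"
    "</People>"
)

containers = group_by_parent_label(doc)
for label, items in containers.items():
    print(label, items)
```

All names end up in one container and all ages in another, so a
conventional compressor applied per container sees similar strings
next to each other.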
Even though this works well, there are cases when a label has
different meanings in different parts of the document (e.g. a <Title>
in <Person> has a different meaning from the <Title> in <Book>) or
when different labels have the same meaning (e.g. a <ChildName> in
<Person> contains a person name like <Name>). Therefore, XMill
provides a powerful regular path expression language for grouping text
items with respect to their meaning.

Each text item is reachable over a 'path' of labels from the root of
the XML document. For example, the social security number of an
employee of a company might be stored in the 'ssno' attribute of
element '<Employee>' in element '<Company>' in the root element
'<Root>'. Hence, the path to the text item is
/Root/Company/Employee/@ssno. Section 3 describes how regular path
expressions are specified and matched with a specific path.

After the text items are grouped in containers, conventional
compressors such as gzip are applied to the containers and exploit
the similarities within each container. Since the number and size of
containers grows with the size of the XML file, a memory window
mechanism is implemented: after the overall size of the containers
reaches a certain user-specified memory window, the containers are
compressed and stored in the output file. The compressed content of
the memory window is called a 'run'. After the 'run' is stored in the
output file, the containers are filled with data again.

In addition to path expressions, the user can also specify how to
"pre-compress" specific text items. For example, the user might want
to replace the 'Age' string by its binary integer representation. Or,
as a more complex case, an IP number might be replaced by four bytes.
XMill allows the user to specify additional "user compressors" that
pre-compress the text items before they are stored in the containers.
Note that the gzip compression is still applied to the containers
afterwards. XMill provides an interface for writing custom user
compressors in C++.
This is particularly useful for domain-specific data, such as DNA
sequences, 3D coordinates, etc.

2. Options

2.1. Basic Options

The XMill compression can be fine-tuned by several options. The set of
available options of XMill is given by the following syntax:

   xmill [-i file] [-v] [-p path] [-m num] [-1..9] [-c] [-d] [-f]
         [-w] [-h] file...

with

   -i file  - include options from file
   -v       - verbose mode
   -p path  - define path expression
   -m num   - set memory limit
   -1..9    - set the compression factor of zlib (default=6)
   -c       - write on standard output
   -d       - delete input files
   -f       - force overwrite of output files
   -w       - preserve white spaces
   -h       - show extended white space options and user compressors

The options have the following meaning:

   -i file  - Reads additional options from text file 'file'. The
              text file must contain a sequence of options that are
              separated by white spaces. The external option file is
              particularly useful for path expressions (Section 3).

   -v       - Outputs additional information about the grouping and
              the compression ratios achieved by the compressor for
              the given setting. The format of the output is
              described in Section 6.

   -p path  - Defines a new path expression to group text items
              together. Path expressions are described in Section 3.

   -m num   - Sets the memory window limit for compression. The
              compressor fills the memory window until it is full,
              then compresses it and writes the compressed output to
              the file. Then, the parsing continues and the memory
              window is filled again. A smaller memory window creates
              a larger number of smaller 'runs' in the output file.
              Hence, the compression rate will slightly decrease. The
              window limit is specified in MByte. The default memory
              window limit is 8 MByte.

   -1..9    - Sets the compression factor of the zlib library. The
              zlib library is the programming interface to gzip. It
              is based on the Lempel-Ziv compression method. The
              Lempel-Ziv compression can yield better results for
              higher compression factors, but can also be
              substantially slower.
              The default value is 6.

   -c       - Writes the compressed output to the standard output
              instead of to a file. This allows the user to pipe the
              output directly into some other process. If no file is
              specified in the command line and option '-c' is
              specified, then the program also reads the input from
              the standard input.

   -d       - Deletes the input XML files after the compression. The
              default is to keep the input file.

   -f       - Overwrites any existing compressed files. If this
              option is not specified, then the user is asked for
              every existing output file whether it should be
              overwritten.

   -w       - Preserves all white spaces in the XML document. If the
              option is not specified, only the elements and the text
              strings are stored in the compressed file. The
              decompressor can optionally add different kinds of
              indentation to produce a legible XML document.
              Preserving white spaces guarantees the exact
              preservation of the XML document, provided that the
              specified user compressors are not "lossy" (see
              Section 4). The preservation and storage of white
              spaces can be controlled through more options described
              in Section 2.2.

   -h       - Shows the extended options and lists all predefined
              user compressors.

   file...  - XML file(s) to be compressed. For each file named
              'name.xml', a compressed file 'name.xmi' will be
              created.

2.2. Options for White Space and Special String Handling

Several extended options are available to control the handling of
white spaces and special XML strings, such as DTDs or processing
instructions. The general syntax of those extended options is as
follows:

   xmill ... [-w(i|g|t)] [-l(i|g|t)] [-r(i|g|t)] [-a(i|g)]
             [-n(c|t|p|d)] file ...
with the following short description:

   -wi  - ignore complete white spaces (default)
   -wg  - store complete white spaces in global container
   -wt  - store complete white spaces as normal text
   -li  - ignore left white spaces (default)
   -lg  - store left white spaces in global container
   -lt  - store left white spaces as normal text
   -ri  - ignore right white spaces (default)
   -rg  - store right white spaces in global container
   -rt  - store right white spaces as normal text
   -ai  - ignore attribute white spaces (default)
   -ag  - store attribute white spaces in global container
   -nc  - ignore comments
   -nt  - ignore DOCTYPE sections
   -np  - ignore PI sections
   -nd  - ignore CDATA sections

2.2.1. Options for White Space Handling

Before we can describe the exact meaning of each white space option,
the following explanations are necessary. A white space sequence is a
consecutive sequence of white space characters (' ', '\t', '\n', and
'\r'). XMill distinguishes four types of white space sequences that
can occur in XML documents:

1. A "complete white space sequence" is a sequence that is delimited
   by start or end tags on both sides. Consider the following
   example:

      <Employee @name=" Tom " @ssno="123-45-6789" >
         <Address> New York City </Address>
         <Age> 34 </Age>
      </Employee>

   All white spaces (newlines, tabulators, ...) between the start tag
   of <Employee> and <Address>, between </Address> and <Age>, and
   between </Age> and </Employee> are complete white space sequences.

2. A "left white space sequence" is a sequence that is delimited by
   an element tag on the left and some text on the right. For
   example, the space between <Address> and New York City and the
   space between <Age> and 34 are left white space sequences.
   Furthermore, the attribute string " Tom " has a left white space
   at the beginning.

3. Analogously, a "right white space sequence" is delimited by some
   text on the left and some element tag on the right.
   For example, the space between New York City and </Address> and
   the space between 34 and </Age> are right white space sequences.
   Furthermore, the trailing space in " Tom " forms a right white
   space sequence.

4. An "attribute white space sequence" is a white space sequence
   between attributes in the start tag. This includes white spaces
   between the element name and the first attribute, the white spaces
   between attributes, and the white spaces between the last
   attribute and the '>' symbol at the end of the start tag. In the
   example above, the spaces between 'Employee' and '@name...',
   between '...=" Tom "' and '@ssno...', and between '...-6789"' and
   '>' are attribute white space sequences.

For the first three types of white spaces, there are three possible
storage techniques:

1. The white spaces are ignored. In this case, the decompressor
   cannot reconstruct the exact original content. Instead, the
   decompressor creates a user-defined indentation using tabulators
   or spaces.

2. The white spaces are treated as normal text. In this case, the
   grouping mechanism and user compressors are invoked on each of the
   white space sequences. Left and right white spaces are joined
   together with the text before user compressors are applied. For
   example, ' 34 ' would form a single text item.

3. All white spaces are stored in a separate global container. For
   example, the left white space sequence ' ' and the right white
   space sequence ' ' could both be stored in the global container
   while the string '34' is handled by a user compressor.

For attribute white spaces, only techniques 1 and 3 are possible. If
attribute white spaces are ignored, then the decompressor
automatically inserts a single space ' ' between all attribute
definitions.

2.2.2. Options for Special String Handling

There are four types of special XML strings: comments (of the form
'<!-- ... -->'), DTDs ('<!DOCTYPE ... >'), processing instructions
('<? ... ?>'), and CDATA sections ('<![CDATA[ ... ]]>').
By default, all special XML strings are stored in the compressed file
in the same global container. Using options '-nc', '-nt', '-np', and
'-nd', some of the special XML strings can be ignored.

3. Path Expressions

Path expressions are used to group text items together. For example,
one might want to group all social security numbers of employees
together. All text items in the same group are stored and compressed
in the same container. Because of the strong similarities between the
text items, substantially higher compression rates can be achieved.

Path expressions can be specified through option '-p...'. Each path
expression describes paths from the root of the XML document to some
node within the document. The language of path expressions is based
on XPath. Each path must start with symbol '/' and the components of
the path are separated by '/'. For example, '/Company/Employee/@ssno'
denotes the social security numbers of employees in company elements.
The symbol '@' precedes attributes.

The special label '*' matches any label. For example,
'/*/Employee/@ssno' matches the SS# of any employee that is located
one level deep in the XML document.

The separator '//' can be used instead of '/' to specify an arbitrary
sequence of labels between two labels. For example,
'/Company//Employees' describes employees *anywhere* within a company
element. Or, '//Employees' describes all employees in the entire
document.

It is also possible to specify alternative labels for each item in
the path. For example, (Employee|Secretary) represents employee or
secretary labels and '/Company//(Employee|Secretary)' has the expected
meaning. Note that the parentheses can be omitted, since '|' has a
higher precedence than '/' and '//'.

Each path expression identifies all text items that should be grouped
in one container. Note that text items in deeper nested levels of the
XML document are not included by the same path expression. This is
particularly important in the context of mixed content.
Consider the following XML data:

   <Employee> Alex <middle> P. </middle> Miller </Employee>

Here, a path expression for employees (e.g. '//Employee') would only
group the text items 'Alex' and 'Miller' and not the text item 'P.'.

3.1. The '#'-Symbol

Conventional path expressions only allow us to specify the text items
for one single container. Hence, a separate path expression must be
specified for each container. The special label '#' allows us to group
data items into multiple containers at the same time. Basically, the
'#' label stands for any kind of label. For each possible
instantiation of the '#' label with a specific label, a different data
container is generated. For example, the path expression
'/Employee/#' generates a container for each subelement of 'Employee'.
The path expression '#//' would group text items depending on their
top-level element in the XML document.

There are two additional symbols for specifying multiple container
groupings: '@#' and '##'. While the '#' symbol can be instantiated as
an element or attribute label, the '@#' symbol can only be
instantiated as an attribute. Similar to '//', the '##' symbol
describes a sequence of labels. Each possible path instantiation for
'##' identifies a separate container.

Such generalized path expressions with '#' symbols are typically
ambiguous. For example, consider the path expression '//#//'. It is
not clear what the instantiation of the '#' symbol should be, since it
could be any label on any path. Therefore, we adopt the following
policy: path expressions are always parsed from *right to left*, i.e.
the parsing starts at the leaf node for a specific text item. The
reason for this rather counterintuitive approach is described in [1].
The basic policy for '#' symbols is to instantiate them *as early as
possible* during the parsing, i.e. '#' symbols are instantiated in the
right-most manner. Therefore, the '#' symbol in '//#//' would be
instantiated to the last label in the path.
This label would identify the container for the text item.

3.2. Order of Path Expressions

Several path expressions can identify the same text item. For
example, '//Employee/@ssno' identifies a subset of the text items of
'//*/@ssno'. Therefore, the order of the path expressions in the
command line (or in the option file) determines their precedence: a
text item is assigned to the first path expression that matches it.
For example,

   xmill -p//Employee/@ssno -p//*/@ssno

will group all SS#s of employees in a container separate from all
other SS#s. In contrast,

   xmill -p//*/@ssno -p//Employee/@ssno

would group all SS#s together in one container. There is no text item
that matches the second path expression but does not match the first
path expression.

3.3. Default Grouping

In order to guarantee that each text item has at least one matching
path expression, the two path expressions '//#' and '/' are appended
to the end of the list of user-defined path expressions. The
expression '//#' captures all non-empty paths in the document and
builds containers based on the last label in the path. The expression
'/' captures the empty path and all text items at the root of the XML
tree.

4. User Compressors

Path expressions are a powerful mechanism to group text items with
respect to their meaning. It is quite often the case that these text
items have a similar origin and syntax. For example, social security
numbers will be strings of the form 'DDD-DD-DDDD' where 'D' denotes a
digit. It is often possible to apply additional "semantic compression"
to those strings by taking advantage of the semantic knowledge.

In XMill, user compressors can be used to compress the text items
grouped by the same path expression. Path expressions with user
compressors have the following syntax:

   pathexpr=>usercompressor

For example, to compress all ages of employees as positive integers,
one could write '//Employee/Age=>u'. To convert a social security
number into three integer parts, one could write
'//Employee/@ssno=>seqcomb(u "-" u "-" u)'.
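To make the idea of "semantic compression" more concrete, the
following Python sketch mimics what a specification such as
'seqcomb(u "-" u "-" u)' does conceptually: it splits a string at the
delimiter, parses each piece as an unsigned integer, and rejects the
string if it does not have the expected syntax. This is an
illustration only, not XMill's C++ implementation. The function names
are made up, and the choice to reject pieces with leading zeros (so
that the original string can be rebuilt exactly) is an assumption of
the sketch.

```python
def compress_unsigned(text):
    """Sketch of 'u': return the integer for a plain digit string, or None to reject."""
    if not (text.isascii() and text.isdigit()):
        return None
    value = int(text)
    # Assumption of this sketch: reject leading zeros to stay lossless.
    if str(value) != text:
        return None
    return value


def compress_sequence(text, delimiter="-", parts=3):
    """Sketch of a 'u <delimiter> u <delimiter> u' sequence; None means rejected."""
    pieces = text.split(delimiter)
    if len(pieces) != parts:
        return None
    values = [compress_unsigned(piece) for piece in pieces]
    if any(value is None for value in values):
        return None
    return values


def decompress_sequence(values, delimiter="-"):
    """Rebuild the original string from the integer parts."""
    return delimiter.join(str(value) for value in values)


ssno = compress_sequence("123-45-6789")
print(ssno)                          # three small integers instead of 11 characters
print(decompress_sequence(ssno))     # the original string
print(compress_sequence("unknown"))  # rejected
```

A rejected string would then be handled by the next matching path
expression, as described later in this section.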
XMill currently provides the following user compressors:

   di       - Delta compressor for signed integers
   e        - Compressor for small number of distinct data values
   i        - Compressor for signed integers
   or       - Variant compressor
   p        - Print compressor
   rep      - Compressor for substrings separated by some delimiter
              string
   rl       - Run-length encoder for arbitrary text strings
   seq      - Sequence compressor for strings with separators
   seqcomb  - Combined sequence compressor for strings with separators
   t        - Plain text compressor
   u        - Compressor for unsigned integers
   u8       - Compressor for integers between 0 and 255
   "..."    - Constant compressor

It is important to note, however, that XMill is extensible and that
the programmer can easily add additional special-purpose user
compressors.

Each user compressor consumes strings with a specific syntax. Each
string is replaced by a different (typically much smaller) string that
represents the same information. If the string does not have the
expected syntax, then the user compressor rejects the string and the
next matching path expression with a different user compressor is
considered. If no user compressor accepts the string, then the default
user compressor 't' of the default path expressions '//#' or '/' is
used.

We distinguish two types of user compressors: atomic user compressors
and combined user compressors. While atomic user compressors parse and
compress certain text atoms such as integers, a combined user
compressor is parameterized by other, smaller user compressors. In
the example above, the user compressor 'seqcomb(u "-" u "-" u)' has
the parameters 'u', "-", 'u', "-", and 'u'. In general, user
compressors can have any parameter string (...) following the
identification name and, depending on the user compressor, the
parameter string must satisfy certain restrictions.

Each user compressor operates on a fixed number of containers.
For example, the user compressor 'u' only needs one container (to
store the binary integer values), while 'seq(u "-" u)' requires two
containers: one for each of the integers.

The meaning of the user compressors follows in detail:

   di       - Delta compressor for signed integers. The delta
              compressor stores the changes between successive
              integer numbers. For example, the sequence '10010',
              '10024', '10005', '10006' would be represented as
              '10010', '14', '-19', '1'. Note that integers in
              general are compressed by using a variable length
              encoding scheme: small values are represented by only
              one byte, medium values by two, and large values by
              four bytes.

              The delta compressor can have an integer value as a
              parameter describing the minimal number of characters
              in the output. If a given number is too short, then
              leading zeros are inserted. For example, 'di(3)'
              specifies that each number should have at least 3
              characters. Hence, '23' and '-2' would be displayed as
              '023' and '-02', respectively. Note that the parameter
              only affects the output (i.e. decompression) of the
              string. The input XML file is parsed as before and no
              additional syntactic restrictions are imposed.

   e        - "Dictionary" or "Enumeration" encoder. The dictionary
              encoder replaces each text item by a specific integer
              number that is an index into a dictionary. The
              dictionary is extended whenever a new text item is
              found. In addition to the indices, the user compressor
              must also store the dictionary in the compressed file.
              The dictionaries are stored separately at the beginning
              of the file. Note that dictionaries are particularly
              efficient for path expressions that identify a small
              set of distinct text items, such as airport codes,
              country names, etc.

   i        - Compressor for signed integers. This compressor
              converts signed integers into their binary format.
              Variable length encoding is applied to reduce the size
              of the binary representation.
              Similar to delta encoding, an optional parameter can
              specify the minimal number of characters in the result.

   or       - Variant compressor. The variant compressor tries to
              apply different kinds of compressors to the same text
              item until one of them succeeds. The parameters of the
              variant compressor are the alternative subcompressors
              that should be tested for each string. For example, the
              user compressor string 'or(u seq(u "-" u))' accepts
              simple integers such as '45' or integer pairs such as
              '34-67'. This is useful when related information is
              represented in different ways, such as for page
              numbers. Note that if none of the subcompressors
              accepts the string, then the variant compressor itself
              rejects the string.

              To determine which subcompressor was used, it is
              necessary to keep an additional index value. Therefore,
              the 'or' compressor fills an additional container that
              stores the index.

   p        - Print compressor. The print compressor can be used for
              debugging purposes. It never accepts or compresses any
              string. Instead, it simply prints the string to the
              standard output. This is useful to verify that a
              certain text item always complies with a certain
              format. For example, '/Employee/Age=>or(u p)' will
              print all ages that are not positive integers.

   rep      - Compressor for substrings separated by some delimiter
              string. The repeat compressor compresses a string by
              dividing it at certain delimiters and compressing each
              part separately. For example, to compress a
              comma-separated sequence of keywords, one could use the
              string 'rep("," e)'. The string "," denotes the
              delimiter and 'e' denotes the user compressor applied
              to each string piece, i.e. the keywords. In addition to
              the container(s) required by the subcompressor, one
              extra container is needed to store the number of string
              pieces for a given string. Intuitively, the
              decompressor first reads the number of string pieces
              and then reads that many string pieces from the
              containers of the subcompressor.
              Optionally, a third parameter can be specified to
              compress the last string piece after the last delimiter
              separately. For example, 'rep("," e u)' applied to
              string 'xml,compress,xmill,5' would apply compressor
              'e' to the three text pieces 'xml', 'compress', and
              'xmill' and would apply compressor 'u' to string '5'.

              A current restriction of the repeat compressor is that
              the subcompressor must always be accepting, i.e. one
              could not write 'rep("," u)'. The reason for this is
              that in the case of a rejection, the previous parsing
              and compression of text items would have to be undone.
              The problem can easily be solved by using a variant
              compressor such as 'or(u t)': 'rep("," or(u t))'.

   rl       - Run-length encoder for arbitrary text strings. The
              run-length encoder represents several consecutive
              identical strings by a tuple (len,string) where 'len'
              is the number of identical strings and 'string' is the
              actual string. This is particularly effective for data
              items whose value does not change frequently within
              the file.

   seq      - Sequence compressor for strings with separators. The
              sequence compressor divides a given string along
              delimiter lines and compresses each piece separately by
              different compressors. The parameter of a sequence
              compressor is an alternating sequence of subcompressors
              and delimiter strings, i.e. each subcompressor must be
              followed by a delimiter string and vice versa. For a
              given string, the sequence compressor first traverses
              the string to find the first delimiter. The part of the
              string before the delimiter is compressed using the
              first compressor. Then, the second delimiter is
              identified and so on. The subcompressors parse the
              string pieces separately and store the compressed
              representation in separate containers. For example,
              'seq(u "-" u "-" u)' will store the three integers of
              each string in three different containers.

   seqcomb  - Combined sequence compressor for strings with
              separators. The only difference between 'seqcomb' and
              'seq' is that the containers of the subcompressors
              'overlap'.
              For example, 'seqcomb(u "-" u "-" u)' only uses one
              container to store all three integers. In general,
              'seqcomb' takes the maximum of the number of containers
              required by the subcompressors.

              In practice, 'seqcomb' will often lead to surprisingly
              different compression results. For example, consider
              some structured values (such as SS#, IP address, or
              date) with only a small number of different values. For
              instance, a large number of elements might refer to the
              same IP address as the default router. Then, it is
              often better to store the single IP bytes together
              instead of in four separate containers, since
              Lempel-Ziv can exploit more similarities between entire
              IP numbers.

   t        - Plain text compressor. The plain text compressor is the
              default user compressor and simply copies the text item
              into the container.

   u        - Compressor for unsigned integers. This user compressor
              is very similar to the user compressor 'i' for signed
              integers. It only accepts positive integers. It also
              allows an optional additional parameter specifying the
              minimum number of digits.

   u8       - Compressor for integers between 0 and 255. This
              compressor is similar to 'u', but accepts a smaller
              range of integers. These integers are always stored in
              a single byte. As in 'di', 'i', and 'u', an additional
              minimum number of digits can be specified.

   "..."    - Constant compressor. The constant compressor only
              verifies that the given text item is equal to the
              constant. Otherwise, the compressor rejects the string.
              The constant compressor does not store any data in any
              containers. The constant compressor is useful for
              specifying small sets of distinct values. For example,
              one might specify that a price could be a number, 'low'
              or 'high': '//price=>or(u "low" "high")'.

5. Options of the Decompressor XDemill

The options of the decompressor are largely used for controlling the
output format if white spaces are omitted on the compressor side.
The general syntax of the decompressor XDemill is as follows:

   xdemill [-i file] [-v] [-c] [-d] [-f] [-os num] [-ot] [-oz] [-od]
           [-ou] file ...

   -i file  - include options from file
   -c       - write on standard output
   -d       - delete input files
   -f       - force overwrite of output files
   -os num  - output formatted XML with space indentation
   -ot      - output formatted XML with tabulator indentation
   -oz      - output unformatted XML (without white spaces)
   -od      - uses DOS newline convention (default)
   -ou      - uses UNIX newline convention
   file ... - compressed file(s) to be decompressed; for each file
              'name.xmi' a new file called 'name.xml' will be
              generated.

The detailed descriptions of the options follow:

   -i file  - Reads additional options from text file 'file'. The
              text file must contain a sequence of options that are
              separated by white spaces.

   -c       - Writes the XML output to the standard output instead of
              to a file. This allows the user to pipe the output
              directly into some other process. If no file is
              specified in the command line and option '-c' is
              specified, then the program also reads the input from
              the standard input.

   -d       - Deletes the compressed input files after the
              decompression. The default is to keep the input file.

   -f       - Overwrites any existing output files. If this option is
              not specified, then the user is asked for every
              existing output file whether it should be overwritten.

   -os num  - Outputs the XML data with space indentation. The number
              of spaces is specified by 'num'. Note that 'num' can
              also be zero. In this case, element tags are printed on
              separate lines with no indentation. Space indentation
              with one single space is the default, if the compressed
              file does not contain complete white spaces.

   -ot      - Outputs the XML data with tabulator indentation. Each
              indentation level is represented by one tabulator
              character.

   -oz      - No white spaces are added to the XML data. This is the
              default, if the compressed file already stores white
              spaces.
   -od      - Uses DOS notation for printing newlines in formatted
              XML.

   -ou      - Uses UNIX notation for printing newlines in formatted
              XML.

6. The Compression Verbose Mode

To analyse the effect of the specified path expressions and user
compressors, the user can choose the 'verbose mode' (option '-v'),
which prints important statistical information about the compression
achieved for each single container. The information is always printed
to the standard output.

Each 'run' (i.e. the container data stored each time the memory
window is exhausted) produces a separate section in the output. At
the beginning of the section, information about the compression of
the structure, white space, and special data containers is printed.
Afterwards, the compression details for each of the data containers
are displayed. For example, consider the following command line for
compressing the content of weblog data (4 MByte):

   xmill -w -v weblog.xml

Note that we also preserve white spaces in the compressed file
(option '-w'). The verbose output has the following form:

   Structure:       504693 ==> 4923 (0.975444%)
   Whitespaces:     582175 ==> 4459 (0.765921%)
   Special:              0 ==> Small...
   Sum:            1086868 ==> 9382 (0.863214%)

   //# <- CLF:host
      0:   141432 ==> 23811 (16.835652%)
   //# <- CLF:requestLine
      0:   309891 ==> 22199 (7.163487%)
   //# <- CLF:contentType
      0:   101496 ==> 3548 (3.495704%)
   //# <- CLF:statusCode
      0:    40000 ==> 1908 (4.770000%)
   //# <- CLF:date
      0:   200000 ==> 19235 (9.617500%)
   //# <- CLF:byteCount
      0:    42061 ==> 10869 (25.841040%)
   //# <- CLF:referer
      0:   276922 ==> 27856 (10.059150%)
   //# <- CLF:userAgent
      0:   400736 ==> 24348 (6.075820%)
   //# <- CLF:cooke2
      0:     7733 ==> 399 (5.159705%)

   Header:      Uncompressed: 207      Compressed: 176 (85.024155%)
   Structure:   Uncompressed: 504693   Compressed: 4923 (0.975444%)
   Whitespaces: Uncompressed: 582175   Compressed: 4459 (0.765921%)
   Data:        Uncompressed: 1520271  Compressed: 134173 (8.825598%)
                                       --------------------
   Sum:                                Compressed: 143731

The first few lines represent the information about the structure,
white space, and special data containers. The left column shows the
uncompressed size of the data and the right column the compressed
size. Note that the structure and the white spaces compress extremely
well because of the regularity of the data. Since there are no
special data sections in the XML file, the corresponding container
size is zero.

Containers whose uncompressed size is smaller than 3 KByte are not
compressed directly by gzip. Instead, all such containers are grouped
together and compressed in one single gzip run. This is more
efficient, since gzip yields better results for larger data blocks.
The compressed size of small containers is therefore not determinable
and only described as 'Small...'.

The lines afterwards represent the compression of the actual data
containers. Each instantiation of symbol '#' in path expression '//#'
(the default path expression) has an associated container with the
data from the corresponding path(s) in the XML file. For example, all
'CLF:host' fields in the XML file are stored in the first container
and have an accumulated size of 141432 bytes. This compresses to
23811 bytes.
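Judging from the figures in the output, the percentage in parentheses
appears to be the compressed size expressed as a percentage of the
uncompressed size. The following Python sketch, which only re-does
the arithmetic on numbers copied from the output above and is not
part of XMill, checks this reading for the 'CLF:host' container and
for the summary lines.

```python
def ratio_percent(uncompressed, compressed):
    """Compressed size as a percentage of the uncompressed size."""
    return 100.0 * compressed / uncompressed


# (uncompressed, compressed) sizes in bytes, copied from the summary above.
summary = {
    "Header": (207, 176),
    "Structure": (504693, 4923),
    "Whitespaces": (582175, 4459),
    "Data": (1520271, 134173),
}

host_ratio = ratio_percent(141432, 23811)
total_uncompressed = sum(u for u, _ in summary.values())
total_compressed = sum(c for _, c in summary.values())
overall_ratio = ratio_percent(total_uncompressed, total_compressed)

print(f"CLF:host: {host_ratio:.6f}%")
for name, (u, c) in summary.items():
    print(f"{name}: {ratio_percent(u, c):.6f}%")
print(f"Sum: {total_compressed} bytes, about {overall_ratio:.1f}% overall")
```

The compressed sizes of the four summary lines add up to the 'Sum'
value of 143731 bytes, which is about 5.5% of the combined
uncompressed size of those four parts.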

The container compression details are followed by a summary of the
compression achieved. The summary is divided into six parts: 1) the
header, 2) the structure, 3) the white spaces, 4) the special data,
5) the user compressor data, and 6) the actual data. In the example
above, the special data summary and the user compressor summary are
missing since they are empty.

1) The header summary shows the size of the header. The header contains
   essential information about the compressed file, such as the label
   dictionary, the path expressions, the number and size of the
   containers, etc.

2) The structure summary shows the sum of the sizes of all structure
   containers. Recall that each run in the file has a separate
   structure container.

3) The white space summary shows the sum of the sizes of all white
   space containers.

4) The special data summary shows the sum of the sizes of all special
   data containers.

5) The user compressor summary shows the size of the user compressor
   data. Certain user compressors need to store data structures
   separately from the actual data. For example, the dictionary
   compressor must store its dictionaries. Section 6.1 shows a complex
   example of using dictionary compressors.

6) The actual data summary shows the overall size of all data
   containers.

Finally, the overall size of the compressed file can be computed as the
sum of all parts.


6.1. Applying User Compressors

We have illustrated at the beginning of the section how the default
path expression '//#' creates a container for each instantiation of
'#'. As described in Section 4, the compression rate can be improved
substantially by applying user compressors to the text items identified
by the path expressions. The following set of path expressions with
user compressors is used for compressing the weblog data:

   -p//CLF:host=>seqcomb(u8 "." u8 "." u8 "." u8)
   -p//CLF:userAgent=>rt:seq(e "/" e)
   -p//CLF:byteCount=>u
   -p//CLF:cookie2=>e
   -p//CLF:statusCode=>e
   -p//CLF:contentType=>e
   -p//CLF:requestLine=>seq("GET " rep("/" e) " HTTP/1." e)
   -p//CLF:date=>seq(u "/" u8(2) "/" u8(2) "-" u8(2) ":" di(2) ":" di(2))
   -p//CLF:referer=>or(seq("file:" t) seq("http://" or(seq(rep("." e) "/" rep("/" e)) rep("." e))) t)

Here, the user compressor 'seqcomb(u8 "." u8 "." u8 "." u8)' compresses
IP numbers (note that we use 'seqcomb' instead of 'seq' to achieve
higher compression rates). The text items in 'CLF:userAgent' are split
at symbol '/' and stored in two dictionaries for the left and the right
part, respectively. Note that we use the option 'rt' to treat right
white spaces as normal text. Similarly, the other user compressors are
chosen to achieve high compression rates based on the syntax of the
text items.

The following shows the verbose output for the user compressor
specification shown above:

   Structure: 494835 ==> 5412 (1.093698%)
   Whitespaces: 562428 ==> 4474 (0.795480%)
   Special: 0 ==> Small...
   Sum: 1057263 ==> 9886 (0.935056%)

   //CLF:host=>seqcomb(u8 "." u8 "." u8 "." u8) <-
      0: 40000 ==> 15856 (39.640000%)
   //CLF:requestLine=>seq("GET " rep("/" e) " HTTP/1." e) <-
      Enum: 6411 ==> 2792 (43.550148%)
      Enum: 4 ==> Small...
      0: 9833 ==> 2796 (28.434862%)
      1: 28921 ==> 10625 (36.738010%)
      2: 9833 ==> 266 (2.705176%)
      Sum: 55002 ==> 16479 (29.960729%)
   //CLF:contentType=>e <-
      Enum: 92 ==> Small...
      0: 9982 ==> 2079 (20.827489%)
      Sum: 10074 ==> 2079 (20.637284%)
   //CLF:statusCode=>e <-
      Enum: 36 ==> Small...
      0: 10000 ==> 1387 (13.870000%)
      Sum: 10036 ==> 1387 (13.820247%)
   //CLF:date=>seq(u "/" u8(2) "/" u8(2) "-" u8(2) ":" di(2) ":" di(2)) <-
      0: 20000 ==> 45 (0.225000%)
      1: 10000 ==> 34 (0.340000%)
      2: 10000 ==> 33 (0.330000%)
      3: 10000 ==> 44 (0.440000%)
      4: 10000 ==> 385 (3.850000%)
      5: 10000 ==> 3974 (39.740000%)
      Sum: 70000 ==> 4515 (6.450000%)
   //CLF:byteCount=>u <-
      0: 18957 ==> 9238 (48.731339%)
   //CLF:referer=>or(seq("file:" t) seq("http://" or(seq(rep("." e) "/" rep("/" e)) rep("." e))) t) <-
      Enum: 3054 ==> 2027 (66.371971%)
      Enum: 9013 ==> 4717 (52.335515%)
      Enum: 3763 ==> 2561 (68.057401%)
      0: 8873 ==> 156 (1.758143%)
      1: 789 ==> Small...
      2: 8830 ==> 1458 (16.511891%)
      3: 5347 ==> 395 (7.387320%)
      4: 16884 ==> 2441 (14.457475%)
      5: 5347 ==> 1376 (25.734056%)
      6: 10184 ==> 5502 (54.025923%)
      7: 3483 ==> 406 (11.656618%)
      8: 11428 ==> 2441 (21.359818%)
      9: 973 ==> Small...
      Sum: 86206 ==> 23480 (27.237083%)
   //CLF:userAgent=>rt:seq(e "/" e) <-
      Enum: 598 ==> Small...
      Enum: 19562 ==> 3922 (20.049075%)
      0: 9858 ==> 956 (9.697707%)
      1: 13631 ==> 9074 (66.568850%)
      Sum: 43649 ==> 13952 (31.964077%)
   //# <- CLF:cooke2
      0: 7733 ==> 399 (5.159705%)
   //# <- CLF:requestLine
      0: 3570 ==> 598 (16.750700%)
   //# <- CLF:userAgent
      0: 2012 ==> 275 (13.667992%)

   Header: Uncompressed: 3010 Compressed: 1738 (57.740864%)
   Structure: Uncompressed: 494835 Compressed: 5412 (1.093698%)
   Whitespaces: Uncompressed: 562428 Compressed: 4474 (0.795480%)
   User Compressor: Uncompressed: 42533 Compressed: 16019 (37.662521%)
   Data: Uncompressed: 304706 Compressed: 72239 (23.707771%)
   --------------------
   Sum: Compressed: 99882

Similar to the compression without any user compressor specification,
the output is divided into information about structure, white space,
special data, and actual data containers. The main difference from the
default compression based on '//#' is that each instantiation of a path
expression can have several associated containers, which are used by
the underlying user compressor.
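
The 40000-byte container reported for '//CLF:host' is consistent with
storing each IP number in four bytes, as mentioned in Section 1.1. The
sketch below only illustrates that idea in Python. It is not XMill's
implementation: the function names pack_ip and unpack_ip are
hypothetical, and the figure of 10000 log entries is an assumption
inferred from the container sizes above.

```python
def pack_ip(text):
    """Illustrative only: store a dotted IPv4 string as four bytes."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers")
    # bytes() raises ValueError if a number is outside 0..255.
    return bytes(int(p) for p in parts)

def unpack_ip(data):
    """Inverse of pack_ip: rebuild the dotted string.

    Note: leading zeros in the original text (e.g. "010") are not preserved.
    """
    return ".".join(str(b) for b in data)

packed = pack_ip("192.168.0.1")
print(len(packed))          # 4 bytes instead of 11 characters
print(unpack_ip(packed))    # rebuilds the dotted text for this input

# Assumption: 10000 log entries, four bytes each.
print(10000 * len(packed))
```

The general-purpose compressor is still applied to the packed bytes
afterwards, which is why the listing shows a further reduction.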

Consider for example the second path expression:

   //CLF:requestLine=>seq("GET " rep("/" e) " HTTP/1." e) <-
      Enum: 6411 ==> 2792 (43.550148%)
      Enum: 4 ==> Small...
      0: 9833 ==> 2796 (28.434862%)
      1: 28921 ==> 10625 (36.738010%)
      2: 9833 ==> 266 (2.705176%)
      Sum: 55002 ==> 16479 (29.960729%)

The 'seq' compressor parses HTTP request strings of the form
"GET <directory> HTTP/1.?" where <directory> is a UNIX-style directory
path with separator '/'. The protocol can be either 'HTTP/1.0' or
'HTTP/1.1'. Hence, the '?' symbol stands for either '0' or '1'. The
repeat compressor 'rep("/" e)' divides the directory path into its
components and stores the directory names in a dictionary.

There are two dictionary compressors involved. Each of them maintains a
dictionary that is stored in the compressed file. The first dictionary
(in 'rep("/" e)') has 6411 bytes, while the second dictionary has only
4 bytes, since it has only two elements: '0' and '1' for the protocol
version in 'HTTP/1.?'.

Furthermore, the repeat compressor requires one container that stores
the number of path components, i.e. components separated by '/'. This
is container 0 in the list above, with a size of 9833 bytes. The
dictionary compressor 'e' within the repeat compressor stores the
dictionary index for each component of the path. This is container 1 in
the list above, with a size of 28921 bytes. Lastly, the second
dictionary compressor stores the dictionary indices (0 or 1) for the
string '0' or '1' in 'HTTP/1.?'. This is container 2, with a size of
9833 bytes.

The order of the containers in the verbose output is directly derived
from the order of the user compressors in the user compressor string.
However, the additional user compressor data (such as the dictionaries
of dictionary compressors) is stored separately, and its information is
printed at the beginning of the list. The last line contains the sum of
the (un)compressed data.
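
To make the container layout concrete, the following Python sketch
mimics how a request line could be split into the three containers and
two dictionaries described above. It is a simplified model and not
XMill's implementation: the function names dict_index and
split_request, the list-based containers, and the sample request lines
are all invented for illustration, and the real byte-level encoding
differs.

```python
def dict_index(dictionary, item):
    """Return the index of item in the dictionary, adding it if new."""
    if item not in dictionary:
        dictionary[item] = len(dictionary)
    return dictionary[item]

def split_request(line, dir_dict, ver_dict, containers):
    """Simplified model of seq("GET " rep("/" e) " HTTP/1." e) for one item."""
    prefix, suffix = "GET ", " HTTP/1."
    if not line.startswith(prefix) or suffix not in line:
        return False  # pattern does not apply to this text item
    path, version = line[len(prefix):].rsplit(suffix, 1)
    if not path.startswith("/"):
        return False
    # Simplification: the leading '/' is dropped before splitting.
    components = path[1:].split("/")
    containers[0].append(len(components))                 # container 0: count
    for name in components:                               # container 1: indices
        containers[1].append(dict_index(dir_dict, name))
    containers[2].append(dict_index(ver_dict, version))   # container 2: version
    return True

dir_dict, ver_dict = {}, {}
containers = {0: [], 1: [], 2: []}
for request in ["GET /docs/index.html HTTP/1.0",
                "GET /docs/img/logo.gif HTTP/1.1"]:
    split_request(request, dir_dict, ver_dict, containers)

print(containers)
print(dir_dict, ver_dict)
```

In this model the repeated component 'docs' is stored once in the
directory dictionary and referenced by index afterwards, which is the
effect the dictionary compressor 'e' relies on.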

Note the improvement that is achieved over the original compression
using '//#':

   //# <- CLF:requestLine
      0: 309891 ==> 22199 (7.163487%)
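
The size of the improvement can be quantified from the two listings.
The short Python calculation below uses only figures printed in this
section; the variable and function names are illustrative. Note that
with the user compressor specification, 3570 bytes of 'CLF:requestLine'
data are still reported under the default '//#' expression, so both
containers are counted.

```python
# Compressed sizes in bytes, taken from the verbose listings in this section.
default_request_line = 22199          # '//#' container for CLF:requestLine
user_request_line = 16479 + 598       # 'seq(...)' sum plus remaining '//#' container

default_total = 143731                # overall size with '//#' only
user_total = 99882                    # overall size with user compressors

def saving(before, after):
    """Relative size reduction in percent."""
    return 100.0 * (before - after) / before

print(f"requestLine: {default_request_line} -> {user_request_line} bytes "
      f"({saving(default_request_line, user_request_line):.1f}% smaller)")
print(f"whole file:  {default_total} -> {user_total} bytes "
      f"({saving(default_total, user_total):.1f}% smaller)")
```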