We have tested XMill extensively on several large real-life XML data sources, which are described in detail here. These data sources include weblog data, genetic data, bibliographical data, and linguistic data among others. For these data sources, XMill achieves substantially better compression rates (typically twice as good) as gzip at about the same speed.

The following table gives an overview of the compression rates achieved with gzip and XMill for six data sources. A description of each data source is available by clicking on the source name.

Size(orig) Size(XML) gzip(orig) gzip(xml) xmill // xmill //# xmill <u>
Weblog 57.7 MB 172 MB 5.92 MB 7.59 MB 6.27 MB 4.85 MB 3.02 MB
SwissProt 98.5 MB 158 MB 15.7 MB 18.4 MB 14.8 MB 10.7 MB 8.73 MB
Treebank 39.6 MB 53.8 MB 5.97 MB 7.71 MB 4.12 MB 3.24 MB 2.73 MB
TPC-D 34.6 MB 119 MB 9.63 MB 12.9 MB 9.71 MB 7.22 MB 5.76 MB
DBLP 47.2 MB 8.29 MB 7.82 MB 6.66 MB 5.93 MB
Shakespeare 7.3 MB 2.06 MB 1.96 MB 1.89 MB 1.66 MB

The first four data sources have been converted from domain-specific data formats. For example, the format used for weblog data is based on the Apache Log Format. The last two data sources, DBLP and Shakespeare, are already XML formatted.

One can observe that the size of the XML files increases substantially over the original file formats. Furthermore, the gzip-compressed original file (gzip(orig)) is smaller than the gzip-compressed XML file (gzip(XML)).

We applied three types of XMill compressions: xmill //, xmill //#, xmill <u>. The first compression method, xmill //, does not group the data elements at all. Hence, the results are similar to the result obtained using gzip. The second method, xmill //#, groups the text items depending on their parent element tag in the XML file. As a consequence, the compression rate increases. Until now, no user compression has been applied. The last method, xmill <u>, performs user compression based on pre-specified path expressions that provide information about domain-specific semantics and syntax of text items. The last method provides the best compression.

The following diagram illustrates the compression rate (bits/bytes) with respect to the original XML file. A compression rate of 2 bits/bytes means that the compressed file size is 25% of the uncompressed XML file size. Note that the last two data sources are already provided in XML. Therefore, gzip is only applied to the XML file and no gzip (orig) result is given..

Trading Space for Time

Many compression techniques already exist today and they can typically be classified with respect to their compression rate their compression speed. The compressors that achieve the highest compression rate are typically the slowest, while compressors with less compression are most often much faster. Hence, a classical trade-off between time and space exists.

Our goal is to classify XMill in this context and to compare it with existing compressors with respect to compression rate and compression time. The efficiency of XMill is largely dependent on the underlying compressor. In its standard configuration, XMill is equipped with gzip To obtain a more complete picture about the performance of XMill, we tested a second version of XMill based on bzip, a compressor that achieves typically better compression results, but at a significantly slower speed.

The following diagram shows the results in time and space of applying several existing compressors and XMill to the XML data sources described above. The (logarithmic) Y-axis shows the relative compression time with respect to the compression time of gzip. The X-axis shows the relative compression size with respect to gzip In other words, a point (X=0.5,Y=3) means that the compressor compresses the file twice as good as gzip, but is three times as slow.

All compressors are applied to all six data sources. We encircle the data points obtained for the first four data sources (Weblog, SwissProt, Treebank, TPC-D), since we believe that they represent an important class of XML documents where compression is particularly useful.

One can observe that both, XMill with gzip and XMill with bzip perform significantly better than comparable conventional compressors. Even though the UNIX compress is slightly faster than XMill, its compression rate is far worse. On the other hand, XMill beats all conventional compressors in terms of compression rate. XMill with gzip achieves similar compression rates as bzip, but as a substantially better speed (about 10 times faster!).

One interesting observation is that XMill with user compression (i.e. XMill <u>) is always better (in space and time!) than the corresponding XMill with default compression (i.e. XMill //#). This somewhat surprising result is due to the fact that user compressors pre-compress the data such that the following compression by gzip and bzip is performed on a much smaller data size.

Copyright © 2004 Hartmut Liefke