Examples

We have tested the performance of XMill for six data sources from rather different domains:

Weblog - logs of HTTP requests in Apache WWW servers
SWISS-PROT - information about DNA sequence and protein data
Treebank - linguistic annotations of Wall Street Journal sentences
TPC-D - artificially generated relational database for transaction benchmark tests
Database and Logic Programming (DBLP) - bibliographical database for publications in databases and logic programming
Shakespeare - text of Shakespeare plays annotated with structural information (such as speaker, start/end of an act, ...)

The first four data sources come in a proprietary format, while the DBLP data and Shakespeare were available in XML. The original data sources have a size of 35MB-100MB (except Shakespeare with a size of about 7.6MB) and the converted XML documents have a size of 57MB to 172MB.

Weblog Data

Weblog data contains information about HTTP requests - i.e. information about the host IP number, the URL, the size of the reply packet, etc. For example, a single line in the Weblog data could have the following format:

202.239.238.16|GET /a/b.html HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.abc-net.or.jp/|Mozilla/3.01 [ja] (Win95; I)

The individual values are separated by '|', and missing fields are represented by a single '-'. This data is collected over long periods and can therefore occupy a large amount of disk space. For our purposes, we only consider a small sample with about 350,000 entries. The original file has a size of 57.7MB; using gzip, we can compress it to 5.92MB.

To gain flexibility, we can convert the data into XML, as the following example illustrates:

<apache:entry> 
 <apache:host> 202.229.245.18 </apache:host> 
 <apache:requestLine> GET /a/b.html HTTP/1.0 </apache:requestLine> 
 <apache:contentType> text/html </apache:contentType> 
 <apache:statusCode> 200 </apache:statusCode> 
 <apache:date> 1997/10/01-00:00:02 </apache:date> 
 <apache:byteCount> 4478 </apache:byteCount> 
 <apache:referer> http://www.abc-net.or.jp/ </apache:referer> 
 <apache:userAgent> Mozilla/3.01 [ja] (Win95; I) </apache:userAgent> 
</apache:entry>

The size increases substantially, both for the XML file (172MB) and for its gzip-compressed version (7.59MB). However, XMill compresses the XML file to 3.02MB, i.e. about half the compressed size of the original data format!
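
The conversion from a Weblog line to XML can be sketched in a few lines of Python. This is only an illustration and not the converter used for our tests: the function name weblog_line_to_xml is hypothetical, the element names are copied from the XML example, and the three positions that are '-' in the sample line are not named anywhere in the text, so the sketch leaves them unnamed and skips them. It also assumes that '|' never occurs inside a field value.

```python
from xml.sax.saxutils import escape

# Element names by field position, taken from the XML example.
# Positions 5, 7 and 8 are '-' in the sample line and have no name in the
# text, so they are None here (an assumption) and are never emitted.
FIELD_NAMES = [
    "host", "requestLine", "contentType", "statusCode", "date",
    None, "byteCount", None, None, "referer", "userAgent",
]


def weblog_line_to_xml(line):
    """Convert one '|'-separated Weblog line into an <apache:entry> element."""
    values = line.rstrip("\n").split("|")
    parts = ["<apache:entry>"]
    for name, value in zip(FIELD_NAMES, values):
        if name is None or value == "-":
            continue  # unnamed or missing field: emit nothing
        parts.append(f" <apache:{name}> {escape(value)} </apache:{name}>")
    parts.append("</apache:entry>")
    return "\n".join(parts)


sample = ("202.239.238.16|GET /a/b.html HTTP/1.0|text/html|200|"
          "1997/10/01-00:00:02|-|4478|-|-|http://www.abc-net.or.jp/|"
          "Mozilla/3.01 [ja] (Win95; I)")
xml_entry = weblog_line_to_xml(sample)
print(xml_entry)
```

Missing fields are simply dropped, which mirrors the XML example, where no element appears for a '-' value.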

The SWISS-PROT Protein Database

Like many biological databases, SWISS-PROT is highly nested and therefore very suitable for XML. SWISS-PROT is based on the biological data format EMBL. A sample entry is shown in the following lines. We omit several parts, such as the actual amino-acid sequence and the comments:

ID   108_LYCES      STANDARD;      PRT;   102 AA.
AC   Q43495;
DT   15-JUL-1999 (Rel. 38, Created)
DT   15-JUL-1999 (Rel. 38, Last sequence update)
DT   15-JUL-1999 (Rel. 38, Last annotation update)
DE   PROTEIN 108 PRECURSOR.
OS   Lycopersicon esculentum (Tomato).
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons;
OC   core eudicots; Asteridae; euasterids I; Solanales; Solanaceae;
OC   Solanum.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=CV. VF36; TISSUE=ANTHER;
RX   MEDLINE; 94143497.
RA   CHEN R., SMITH A.G.;
RL   Plant Physiol. 101:1413-1413(1993).
DR   EMBL; Z14088; CAA78466.1; -.
DR   MENDEL; 8853; LYCes;1133;1.
KW   Signal.
FT   SIGNAL        1     30       POTENTIAL.
FT   CHAIN        31    102       PROTEIN 108.
FT   DISULFID     41     77       BY SIMILARITY.
FT   DISULFID     51     66       BY SIMILARITY.
FT   DISULFID     67     92       BY SIMILARITY.
FT   DISULFID     79     99       BY SIMILARITY.
SQ   SEQUENCE   102 AA;  10576 MW;  AFA4875A CRC32;
     MASVKSSSSS SSSSFISLLL LILLVIVLQS QVIECQPQQS CTASLTGLNV CAPFLVPGSP
     TASTECCNAV QSINHDCMCN TMRIAAQIPA QCNLPPLSCS AN
//

In the current release of SWISS-PROT, there are about 80,000 entries. This accumulates to 98.5MB of data (the sequence data and comments are omitted) and the compressed version (gzip) has a size of 15.7MB. It is relatively easy to convert SWISS-PROT data to XML. The following XML document shows one possible way to do so:

<Entry id="108_LYCES" class="STANDARD" mtype="PRT" seqlen="102"> 
 <AC> Q43495</AC> 
 <Mod date="15-JUL-1999" Rel="38" type="Created"> </Mod> 
 <Mod date="15-JUL-1999" Rel="38" type="Last sequence update"> </Mod> 
 <Mod date="15-JUL-1999" Rel="38" type="Last annotation update"> </Mod> 
 <Descr> PROTEIN 108 PRECURSOR</Descr> 
 <Species> Lycopersicon esculentum (Tomato)</Species> 
 <Org> Eukaryota</Org>  <Org> Viridiplantae</Org>   ...  <Org> Solanum</Org> 
 <Ref num="1" pos="SEQUENCE FROM N.A"> 
  <Comment> STRAIN=CV. VF36</Comment> 
  <Comment> TISSUE=ANTHER</Comment> 
  <DB> MEDLINE</DB> 
  <MedlineID> 94143497</MedlineID> 
  <Author> CHEN R</Author>  <Author> SMITH A.G</Author> 
  <Cite> Plant Physiol. 101:1413-1413(1993)</Cite> 
 </Ref> 
 <EMBL prim_id="Z14088" sec_id="CAA78466"> </EMBL> 
 <MENDEL prim_id="8853" sec_id="LYCes" status="1133"> </MENDEL> 
 <Keyword> Signal</Keyword> 
 <Features> 
  <SIGNAL from="1" to="30">     <Descr> POTENTIAL</Descr>      </SIGNAL> 
  <CHAIN from="31" to="102">    <Descr> PROTEIN 108</Descr>    </CHAIN> 
  <DISULFID from="41" to="77">  <Descr> BY SIMILARITY</Descr>  </DISULFID> 
  ...
 </Features> 
</Entry>

Because of the added tags, the file size increases to 158MB. Using XMill with user compressors, it is possible to reduce the size to 8.73MB - i.e. again not much more than half the size of the compressed original file.
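
As a rough illustration of why the conversion is easy, the following Python sketch handles only the ID and DT lines of an entry and produces the corresponding Entry start tag and Mod elements. The function name entry_header_to_xml and the two regular expressions are our own assumptions, derived from the sample entry only; they are not taken from the SWISS-PROT format specification, and all other line codes are ignored. The sketch assumes the entry starts with its ID line.

```python
import re
from xml.sax.saxutils import quoteattr

# Patterns derived from the sample entry only (assumptions, not the full format).
ID_RE = re.compile(r"^ID\s+(\S+)\s+(\S+);\s+(\S+);\s+(\d+) AA\.$")
DT_RE = re.compile(r"^DT\s+(\S+) \(Rel\. (\d+), (.+)\)$")


def entry_header_to_xml(lines):
    """Turn the ID and DT lines of one entry into XML; other codes are skipped."""
    out = []
    for raw in lines:
        line = raw.rstrip()
        m = ID_RE.match(line)
        if m:
            ident, cls, mtype, seqlen = m.groups()
            out.append(f"<Entry id={quoteattr(ident)} class={quoteattr(cls)} "
                       f"mtype={quoteattr(mtype)} seqlen={quoteattr(seqlen)}>")
            continue
        m = DT_RE.match(line)
        if m:
            date, rel, kind = m.groups()
            out.append(f" <Mod date={quoteattr(date)} Rel={quoteattr(rel)} "
                       f"type={quoteattr(kind)}> </Mod>")
    out.append("</Entry>")
    return "\n".join(out)


sample_lines = [
    "ID   108_LYCES      STANDARD;      PRT;   102 AA.",
    "AC   Q43495;",
    "DT   15-JUL-1999 (Rel. 38, Created)",
    "DT   15-JUL-1999 (Rel. 38, Last sequence update)",
    "DT   15-JUL-1999 (Rel. 38, Last annotation update)",
]
header_xml = entry_header_to_xml(sample_lines)
print(header_xml)
```

A complete converter would add one such pattern per line code and would nest the reference lines (RN, RP, RC, ...) inside a Ref element, as in the XML example.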

The TreeBank Linguistic Database

The TreeBank project at the University of Pennsylvania creates, stores, and maintains annotations of sentences from newspapers and other information sources. Our Treebank example contains annotated sentences from the Wall Street Journal. All words and phrases are annotated with their linguistic category, such as NP for noun phrase or DT for determiner. For example, the sentence "A stockbroker is an example of a profession in trade and finance" has the following format in Treebank:


(S (`` ``)
   (NP (DT A) (NN stockbroker))
   (VP (VBZ is)
       (NP (DT an) (NN example)
           (PP (IN of)
               (NP (DT a) (NN profession)
                   (PP (IN in)
                       (NP (NN trade) (CC and) (NN finance)))))))
   (. .))

This LISP-like notation can easily be converted into XML:


<S>
 <NP>
  <DT>A</DT>
  <NN>stockbroker</NN>
 </NP>
 <VP>
  <VBZ>is</VBZ>
  <NP>
   <DT>an</DT>
   <NN>example</NN>
   <PP>
    <IN>of</IN>
    <NP>
     <DT>a</DT>
     <NN>profession</NN>
     <PP>
      <IN>in</IN>
      <NP>
       <NN>trade</NN>
       <CC>and</CC>
       <NN>finance</NN>
      </NP>
     </PP>
    </NP>
   </PP>
  </NP>
 </VP>
 <.>.</.>
</S>

The original data has a size of 39.6MB while the XML representation takes 53.8MB. With gzip, the data can be compressed to 5.97MB and 7.71MB, respectively. Using XMill, however, the same data can be compressed to 2.73MB.
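
The conversion of the bracketed notation into XML amounts to a small recursive parser. The Python sketch below is one possible way to write it, under our own assumptions: the function names tokenize, parse and to_xml are hypothetical, and node labels are copied verbatim into element names. Punctuation labels such as '.' are not legal XML names, so a real converter would have to map them to something else; the sketch does not attempt this.

```python
import re
from xml.sax.saxutils import escape


def tokenize(text):
    """Split bracketed text into '(', ')' and plain tokens."""
    return re.findall(r"\(|\)|[^\s()]+", text)


def parse(tokens, pos=0):
    """Parse one '(LABEL child ...)' node that starts at tokens[pos].

    Returns ((label, children), next_pos); each child is a word (str)
    or a nested (label, children) tuple.
    """
    if tokens[pos] != "(":
        raise ValueError("expected '(' at token %d" % pos)
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        children.append(child)
    return (label, children), pos + 1


def to_xml(node, depth=0):
    """Serialize a parsed node, indenting one space per nesting level."""
    label, children = node
    pad = " " * depth
    if all(isinstance(c, str) for c in children):
        # Leaf node: tag and word on one line, e.g. <DT>A</DT>.
        return f"{pad}<{label}>{escape(' '.join(children))}</{label}>"
    lines = [f"{pad}<{label}>"]
    for c in children:
        if isinstance(c, str):
            lines.append(f"{pad} {escape(c)}")
        else:
            lines.append(to_xml(c, depth + 1))
    lines.append(f"{pad}</{label}>")
    return "\n".join(lines)


sample = "(NP (DT A) (NN stockbroker))"
tree, _ = parse(tokenize(sample))
np_xml = to_xml(tree)
print(np_xml)
```

For the noun phrase used here, the output has the same shape as the first NP element of the XML example.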

The TPC-D Benchmark Database

The TPC benchmark tests are a popular mechanism for evaluating the query and update performance of databases. The TPC-D benchmark is based on a database that models suppliers, items, lines, customers, countries, etc. Altogether, the TPC-D benchmark contains 8 relations. The tuples are generated randomly, using a syntax and a distribution specific to each attribute. For example, dates are always of the form 'YYYY-MM-DD'. Each line in the original TPC-D output format represents a single tuple. Attribute values are separated by '|'. For example, the following line could represent a customer tuple:

4|Customer#000000004|mknn1Sh0NPMz1k5Lw2OB mO|4|14-128-190-5944|2866.83|MACHINERY

Each line in the TPC-D relations is converted into an XML element with subelements representing the attribute values. For example, the customer tuple is represented as follows:


<Customer>
 <Key>4</Key>
 <Name>Customer#000000004</Name>
 <Addr>mknn1Sh0NPMz1k5Lw2OB mO</Addr>
 <Nation>4</Nation>
 <Phone>14-128-190-5944</Phone>
 <Balance>2866.83</Balance>
 <MktSeg>MACHINERY</MktSeg>
</Customer>

Note that this verbose representation increases the size of the database substantially. Our original TPC-D test database has a size of 34.6MB, while the same database in XML has a size of 119MB. Using XMill, it is however possible to compress the XML file to 5.76MB, while gzip only achieves 9.63MB for the original file.
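
To make the size increase concrete for a single tuple, the following Python sketch converts one customer line into a Customer element and compares the two lengths. It is an illustration under stated assumptions: the function name customer_to_xml is hypothetical, the subelement names are those of the XML example, and the sample line is assumed to carry the nation key (4) between the address and the phone number, in the same order as the subelements.

```python
from xml.sax.saxutils import escape

# Subelement names in tuple order, taken from the <Customer> example.
CUSTOMER_FIELDS = ["Key", "Name", "Addr", "Nation", "Phone", "Balance", "MktSeg"]


def customer_to_xml(line):
    """Convert one '|'-separated customer tuple into a <Customer> element."""
    values = line.rstrip("\n").split("|")
    if len(values) != len(CUSTOMER_FIELDS):
        raise ValueError(f"expected {len(CUSTOMER_FIELDS)} fields, got {len(values)}")
    body = "\n".join(f" <{name}>{escape(value)}</{name}>"
                     for name, value in zip(CUSTOMER_FIELDS, values))
    return f"<Customer>\n{body}\n</Customer>"


# Sample tuple; the position of the nation key is an assumption (see above).
customer_line = ("4|Customer#000000004|mknn1Sh0NPMz1k5Lw2OB mO|4|"
                 "14-128-190-5944|2866.83|MACHINERY")
customer_xml = customer_to_xml(customer_line)
print(customer_xml)
print(len(customer_line), "characters as text,", len(customer_xml), "as XML")
```

The growth comes entirely from the repeated start and end tags, which is the kind of redundancy a compressor can remove again.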

The DBLP Bibliography Database

The DBLP Database contains bibliographical references for databases and logic programming research. The underlying data is stored in plain XML files. For example, a single entry could have the following form:


<book key="MeltonS93">
 <author>Jim Melton</author>
 <author>Alan R. Simon</author>
 <title>Understanding the New SQL: A Complete Guide</title>
 <publisher href="db/publishers/mkp.html">Morgan Kaufmann</publisher>
 <year>1993</year>
 <isbn>1-55860-245-3</isbn>
 <url>db/books/dbtext/melton93.html</url>
</book>

The DBLP database has a size of 47.2MB. Using gzip, the file can be compressed to 8.29MB, while XMill compresses the file to 5.93MB using user compressors. Note that XMill does not improve on gzip as much as it does for other data sources, such as the TPC-D or Weblog data. The reason is that a large part of the DBLP database is plain text - titles and author names - whose compression cannot be improved significantly over existing compression techniques. However, the extensibility of XMill allows users to write their own user compressors (e.g. for database publication titles) that can take advantage of the specific characteristics of text items and can therefore yield higher compression rates.
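
The difference between the data sources can be read off the sizes quoted in the text. The short Python calculation below divides the XMill size by the gzip size for three of the data sources; the variable names are our own, and the numbers are exactly those given in the respective sections.

```python
# Sizes in MB as quoted in the text: (gzip result, XMill result).
# For Weblog and TPC-D the gzip figure is for the original (non-XML) file;
# for DBLP the original file is already XML.
sizes_mb = {
    "Weblog": (5.92, 3.02),
    "TPC-D": (9.63, 5.76),
    "DBLP": (8.29, 5.93),
}

# Ratio of XMill output to gzip output; smaller means a larger gain for XMill.
ratios = {name: xmill / gz for name, (gz, xmill) in sizes_mb.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: XMill output is {ratio:.0%} of the gzip output")
```

The XMill output is roughly 51% of the gzip output for the Weblog data and 60% for TPC-D, but about 72% for DBLP, which matches the observation that the text-dominated DBLP data benefits least.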

Shakespeare Plays in XML

The Shakespeare XML Database is a collection of Shakespeare plays that have been converted into XML. The following shows an excerpt from one of the plays:


<STAGEDIR>Enter an Attendant</STAGEDIR>

<SPEECH>
<SPEAKER>Attendant</SPEAKER>
<LINE>News, my good lord, from Rome.</LINE>
</SPEECH>

<SPEECH>
<SPEAKER>MARK ANTONY</SPEAKER>
<LINE>Grates me: the sum.</LINE>
</SPEECH>

The XML file has a size of 7.3MB, which is compressed to 2.06MB using gzip. Using XMill, the file is compressed to 1.66MB. Similar to the DBLP database, the size of the XML file is dominated by plain text, which is compressed within XMill using gzip. The addition of specialized user compressors can improve the compression rate further.