To gain flexibility, we can convert the data into XML, as the following example
illustrates:
Each line in the TPC-D relations is converted into an XML element
with subelements representing the attribute values. For example, the
customer tuple is represented as follows:
Weblog Data
Weblog data contains information about HTTP requests - i.e.
information about the host IP number, the URL, the size of the reply
packet, etc. For example, a single line in the Weblog data could have
the following format:
202.239.238.16|GET /a/b.html HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.abc-net.or.jp/|Mozilla/3.01 [ja] (Win95; I)
The individual values are separated by '|', and missing fields
are represented by a single '-'. This data is
collected over long periods and can therefore occupy a large amount of
disk space. For our purposes, we only consider a small sample with
about 350,000 entries. The original file has a size of 57.7MB and,
using gzip, we can compress it to 5.92MB.
Converted into XML, such an entry could look as follows:
<apache:entry>
<apache:host> 202.229.245.18 </apache:host>
<apache:requestLine> GET /a/b.html HTTP/1.0 </apache:requestLine>
<apache:contentType> text/html </apache:contentType>
<apache:statusCode> 200 </apache:statusCode>
<apache:date> 1997/10/01-00:00:02 </apache:date>
<apache:byteCount> 4478 </apache:byteCount>
<apache:referer> http://www.abc-net.or.jp/ </apache:referer>
<apache:userAgent> Mozilla/3.01 [ja] (Win95; I) </apache:userAgent>
</apache:entry>
Converting the data into XML increases the size substantially, both for the XML file (172MB)
and for its compressed version (7.59MB).
However, XMill compresses the XML file to 3.02MB, i.e. about half
the compressed size of the original data format!
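The conversion of a weblog line into XML can be sketched as follows. This is a minimal illustration, not the converter used for the experiments: the function name `weblog_line_to_xml` and the `FIELDS` table are hypothetical, the element names are taken from the XML sample, and the positions that hold '-' in the sample line are treated as unnamed and skipped because their meaning is not given.

```python
from xml.sax.saxutils import escape

# Element names taken from the XML sample; None marks positions whose
# meaning the sample line does not reveal (assumption: they are skipped).
FIELDS = [
    "host", "requestLine", "contentType", "statusCode", "date",
    None, "byteCount", None, None, "referer", "userAgent",
]

def weblog_line_to_xml(line: str) -> str:
    """Convert one '|'-separated weblog line into an <apache:entry> element."""
    values = line.rstrip("\n").split("|")
    parts = ["<apache:entry>"]
    for name, value in zip(FIELDS, values):
        # Skip unnamed positions and fields marked as missing with '-'.
        if name is None or value == "-":
            continue
        parts.append(f"<apache:{name}> {escape(value)} </apache:{name}>")
    parts.append("</apache:entry>")
    return "\n".join(parts)

line = ("202.239.238.16|GET /a/b.html HTTP/1.0|text/html|200|"
        "1997/10/01-00:00:02|-|4478|-|-|http://www.abc-net.or.jp/|"
        "Mozilla/3.01 [ja] (Win95; I)")
print(weblog_line_to_xml(line))
```

Missing fields simply produce no element, which is one reason the XML form is more flexible than the fixed-position format.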
The SWISS-PROT Protein Database
Like many biological databases, SWISS-PROT
is highly nested and therefore very suitable for XML. SWISS-PROT is based
on the biological data format EMBL. A sample entry is shown in the following
lines. We omit several parts, such as the actual amino-acid sequence and the
comments:
ID 108_LYCES STANDARD; PRT; 102 AA.
AC Q43495;
DT 15-JUL-1999 (Rel. 38, Created)
DT 15-JUL-1999 (Rel. 38, Last sequence update)
DT 15-JUL-1999 (Rel. 38, Last annotation update)
DE PROTEIN 108 PRECURSOR.
OS Lycopersicon esculentum (Tomato).
OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons;
OC core eudicots; Asteridae; euasterids I; Solanales; Solanaceae;
OC Solanum.
RN [1]
RP SEQUENCE FROM N.A.
RC STRAIN=CV. VF36; TISSUE=ANTHER;
RX MEDLINE; 94143497.
RA CHEN R., SMITH A.G.;
RL Plant Physiol. 101:1413-1413(1993).
DR EMBL; Z14088; CAA78466.1; -.
DR MENDEL; 8853; LYCes;1133;1.
KW Signal.
FT SIGNAL 1 30 POTENTIAL.
FT CHAIN 31 102 PROTEIN 108.
FT DISULFID 41 77 BY SIMILARITY.
FT DISULFID 51 66 BY SIMILARITY.
FT DISULFID 67 92 BY SIMILARITY.
FT DISULFID 79 99 BY SIMILARITY.
SQ SEQUENCE 102 AA; 10576 MW; AFA4875A CRC32;
MASVKSSSSS SSSSFISLLL LILLVIVLQS QVIECQPQQS CTASLTGLNV CAPFLVPGSP
TASTECCNAV QSINHDCMCN TMRIAAQIPA QCNLPPLSCS AN
//
In the current release of SWISS-PROT, there are about 80,000 entries.
This amounts to 98.5MB of data (the sequence data and comments are omitted),
and the version compressed with gzip has a size of 15.7MB.
It is relatively easy to convert SWISS-PROT data to XML. The following XML document
shows one possible way to do so:
<Entry id="108_LYCES" class="STANDARD" mtype="PRT" seqlen="102">
<AC> Q43495</AC>
<Mod date="15-JUL-1999" Rel="38" type="Created"> </Mod>
<Mod date="15-JUL-1999" Rel="38" type="Last sequence update"> </Mod>
<Mod date="15-JUL-1999" Rel="38" type="Last annotation update"> </Mod>
<Descr> PROTEIN 108 PRECURSOR</Descr>
<Species> Lycopersicon esculentum (Tomato)</Species>
<Org> Eukaryota</Org> <Org> Viridiplantae</Org> ... <Org> Solanum</Org>
<Ref num="1" pos="SEQUENCE FROM N.A">
<Comment> STRAIN=CV. VF36</Comment>
<Comment> TISSUE=ANTHER</Comment>
<DB> MEDLINE</DB>
<MedlineID> 94143497</MedlineID>
<Author> CHEN R</Author> <Author> SMITH A.G</Author>
<Cite> Plant Physiol. 101:1413-1413(1993)</Cite>
</Ref>
<EMBL prim_id="Z14088" sec_id="CAA78466"> </EMBL>
<MENDEL prim_id="8853" sec_id="LYCes" status="1133"> </MENDEL>
<Keyword> Signal</Keyword>
<Features>
<SIGNAL from="1" to="30"> <Descr> POTENTIAL</Descr> </SIGNAL>
<CHAIN from="31" to="102"> <Descr> PROTEIN 108</Descr> </CHAIN>
<DISULFID from="41" to="77"> <Descr> BY SIMILARITY</Descr> </DISULFID>
...
</Features>
</Entry>
Because of the added tags, the file size increases to 158MB. Using
XMill with user compressors, it is possible to reduce the size to 8.73MB,
i.e. again almost half the size of the compressed original file.
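Two steps of such a SWISS-PROT conversion, the ID line and the FT (feature) lines, might be handled as in the sketch below. The function names are hypothetical and the parsing rules are inferred only from the single sample entry shown here, so real SWISS-PROT files may need more cases (for instance, feature lines without a description are not handled).

```python
import re
from xml.sax.saxutils import escape, quoteattr

def parse_id_line(line: str) -> dict:
    """Split an ID line such as 'ID 108_LYCES STANDARD; PRT; 102 AA.'"""
    m = re.match(r"ID\s+(\S+)\s+(\w+);\s+(\w+);\s+(\d+)\s+AA\.", line)
    if m is None:
        raise ValueError(f"not an ID line: {line!r}")
    # Attribute names follow the <Entry> element of the XML sample.
    return dict(zip(("id", "class", "mtype", "seqlen"), m.groups()))

def entry_open_tag(id_line: str) -> str:
    """Build the opening <Entry ...> tag from an ID line."""
    attrs = parse_id_line(id_line)
    rendered = " ".join(f"{k}={quoteattr(v)}" for k, v in attrs.items())
    return f"<Entry {rendered}>"

def feature_to_xml(line: str) -> str:
    """Convert an FT line such as 'FT CHAIN 31 102 PROTEIN 108.'"""
    # Fields: line code, feature key, start, end, free-text description.
    _, key, start, end, descr = line.split(None, 4)
    descr = descr.rstrip(".")
    return (f"<{key} from={quoteattr(start)} to={quoteattr(end)}> "
            f"<Descr> {escape(descr)}</Descr> </{key}>")

print(entry_open_tag("ID 108_LYCES STANDARD; PRT; 102 AA."))
print(feature_to_xml("FT CHAIN 31 102 PROTEIN 108."))
```

Turning positional fields into named attributes is what makes the XML form larger, and it is also what lets a compressor treat each kind of value separately.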
The TreeBank Linguistic Database
The TreeBank project at the University of Pennsylvania creates, stores, and
maintains annotations of sentences from newspapers and other information sources.
Our Treebank example contains annotated sequences from the
Wall Street Journal. All words are annotated with their
linguistic meaning, such as NP for noun phrase or DT for determiner.
For example, the sentence
"A stockbroker is an example of a profession in trade and finance"
has the following format in Treebank:
(S (`` ``)
(NP (DT A) (NN stockbroker))
(VP (VBZ is)
(NP (DT an) (NN example)
(PP (IN of)
(NP (DT a) (NN profession)
(PP (IN in)
(NP (NN trade) (CC and) (NN finance)))))))
(. .))
This LISP-like notation can easily be converted into XML:
<S>
<NP>
<DT>A</DT>
<NN>stockbroker</NN>
</NP>
<VP>
<VBZ>is</VBZ>
<NP>
<DT>an</DT>
<NN>example</NN>
<PP>
<IN>of</IN>
<NP>
<DT>a</DT>
<NN>profession</NN>
<PP>
<IN>in</IN>
<NP>
<NN>trade</NN>
<CC>and</CC>
<NN>finance</NN>
</NP>
</PP>
</NP>
</PP>
</NP>
</VP>
<.>.</.>
</S>
The original data has a size of 39.6MB while the XML representation
takes 53.8MB. With gzip, the data can be compressed
to 5.97MB and 7.71MB, respectively. Using XMill, however,
the same data can be compressed to 2.73MB.
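The conversion from the bracketed Treebank notation to XML can be sketched with a small recursive parser. This is an illustrative sketch, not the tool used for the experiments. One assumption is made explicit: tags such as `.` are not legal XML element names, so the sketch maps them to placeholder names (`PERIOD` and so on) that are hypothetical and not part of Treebank or XMill, and the mapping is not complete for all Treebank tags.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical replacements for tags that are not legal XML element names.
TAG_NAMES = {".": "PERIOD", ",": "COMMA", "``": "OPENQUOTE", "''": "CLOSEQUOTE"}

def tokenize(text):
    """Split bracketed text into '(', ')' and word tokens."""
    return re.findall(r"[()]|[^\s()]+", text)

def parse(tokens, pos=0):
    """Parse one '(TAG child ...)' group; return (tree, next_position)."""
    if tokens[pos] != "(":
        raise ValueError("expected '('")
    tag = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)   # nested constituent
        else:
            child, pos = tokens[pos], pos + 1  # plain word
        children.append(child)
    return (tag, children), pos + 1

def to_xml(tree):
    """Render a parsed tree as a list of XML lines."""
    tag, children = tree
    name = TAG_NAMES.get(tag, tag)
    if all(isinstance(c, str) for c in children):
        # Leaf constituent: keep tag and word on one line.
        return [f"<{name}>{escape(' '.join(children))}</{name}>"]
    lines = [f"<{name}>"]
    for child in children:
        lines.extend(to_xml(child) if isinstance(child, tuple) else [escape(child)])
    lines.append(f"</{name}>")
    return lines

tree, _ = parse(tokenize("(S (NP (DT A) (NN stockbroker)) (. .))"))
print("\n".join(to_xml(tree)))
```

Because every opening bracket becomes a pair of tags, the XML form is larger than the bracketed form, which matches the size increase reported for this data set.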
The TPC-D Benchmark Database
The TPC benchmark tests are a popular
mechanism for evaluating the query and update performance of databases.
The TPC-D benchmark is based on a database that models suppliers,
items, lines, customers, countries, etc. Altogether, the TPC-D benchmark
contains 8 relations. The tuples are generated randomly by using certain
syntax and distributions specific to the attribute. For example, dates
are always of the form 'YYYY-MM-DD'.
Each line in the original TPC-D output format represents a single
tuple. Attribute values are separated by '|'.
For example, the following line could represent a customer tuple:
4|Customer#000000004|mknn1Sh0NPMz1k5Lw2OB mO|14-128-190-5944|2866.83|MACHINERY
In XML, this customer tuple could be represented as follows:
<Customer>
<Key>4</Key>
<Name>Customer#000000004</Name>
<Addr>mknn1Sh0NPMz1k5Lw2OB mO</Addr>
<Nation>4</Nation>
<Phone>14-128-190-5944</Phone>
<Balance>2866.83</Balance>
<MktSeg>MACHINERY</MktSeg>
</Customer>
Note that this verbose representation increases the size of the database
substantially. Our original TPC-D test database has a size of 34.6MB, while
the same database in XML has a size of 119MB.
Using XMill, however, it is possible to compress the XML file to 5.76MB,
while gzip only achieves 9.63MB for the original file.
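The tuple-to-element mapping for TPC-D can be sketched generically: one element per tuple, one subelement per attribute value. The function `tuple_to_xml` and the field list are hypothetical. The sample line shown in the text contains six values, so the sketch uses the six matching element names from the XML sample and leaves out Nation, which has no counterpart in that line.

```python
from xml.sax.saxutils import escape

def tuple_to_xml(line, element, field_names):
    """Convert one '|'-separated tuple into an element with one subelement per value."""
    values = line.rstrip("\n").split("|")
    if len(values) != len(field_names):
        raise ValueError(
            f"expected {len(field_names)} values, got {len(values)}")
    body = "\n".join(
        f"<{name}>{escape(value)}</{name}>"
        for name, value in zip(field_names, values))
    return f"<{element}>\n{body}\n</{element}>"

# Assumed field order for the sample line (six values, no Nation).
CUSTOMER_FIELDS = ["Key", "Name", "Addr", "Phone", "Balance", "MktSeg"]

line = ("4|Customer#000000004|mknn1Sh0NPMz1k5Lw2OB mO|"
        "14-128-190-5944|2866.83|MACHINERY")
print(tuple_to_xml(line, "Customer", CUSTOMER_FIELDS))
```

Each value of only a few characters gains an opening and a closing tag, which explains why the XML version of the database is several times larger than the original.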
The DBLP Bibliography Database
The DBLP Database contains
bibliographical references for databases and logic programming research.
The underlying data is stored in plain XML files. For example, a single entry
could have the following form:
<book key="MeltonS93">
<author>Jim Melton</author>
<author>Alan R. Simon</author>
<title>Understanding the New SQL: A Complete Guide</title>
<publisher href="db/publishers/mkp.html">Morgan Kaufmann</publisher>
<year>1993</year>
<isbn>1-55860-245-3</isbn>
<url>db/books/dbtext/melton93.html</url>
</book>
The DBLP database has a size of 47.2MB. Using gzip, the file
can be compressed to 8.29MB, while XMill compresses the file to 5.93MB
using user compressors.
Note that the XMill compression does not improve as much as for other
data sources, such as the TPC-D or weblog data. The reason is that a large
part of the DBLP database is plain text - titles and author names -
whose compression cannot be improved significantly over existing compression
techniques. However, the extensibility of XMill allows users to write
their own user compressors (e.g. for database publication titles) that can
take advantage of the specific characteristics of text items and
can therefore yield higher compression rates.
<STAGEDIR>Enter an Attendant</STAGEDIR>
<SPEECH>
<SPEAKER>Attendant</SPEAKER>
<LINE>News, my good lord, from Rome.</LINE>
</SPEECH>
<SPEECH>
<SPEAKER>MARK ANTONY</SPEAKER>
<LINE>Grates me: the sum.</LINE>
</SPEECH>
The XML file has a size of 7.3MB, which is compressed to 2.06MB using gzip.
Using XMill, the file is compressed to 1.66MB.
Similar to the DBLP database, the size of the XML file is dominated by
plain text, which is compressed within XMill using gzip.
The addition of specialized user compressors can improve the compression
rate further.
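To compare the data sets, the reported sizes can be turned into a simple ratio: the XMill-compressed XML size divided by the best gzip result for the original format. The sketch below uses only numbers quoted in the sections above. The table name `SIZES_MB` and the label for the last data set are my own, and for DBLP and the last file the original format is already XML.

```python
# (gzip size of original format in MB, XMill size of XML in MB), as reported in the text.
SIZES_MB = {
    "Weblog": (5.92, 3.02),
    "SWISS-PROT": (15.7, 8.73),
    "TreeBank": (5.97, 2.73),
    "TPC-D": (9.63, 5.76),
    "DBLP": (8.29, 5.93),
    "7.3MB XML file": (2.06, 1.66),
}

def xmill_to_gzip_ratio(gzip_mb: float, xmill_mb: float) -> float:
    """Return XMill size as a fraction of the gzip size (lower is better for XMill)."""
    return xmill_mb / gzip_mb

ratios = {name: xmill_to_gzip_ratio(gz, xm) for name, (gz, xm) in SIZES_MB.items()}

# Print data sets from largest to smallest XMill advantage.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:16s} {ratio:.2f}")
```

The text-dominated data sets end up with ratios closest to 1, which is consistent with the observation that plain text leaves little room for improvement over gzip.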
Copyright © 2004 Hartmut Liefke