An Efficient Compressor for XML

(Mirror from AT&T Labs-Research)


News: XMill was implemented in summer 1999 at AT&T Labs Research in New Jersey, USA, and is no longer being developed by its original developers. The XMill source code has moved to a SourceForge project, where release 0.8 is available for download.


XMill is a tool for compressing XML data efficiently. It is based on a regrouping strategy that makes existing compression techniques, such as those used in gzip, more effective: XMill groups XML text strings by their meaning and exploits the similarities between the strings in each group. As a result, XMill typically achieves much better compression ratios than applying a conventional compressor such as gzip directly to the XML file.

XML files are typically much larger than the same data represented in a reasonably efficient domain-specific data format. One of the most intriguing results of XMill is that converting proprietary data formats into XML can in fact improve compression: the compressed XML file can be up to a factor of two smaller than the compressed original file. This improvement is achieved at about the same compression speed.

Download XMill version 0.7

Licensing

XMill is essentially free software. Please read the license carefully: if you do not agree with its terms, do not download or use this software. The license allows you to modify XMill and distribute the modified software, provided you agree to certain terms; among other things, the license agreement under which you distribute it must satisfy certain conditions. To simplify this, the distribution package includes MINTERMS.txt, a default license agreement you may use for distributing the modified software.

Credits

Technical Overview

XMill is based on a grouping technique: text items with similar syntax and meaning are grouped and compressed together. A brief overview of the underlying principles can be found here. The user manual contains a description and the syntax of the pre-defined user compressors. The UPenn technical report MS-CIS-99-26 describes the fundamental principles of XMill in detail.
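The grouping idea can be illustrated with a small Python sketch. This is not XMill's implementation or file format. It is a simplified illustration that assumes text items are grouped by their element path, ignores attributes and mixed content, and uses zlib (the same deflate algorithm used by gzip) as the back-end compressor. The function names group_by_path and compress_grouped are hypothetical.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_by_path(xml_text):
    """Split a document into a structure stream and per-path text containers.

    Simplification (assumption): only element text is collected; attributes
    and text that follows a child element (mixed content) are ignored.
    """
    root = ET.fromstring(xml_text)
    structure = []                   # tag open/close events, kept apart from the text
    containers = defaultdict(list)   # element path -> text values found at that path

    def walk(elem, parent_path):
        path = parent_path + "/" + elem.tag
        structure.append("<" + elem.tag)
        text = (elem.text or "").strip()
        if text:
            containers[path].append(text)
        for child in elem:
            walk(child, path)
        structure.append(">")

    walk(root, "")
    return structure, dict(containers)


def compress_grouped(xml_text):
    """Compress the structure stream and each text container separately."""
    structure, containers = group_by_path(xml_text)
    blobs = {"structure": zlib.compress(" ".join(structure).encode("utf-8"))}
    for path, values in containers.items():
        # Values in one container share syntax and meaning (all names, all
        # ages, ...), which is what the back-end compressor can exploit.
        blobs[path] = zlib.compress("\n".join(values).encode("utf-8"))
    return blobs


# Hypothetical sample data: many records with the same shape.
sample = "<people>" + "".join(
    f"<person><name>N{i}</name><age>{20 + i}</age></person>" for i in range(1000)
) + "</people>"

blobs = compress_grouped(sample)
print("plain zlib:", len(zlib.compress(sample.encode("utf-8"))), "bytes")
print("grouped   :", sum(len(b) for b in blobs.values()), "bytes")
for key, blob in blobs.items():
    print(f"  {key}: {len(blob)} bytes")
```

The sizes printed for this toy input say nothing about XMill's actual performance; the sketch only shows how separating structure from text, and text items from one another by meaning, gives the compressor more homogeneous input.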

Mailing List

If you would like to receive news about updates and new releases of XMill, please subscribe to xmill@research.att.com. To do so, send an email to majordomo@research.att.com with the following body:

subscribe xmill [<address>].

Experiments

We have tested XMill on several large real-life XML data sources. Some of the experimental results and comparisons with existing compressors are shown here. The data sources and their XML representation are described here.


Copyright © 2004 Hartmut Liefke