Silesia corpus
The Silesia corpus is a collection of files used as a benchmark for lossless data compression algorithms. It was created in 2003 as an alternative to the Canterbury corpus and Calgary corpus, motivated by concerns about how well those corpora represented modern files. It contains various data types, including large text documents, executable files, and databases.[1]
Contents
The corpus consists of 12 files totaling 211,938,580 bytes (about 212 MB). The files were chosen to represent data types that the author expected to grow rapidly in size over time, such as computer programs and databases, alongside more traditional compression benchmarks, such as large text files.[1]
File | Size (B) | Description | Type of data |
---|---|---|---|
dickens | 10192446 | The works of Charles Dickens | English text |
mozilla | 51220480 | Executable files for Mozilla 1.0 | Executable |
mr | 9970564 | MRI Images | 3D image |
nci | 33553445 | A database of chemical structures | Database |
ooffice | 6152192 | A shared library from OpenOffice | Executable |
osdb | 10085684 | A Sample MySQL database from the Open Source Database Benchmark | Database |
reymont | 6627202 | The text of the book Chłopi by Władysław Reymont | PDF in Polish |
samba | 21606400 | The source code of Samba 2‑2.3 | Executable |
sao | 7251944 | The SAO star catalogue | Binary database |
webster | 41458703 | The 1913 Webster Unabridged Dictionary | HTML |
xml | 5345280 | Collected XML files | XML |
x-ray | 8474240 | A medical X-Ray | Image |
Total | 211938580 | | |
Because the Silesia corpus has a broader and more modern selection of data types, it is considered a better source of test data for compression algorithms than the Calgary corpus.[2]
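As an illustration of how a corpus like this is typically used, the following minimal Python sketch computes per-file compression ratios with three standard-library compressors. It is not the methodology of the cited sources. The directory name `silesia` and the helper names `compression_ratios` and `benchmark_directory` are assumptions made for this example, and it presumes the corpus has already been unpacked locally.

```python
import bz2
import lzma
import zlib
from pathlib import Path

# Standard-library compressors to compare; each maps bytes -> compressed bytes.
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def compression_ratios(data: bytes) -> dict[str, float]:
    """Return original_size / compressed_size for each compressor."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return {name: len(data) / len(fn(data)) for name, fn in COMPRESSORS.items()}


def benchmark_directory(corpus_dir: str) -> dict[str, dict[str, float]]:
    """Compute ratios for every regular file in corpus_dir."""
    results = {}
    for path in sorted(Path(corpus_dir).iterdir()):
        if path.is_file():
            # Reads the whole file into memory; the largest corpus file is about 51 MB.
            results[path.name] = compression_ratios(path.read_bytes())
    return results


if __name__ == "__main__":
    # "silesia" is an assumed local directory holding the unpacked corpus.
    corpus = Path("silesia")
    if corpus.is_dir():
        for name, ratios in benchmark_directory(str(corpus)).items():
            print(name, {k: round(v, 2) for k, v in ratios.items()})
```

A higher ratio means stronger compression. Comparing the ratios file by file shows how much an algorithm's effectiveness depends on the data type, which is the reason the corpus mixes text, executables, images, and databases.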
References
1. Deorowicz, Sebastian. Universal Lossless Data Compression Algorithms (PDF) (Thesis). Silesian University of Technology. pp. 93–95. Archived from the original (PDF) on 2024-08-28.
2. Gupta, Apoorv; Bansal, Aman; Khanduja, Vidhi (2017-02-22). "Modern lossless compression techniques: Review, comparison and analysis". Second International Conference on Electrical, Computer and Communication Technologies (ICECCT). IEEE: 1–8. doi:10.1109/ICECCT.2017.8117850. ISBN 978-1-5090-3239-6.