The task was clear: we needed a compression program that could shrink a large amount of data into a small file. Specifically, an SQL dump that was growing rapidly every day had to be compressed and transmitted over a relatively slow line.
The original file:
959M edispo.sql
The first guess: bzip2. bzip2 has a fairly small memory footprint, but
the compression operation took almost 9 minutes on a dual core AMD Opteron.
Decompression took 27 seconds. The resulting file:
35M edispo.sql.bz2
This was way too large. Within a couple of weeks, the file would cross
the 100MB boundary. So we tried another candidate: rzip. rzip uses
a vast amount of memory for its dictionaries, but compression finished after
only 1 minute(!), and the end result is quite impressive. The decompression took 33 seconds,
slightly longer than with bzip2. The resulting file:
4.3M edispo.sql.rz
After this impressive result, we tried another competitor: lzma. With
lzma, we had to wait a very long time again: 14.5 minutes. During all of that
time, the memory of the machine was almost exhausted. Decompression, however,
used almost no memory, and after 50 seconds, the file was
decompressed. The resulting file:
3.7M edispo.sql.lzma
The other compression algorithms we tried produced files well above that size. As it turned out, rzip was not all that useful on smaller files. There is, however, a combined algorithm called lrzip, which uses lzma as a function in the rzip algorithm and is said to compress large files even better while still being useful on smaller ones. Unfortunately, lrzip was not in pkgsrc.
| Algorithm | File size | Compression time | Decompression time | Memory use |
|---|---|---|---|---|
| cp | 959M | 0 min 34 sec | 0 min 34 sec | 0% |
| bzip2 | 35M | 8 min 42 sec | 0 min 27 sec | 2% |
| rzip | 4.3M | 1 min 3 sec | 0 min 33 sec | 20% |
| lzma | 3.7M | 14 min 32 sec | 0 min 50 sec | 90% |
The favorite algorithm in this group is certainly rzip. While lzma still achieves a slightly better compression ratio, it pays for it with a lot of time and memory. If space really is the most important constraint, one should most likely go for lzma. For everyday tasks, however, rzip will most likely do the job just as well.
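To put the trade-off into numbers, here is a small sketch that derives compression ratio and compression throughput from the table. The figures are copied from the measurements above; the function names `ratio` and `throughput` are only chosen for this illustration, and the sizes are treated as plain megabytes.

```python
# Figures copied from the table: compressed size in MB, compression time in seconds.
ORIGINAL_MB = 959
RESULTS = {
    "bzip2": (35.0, 8 * 60 + 42),
    "rzip": (4.3, 1 * 60 + 3),
    "lzma": (3.7, 14 * 60 + 32),
}


def ratio(compressed_mb, original_mb=ORIGINAL_MB):
    """Compression ratio: how many times smaller the output is than the input."""
    return original_mb / compressed_mb


def throughput(seconds, original_mb=ORIGINAL_MB):
    """Input megabytes processed per second of compression time."""
    return original_mb / seconds


if __name__ == "__main__":
    for name, (size_mb, secs) in RESULTS.items():
        print(f"{name:6s} ratio {ratio(size_mb):6.1f}:1  "
              f"throughput {throughput(secs):5.1f} MB/s")
```

The ratios come out at roughly 27:1 for bzip2, 223:1 for rzip and 259:1 for lzma, so lzma's extra 13 and a half minutes buy only a modest improvement over rzip.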
The only problem with rzip is that it cannot be used in a pipe. Thus, it cannot serve as an intermediate step in data processing, nor can it be used as link-layer compression for a protocol. For compressing large files, however, it is most likely the best algorithm you can get.
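For comparison, this is roughly what "pipeable" compression looks like: the data is handled chunk by chunk, and the whole file never has to be available at once. The sketch below uses Python's standard `lzma` module rather than the command-line tools from the test. The helper names are made up for this example, and `FORMAT_ALONE` is chosen on the assumption that it matches the legacy `.lzma` container used above.

```python
import lzma


def compress_stream(chunks):
    """Yield compressed bytes for an iterable of byte chunks, one chunk at a time."""
    compressor = lzma.LZMACompressor(format=lzma.FORMAT_ALONE)
    for chunk in chunks:
        out = compressor.compress(chunk)
        if out:  # the compressor may buffer internally and return nothing yet
            yield out
    yield compressor.flush()  # emit whatever is still buffered


def decompress_stream(chunks):
    """Yield decompressed bytes for an iterable of compressed chunks."""
    decompressor = lzma.LZMADecompressor()  # auto-detects the container format
    for chunk in chunks:
        out = decompressor.decompress(chunk)
        if out:
            yield out


if __name__ == "__main__":
    # Stand-in for an SQL dump arriving in pieces, e.g. from a pipe or socket.
    pieces = [b"INSERT INTO t VALUES (%d);\n" % i for i in range(10000)]
    packed = b"".join(compress_stream(pieces))
    restored = b"".join(decompress_stream([packed]))
    print(len(b"".join(pieces)), "->", len(packed), "bytes,",
          "round trip ok" if restored == b"".join(pieces) else "MISMATCH")
```

A compressor with this shape can sit in the middle of a processing chain, which is exactly the role the text says rzip cannot take.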
For those who wonder why copying the file actually takes longer than decompressing it, the answer is pretty simple: it is easier to load a compressed 3.7M file into memory and write the 959M output file once than to seek on the disk between input and output file. When the file is copied, cp fills its buffer with data from the input file and writes it to the output file. Unless that buffer is pretty large, the hard drive has to seek between the two files all the time, and that takes a lot of time.
As proof: copying the file with a 256M buffer takes only 24 seconds, while copying the file with a 64k buffer takes the full 33 seconds.
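The buffer experiment can be approximated with a few lines of Python. This is only a sketch, not the tool used for the measurement above: `copy_with_buffer` and `time_copy` are names invented here, and the demo uses a small temporary file with scaled-down buffer sizes so that it runs anywhere. On a real test, the operating system's file cache can easily hide the seek effect, so the input should be larger than RAM or the cache should be dropped between runs.

```python
import os
import shutil
import tempfile
import time


def copy_with_buffer(src, dst, buffer_size):
    """Copy src to dst, reading and writing at most buffer_size bytes per step."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, length=buffer_size)


def time_copy(src, dst, buffer_size):
    """Return the wall-clock seconds a single copy takes."""
    start = time.perf_counter()
    copy_with_buffer(src, dst, buffer_size)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.sql")
        dst = os.path.join(tmp, "copy.sql")
        # Small stand-in for the 959M dump.
        with open(src, "wb") as f:
            f.write(b"INSERT INTO t VALUES (1);\n" * 100_000)
        # The text compares 64k against 256M; the large size is scaled down here.
        for size in (64 * 1024, 1024 * 1024):
            print(f"{size:>8d} byte buffer: {time_copy(src, dst, size):.4f} s")
```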