[Sugar-devel] Wikipedia and xz compression possibility

Daniel Drake dsd at laptop.org
Thu Dec 15 15:17:17 EST 2011


I spent some time looking at the possibility of using xz compression
for the Wikipedia activity content, instead of the bzip2 compression
used currently.

In general usage, xz compresses significantly better than bzip2 and
decompresses much faster. So I was hoping to produce a Wikipedia
activity that packs more content into the same disk space and loads
faster. Here are the results of my somewhat unscientific test:

First, I determined that liblzma does provide the block-based access
to the archive that the activity needs.
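The idea behind block-based access can be sketched in Python. The stdlib `lzma` module has no block-seek API, so this sketch approximates it by compressing each article as its own independent xz stream and keeping an index of (offset, length) pairs; the names and structure here are illustrative, not the activity's actual code:

```python
import lzma

# Illustrative articles standing in for the real Wikipedia content.
articles = [b"article one " * 100, b"article two " * 100, b"article three " * 100]

# Build one blob of independently-compressed streams plus an index.
blob = bytearray()
index = []  # (offset, length) per article
for art in articles:
    comp = lzma.compress(art)
    index.append((len(blob), len(comp)))
    blob += comp

def read_article(blob, index, i):
    """Decompress a single article without touching the rest."""
    off, length = index[i]
    return lzma.decompress(bytes(blob[off:off + length]))

assert read_article(blob, index, 1) == articles[1]
```

This is the property that matters for the activity: with a per-block index, reading one article costs one block's worth of decompression, not the whole archive.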


The data file in WikipediaES-31 is es_PE.xml.bz2.processed: 81 MB,
bzip2-compressed with a block size of 900 KB. It takes my desktop 42
seconds to decompress, and the uncompressed size is 295 MB.

If I decompress it and recompress with xz (default settings), the
resulting file is 72 MB and takes 12 seconds to decompress. In other
words, decompression is around 3.5x faster, and the disk space
required drops by around 10%.
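The same kind of comparison can be reproduced at small scale with Python's `bz2` and `lzma` modules. The sample text below is synthetic, so the exact ratios and timings will not match the es_PE.xml figures, which came from real Wikipedia content:

```python
import bz2
import lzma
import time

# Synthetic stand-in for the Wikipedia XML dump (~1 MB of text).
data = b"Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 20000

# Compress once with each algorithm at default settings.
bz = bz2.compress(data)
xz = lzma.compress(data)

# Time a round-trip decompression of each.
t0 = time.perf_counter(); bz2.decompress(bz); t_bz = time.perf_counter() - t0
t0 = time.perf_counter(); lzma.decompress(xz); t_xz = time.perf_counter() - t0

print(f"bz2: {len(bz)} bytes compressed, decompressed in {t_bz:.4f}s")
print(f"xz : {len(xz)} bytes compressed, decompressed in {t_xz:.4f}s")
```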

However, xz -l shows that all the data was packed into a single block:

Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1       1     71,7 MiB    295,0 MiB  0,243  CRC64   es_PE.xml.xz

As with bzip2, we can only decompress an entire block, not part of
it, so this is not usable for us - we would have to decompress the
whole file just to read one article.


If I set the block size to 900 KB (the size Wikipedia uses for bzip2),
things regress:
Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1     336     83,3 MiB    295,0 MiB  0,282  CRC64   es_PE.xml.xz

Eek - it is now 3 MB bigger than the existing bz2 archive.
Decompression time: 14s
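The regression makes sense: each independent block restarts the compressor's match window and pays its own container overhead. This can be demonstrated in miniature with the stdlib `lzma` module, emulating xz's block splitting by compressing fixed-size chunks as separate streams (the 90 KB chunk size is a scaled-down stand-in for the 900 KB block size, not the real setting):

```python
import lzma

# ~2.2 MB of repetitive text as a stand-in for the real dump.
data = b"the quick brown fox jumps over the lazy dog " * 50000

# One stream over the whole input: the compressor sees all repetition.
one_stream = lzma.compress(data)

# Chunked: each piece is compressed independently, losing cross-chunk
# matches and paying per-stream header/footer overhead. Concatenated
# xz streams are still a valid xz file.
block = 90 * 1024
chunked = b"".join(lzma.compress(data[i:i + block])
                   for i in range(0, len(data), block))

print(f"single stream: {len(one_stream)} bytes")
print(f"chunked      : {len(chunked)} bytes")
```

On this sample the chunked output comes out larger, which is the same effect that made the 900 KB-block archive bigger than the bzip2 one.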


I then tried a block size of 3150 KB. Since xz decompresses around
3.5 times faster than bzip2, we can make the blocks 3.5 times larger
without a single block taking any longer to decompress than it does
in the bzip2 solution: 900 * 3.5 = 3150. This gives up the
decompression speedup, but could still save some disk space compared
to bzip2. Results:

Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1      96     78,2 MiB    295,0 MiB  0,265  CRC64   es_PE.xml.xz

Decompression time: 13.2s



Conclusion:

If we accept a disk space *increase* of 3 MB, or around 4% (as with
the 900 KB block size), we get an activity that decompresses pages
3-3.5 times faster.

If we want to avoid the file size increase, we can instead enlarge
the blocks by up to 3.5x without regressing performance-wise.
Matching the performance of the bzip2 solution (i.e. a 3.5x larger
block size) leaves us with about 3 MB of disk savings.


My thoughts: results are positive but not exciting, probably not worth
the time exploring further :(

Daniel

