Etincelle - Fast efficient compression => x3.06 @ 105MB/s


Post  Yann on Fri 26 Mar - 1:38

Etincelle is a fast and efficient compression program that compresses large files better than Zip.

Etincelle especially shines at compressing large and very large container files (mailboxes, databases, VM images, ISO images, etc.), where it is fast enough to crunch gigabytes of data as quickly as the HDD can feed it. Its ability to work with very large dictionary sizes and to detect repeated compressed segments provides a pretty good compression ratio (see benchmark).

Etincelle belongs to the ROLZ compression family. It uses an order-1 context filter to reduce offset length, followed by a second-pass Huffman entropy coding stage provided by Huff0. It is approximately 3 times faster than Zip's fastest mode, while still compressing large files better than Zip's best mode.
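
To make the "order-1 context reduces offset length" idea more concrete, here is a minimal ROLZ-style sketch in Python. It is not Etincelle's actual format or code: the constants MIN_MATCH, MAX_MATCH and SLOTS, the token layout, and the function names are all assumptions chosen for illustration, and the Huffman stage is left out entirely. The point it shows is that a match is described by a small slot index inside the table of the current context (the previous byte), instead of a full distance into the dictionary.

```python
from collections import deque

# All three constants are hypothetical values picked for this sketch.
MIN_MATCH = 3    # shortest match worth encoding
MAX_MATCH = 255  # longest match a single token may describe
SLOTS = 16       # positions remembered per order-1 context


def _new_table():
    # One bounded list of recent positions for each possible previous byte.
    return [deque(maxlen=SLOTS) for _ in range(256)]


def rolz_encode(data: bytes) -> list:
    """Turn data into ("lit", byte) and ("match", slot, length) tokens."""
    table = _new_table()
    tokens = []
    i = 0
    while i < len(data):
        step = 1
        token = ("lit", data[i])
        if i > 0:
            # Only positions that followed the same previous byte are searched.
            slots = table[data[i - 1]]
            best_len, best_slot = 0, -1
            for slot, pos in enumerate(slots):
                n = 0
                while (n < MAX_MATCH and i + n < len(data)
                       and data[pos + n] == data[i + n]):
                    n += 1
                if n > best_len:
                    best_len, best_slot = n, slot
            if best_len >= MIN_MATCH:
                # The "offset" is a slot index below SLOTS, not a distance.
                token = ("match", best_slot, best_len)
                step = best_len
            # Register the current position after the lookup.
            slots.appendleft(i)
        tokens.append(token)
        i += step
    return tokens


def rolz_decode(tokens) -> bytes:
    """Rebuild the data by maintaining the same tables as the encoder."""
    table = _new_table()
    out = bytearray()
    for tok in tokens:
        i = len(out)
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, slot, length = tok
            pos = table[out[i - 1]][slot]
            for k in range(length):
                # Byte-by-byte copy so overlapping matches work.
                out.append(out[pos + k])
        if i > 0:
            table[out[i - 1]].appendleft(i)
    return bytes(out)
```

With SLOTS set to 16, a match reference needs only 4 bits before entropy coding, whatever the real distance to the earlier occurrence. The price is that the decoder has to maintain the same tables, which is why ROLZ decoding is generally slower than plain LZ77 decoding.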


Download :
RC2 : Etincelle (win32)
- some small compression and speed improvements

Compression evaluation :
In-memory benchmark (-b) results on a sample set totaling about 4GB.
Benchmark platform : Core 2 Duo E8400 (3GHz), Windows 7 32-bit

Version        Compression ratio   Compression speed   Decoding speed
EtincelleRC2   3.062               105 MB/s            202 MB/s
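
This thread reports compression both as a ratio (x3.06 here) and as a compressed size in percent (35.76% for enwik8 further down). The small helper below converts between the two; the function names are only illustrative.

```python
def ratio_to_percent(ratio: float) -> float:
    """Compressed size as a percentage of the original size."""
    return 100.0 / ratio


def percent_to_ratio(percent: float) -> float:
    """Compression ratio (original size / compressed size)."""
    return 100.0 / percent


if __name__ == "__main__":
    print(f"x3.062 -> {ratio_to_percent(3.062):.2f}% of original size")
    print(f"35.76% -> x{percent_to_ratio(35.76):.3f}")
```

Note that the two figures come from different sample sets (the 4GB sample here, enwik8 in the later post), so they should not be compared directly.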




Last edited by Yann on Thu 26 Apr - 9:30; edited 31 times in total

Yann
Admin

Number of posts: 174
Registration date: 2008-05-01

http://phantasie.tonempire.net


Previous Releases

Post  Yann on Sat 27 Mar - 14:32

RC1 : Etincelle (win32)
- default memory size set to 128MB
- minor compression gain for binary files
- benchmark mode can accept very large files

beta 4 : Etincelle (win32)
- long repetition mode

beta 3 : Etincelle (win32)
- incompressible segment detection
- major speed improvements for files that contain some incompressible parts

beta 2 : Etincelle (win32)
- minor compression rate and speed improvement
- bugfix : correct behavior on too large memory selection

beta 1 : Etincelle (win32)
- selectable dictionary size (from 1MB to 3GB)

alpha 3 : Etincelle (win32)
- drag'n'drop interface support
- benchmark mode support

alpha 2 : Etincelle (win32)
- improved global speed
- bugfix on decoding i/o

alpha 1 : Etincelle (win32)
- initial release


Last edited by Admin on Wed 21 Apr - 12:54; edited 10 times in total


Benchmarks

Post  Yann on Sat 27 Mar - 14:33

- Tested on metacompressor.com (Thanks Sportman !) ==> Pareto Frontier (best speed to compression ratio)
http://www.metacompressor.com/top.aspx?testfile=enwik8

- Tested on Large Text Compression Benchmark (Thanks Matt !) ==> Pareto Frontier (best speed to compression ratio)
http://mattmahoney.net/dc/text.html

- Tested on MaximumCompression.com ==> Pareto Frontier (best speed to compression ratio)
http://www.maximumcompression.com/data/summary_mf3.php

- Tested on Compression Ratings (Thanks Sami !) ==> Pareto Frontier (best speed to compression ratio)
http://compressionratings.com/i_etincelle.html

- Tested on Monster of Compression (Thanks Nania !) ==> Pareto Frontier (best speed to compression ratio)
http://heartofcomp.altervista.org/MOC/MOCATC.htm

- Graphical comparison of Fast compressors ==> Pareto Frontier (best speed to compression ratio)
http://phantasie.tonempire.net/pc-compression-f2/compression-benchmark-t96.htm#149

- Etincelle -bench enwik8 :
with a Core 2 Duo E8400 @ 3.0GHz :
- Compression speed : 95MB/s
- Decoding speed : 137MB/s

with a Core 2 Duo SP9400 @ 2.4GHz :
- Compression speed : 79MB/s
- Decoding speed : 112MB/s
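
For readers unfamiliar with the term, a result is on the Pareto frontier when no other result is at least as good on both axes and strictly better on one. The sketch below is only an illustration of that definition; the function name is an assumption, the three beta entries reuse the mixed-tar figures quoted later in this thread, and the extra "hypothetical_slow" entry is invented purely to show a frontier with more than one point.

```python
def pareto_frontier(results: dict) -> list:
    """Return the names of the non-dominated entries.

    results maps a name to (compressed_size_kb, speed_mb_s).
    Smaller size is better, higher speed is better.
    """
    frontier = []
    for name, (size, speed) in results.items():
        dominated = any(
            other_size <= size and other_speed >= speed
            and (other_size < size or other_speed > speed)
            for other, (other_size, other_speed) in results.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Figures from the 126 MB mixed tar test later in this thread.
tar_results = {
    "beta2": (102_546, 60),
    "beta3": (69_922, 160),
    "beta4": (62_768, 165),
}

if __name__ == "__main__":
    print(pareto_frontier(tar_results))
    # Invented data point, NOT a measurement: smaller output, much slower.
    with_extra = dict(tar_results, hypothetical_slow=(60_000, 20))
    print(pareto_frontier(with_extra))
```

On the thread's own figures only beta4 survives, since it is both smaller and faster than the two earlier betas. A slower compressor with a smaller output would share the frontier with it.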


Last edited by Admin on Sat 1 May - 10:02; edited 3 times in total


Post  Yann on Wed 7 Apr - 22:52

[From ppmx.ru forum] :

Hi

Beta3 is ready for testing, with better than expected results. As it stands, the new segment detection algorithm works wonders.

It can be downloaded here :
http://sd-1.archive-host.com/membres...elle-beta3.zip

Etincelle is able to find small segments of incompressible data within any file and remove them on the fly from the compression loop.
What's more, this detection comes at no perceptible cost for files that do not contain such segments.

Let's run some tests to verify this claim (each line gives the compressed size as a percentage of the original, the compression speed, then the decoding speed) :
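
The post does not describe how the detector works internally, so the following is not Etincelle's algorithm. It is a minimal sketch of one generic way to flag incompressible regions: estimate the order-0 entropy of fixed-size blocks from their byte histogram and flag blocks close to 8 bits per byte. BLOCK, THRESHOLD and both function names are assumptions made for this example.

```python
import math
import os
from collections import Counter

BLOCK = 4096      # hypothetical block size, in bytes
THRESHOLD = 7.5   # hypothetical cutoff, in bits per byte


def entropy_bits_per_byte(block: bytes) -> float:
    """Order-0 Shannon entropy of a block, from its byte histogram."""
    if not block:
        return 0.0
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in Counter(block).values())


def incompressible_blocks(data: bytes, block: int = BLOCK,
                          threshold: float = THRESHOLD) -> list:
    """Offsets of the blocks whose entropy estimate reaches the threshold."""
    return [
        off for off in range(0, len(data), block)
        if entropy_bits_per_byte(data[off:off + block]) >= threshold
    ]


if __name__ == "__main__":
    text = b"hello world " * 1024   # 12288 bytes of very redundant text
    noise = os.urandom(8192)        # stand-in for already compressed data
    print(incompressible_blocks(text + noise))
```

A histogram-based estimate like this one cannot see repetitions, so a block that looks like noise may still be an exact copy of earlier data. A detector of this kind therefore decides what to skip, not whether a long-range match exists.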
Enwik8
beta2 : 35.76% - 97.1 MB/s - 132 MB/s
beta3 : 35.76% - 98.6 MB/s - 134 MB/s

That was my first worry : would the detector slow down compression of files for which it is useless, such as enwik8, where no segment can qualify as "incompressible" ?
Apparently not. There is even a very small speed boost, due to other minor improvements (removed inefficiencies).

Let's take the other extreme, and try to compress an already compressed file :
Enwik8.7z
beta2 : 100.00% - 51 MB/s - 186 MB/s
beta3 : 100.00% - 430 MB/s - 1000 MB/s

Now we're talking, and it provides a useful reference for "wall time", when ideal detection conditions are present.


Making life a little harder, we are going to test a more difficult sample :
a sound.wav file, which is neither really compressible nor completely incompressible. No filter is applied; this is direct LZ compression.
Sound.Wav
beta2 : 90.90% - 50 MB/s - 125 MB/s
beta3 : 90.91% - 150 MB/s - 250 MB/s

Not bad at all; the minor compression loss is more than offset by speed gains. Indeed, there was a real risk in this situation that the detector would either trigger too rarely, or would miss too many hits. Apparently, it succeeds nicely at keeping a good ratio.


Now let's deal with real-life examples.
One torture test I had in mind since the beginning was a Win98.vmdk virtual HDD. It is a ~300MB file, of which a fair part (about 20%) consists of CAB files (Microsoft compressed cabinets) somewhat scattered within virtual segments.
Now this is a particularly difficult situation : no clear file separation, no extension to help detection, not even a guarantee that files are written in a single continuous location (they can be scattered between several virtual sectors).

This is exactly where automatic segment detection can have an impact :

Win98.vmdk
beta2 : 64.95% - 67 MB/s - 176 MB/s
beta3 : 63.86% - 100 MB/s - 250 MB/s

Now that seems correct. Speed improvement is noticeable.
But wait, that's not all : hasn't the compression rate also improved ?
Yes, it has.
The reason for this gain is that, with incompressible segments skipped, the table is less clobbered with useless "noise" pointers, which improves compression opportunities for future data.
This is the nice bonus of this strategy.

Skipping data, however, also has its downsides for compression, such as skipping too much, or not providing enough pointers for future data. As a counter-example, let's compress the "Vampire" game directory.

Vampire
beta2 : 41.13% - 80 MB/s - 185 MB/s
beta3 : 41.43% - 90 MB/s - 200 MB/s

This time it did not work so well : 0.3% of ratio is lost. Keeping the table clean was not enough to offset the excess of skipped data and the reduced opportunity to find new matches.

Vampire is, however, quite unusual.
In the vast majority of circumstances, the compression difference is very minor (in the range of 0.05%), and more likely a gain than a loss.
Speedwise, however, this is always a gain. Even small segments lost in a large container do provide their share of speed boost.

As a last example, let's try an Ubuntu virtual HDD file, as proposed by Bulat for his HFCB benchmark :

VM.DLL
beta2 : 32.15% - 90 MB/s - 195 MB/s
beta3 : 32.13% - 100 MB/s - 210 MB/s

This example is typical of what you'll get in most circumstances with many large files. So it sounds like a good addition to a fast compressor.

Best Regards

Yann

Post  Yann on Fri 9 Apr - 22:01

After programming the "incompressible segment detection & skipping" algorithm, I was asked whether Etincelle would still be able to detect two identical segments which are incompressible, for example 2 identical zip files.

I said that yes, it should, having the feeling that I'm keeping enough information to find such occurrences. That was easy to say. But better to find out on real data.

I created a simple Tar file, which consists of twice enwik8 compressed with 7z. This is perfect repetition, at long range (about 30MB away). Therefore, ideal compression rate should be 50%.

Let's test with the "old" version :
beta2 : 100.00% - 49 MB/s - 186 MB/s

That's a terrible result : the original Etincelle does not find any correlation !
There is an explanation though : these repetitions are very distant ones, in the range of 30MB away.
As a consequence, the table gets filled with a lot of useless "noise pointers", to the point of forgetting what happened 30MB before.
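
The same effect can be reproduced with standard tools, which may help to see why distance matters. The sketch below does not involve Etincelle at all: it uses Python's zlib (DEFLATE, 32 KiB window) and lzma (8 MiB dictionary at the default preset) as stand-ins, and os.urandom as a stand-in for an already compressed file. The function name and the 1 MiB size are assumptions chosen to keep the run short.

```python
import lzma
import os
import zlib


def doubled_noise_ratios(size: int = 1 << 20):
    """Compress noise + identical noise with a short and a long window.

    Returns (zlib_ratio, lzma_ratio), each as compressed/original size.
    """
    noise = os.urandom(size)   # incompressible on its own
    data = noise + noise       # perfect repetition, `size` bytes apart
    short_window = len(zlib.compress(data, 9)) / len(data)
    long_window = len(lzma.compress(data)) / len(data)
    return short_window, long_window


if __name__ == "__main__":
    short_window, long_window = doubled_noise_ratios()
    print(f"zlib (32 KiB window)    : {short_window:.2%}")
    print(f"lzma (8 MiB dictionary) : {long_window:.2%}")
```

The short-window compressor cannot reach the first copy, so it ends up near 100%, like beta2 above. The long-window one can, and lands close to the ideal 50%.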

So now let's test the brand new beta 3 with "incompressible segment detection" :
beta3 : 70.86% - 480 MB/s - 1150 MB/s

That's much better. This time, some correlation was found. In any case, it proves that beta 3 does improve the chances of finding two occurrences of the same zip file.
But why not 50% ?

This can be explained too : Etincelle has a limit on the length of the repetition sequence it can handle. Therefore, after reaching the maximum repetition length, it has to find a new match, and it may not find one immediately. Fortunately, some minimal book-keeping ensures that it does get back in sync some time later, but in between it has lost quite a few opportunities to reduce the file size.

Hence comes beta 4, a modification of Etincelle with Long-Repetition support. It should really help in this situation.
So let's put it to the test :
beta4 : 50.00% - 570 MB/s - 1300 MB/s

This time, it seems good. Perfect long range repetition.


Beta 4 seems to fulfill its objective, but let's check what happens in a more complex situation. I created a new tar, with several already compressed files mixed with normal files. The first occurrences of the compressed files are intentionally grouped together at the beginning, to try to trick the "detection & skipping" algorithm. They may repeat later, at large distance.

The resulting file is 126 MB long, and its best theoretical compressed size is the sum of the compressed parts counted only once, which is 62 258 KB.
Let's compare :
beta2 : 102 546 KB - 60 MB/s - 169 MB/s
beta3 : 69 922 KB - 160 MB/s - 350 MB/s
beta4 : 62 768 KB - 165 MB/s - 355 MB/s

Within close range of the theoretical minimum. Seems good enough.
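
To put a number on "close range", the figures above can be compared with the 62 258 KB ideal size. The snippet below only redoes that arithmetic; the function and variable names are illustrative.

```python
def overhead_percent(actual_kb: float, ideal_kb: float) -> float:
    """How far a result sits above the ideal size, in percent."""
    return 100.0 * (actual_kb - ideal_kb) / ideal_kb


IDEAL_KB = 62_258   # best theoretical size quoted in the post
SIZES_KB = {"beta2": 102_546, "beta3": 69_922, "beta4": 62_768}

if __name__ == "__main__":
    for name, size in SIZES_KB.items():
        extra = overhead_percent(size, IDEAL_KB)
        print(f"{name}: {extra:.2f}% above the ideal size")
```

This gives roughly 65% above the ideal size for beta2, about 12% for beta3, and less than 1% for beta4.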


What about real files now ?
Long repetitions are not common in normal situations.
One important exception is mailboxes, which tend to be filled with several instances of identical attached files. These files are more often than not zipped, so the new algorithm works wonders at detecting them.
Apart from this situation, expect less impressive results. Across the several large files I tested, beta4 only ever improved the compression rate compared to beta3, if sometimes by just a few KB.
So it is only beneficial.

beta4 :
http://sd-1.archive-host.com/membres...elle-beta4.zip

Regards

Yann