For those who don't know, I have recently added LibCompress to the SVN.
LibCompress is an LZW-based compression algorithm. It is fairly fast (less than half a second to compress 20KB of data on my crappy computer at work, and about a tenth of a second to decompress the result).
I have tried to make the algorithm as efficient as I possibly could. Compression is lossless, and the ratio depends entirely on the redundancy of the uncompressed data. As long as the input data does not contain "\000" characters, the output will not have any either, making it suitable for use with AceComm-3.0 and other such libraries.
Well, I have written a huffman compress/uncompress algorithm, and I think we should merge the different algorithms into the same library.
My huffman algorithm encodes about twice as fast as your LZW algorithm, producing better compression on my test data. Decompression, however, is about the same speed as compression, i.e. somewhat slower than your LZW decompression.
Other algorithms may arise, so I suggest we implement a simple standard for compressed data: a 1-byte header that selects the compression algorithm, with the data following the header byte passed to the appropriate decompression algorithm.
For compression, the caller could request a specific compression algorithm, or LibCompress could try all of its compression algorithms and see whether any of them can actually shrink the data.
My huffman algorithm will in the worst case produce compressed data 1 byte larger than the uncompressed data. That 1 byte is the header byte. Currently I use byte value 0 to indicate uncompressed data and byte value 1 to indicate huffman-compressed data.
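The proposed framing (header byte 0 = stored uncompressed, 1 = huffman, with the worst case being the 1 extra header byte) could be sketched like this. This is a Python illustration only, not the actual LibCompress API (the real library is Lua), and the `frame`/`unframe` names are made up for the example:

```python
# Sketch of the proposed 1-byte header framing (illustrative names;
# the real library is written in Lua).

HEADER_UNCOMPRESSED = b"\x00"  # byte value 0: data stored as-is
HEADER_HUFFMAN = b"\x01"       # byte value 1: huffman-compressed data

def frame(data: bytes, compress) -> bytes:
    """Compress data, falling back to storing it verbatim when the
    compressed form is not smaller; worst case grows by 1 header byte."""
    compressed = compress(data)
    if len(compressed) < len(data):
        return HEADER_HUFFMAN + compressed
    return HEADER_UNCOMPRESSED + data

def unframe(framed: bytes, decompress) -> bytes:
    """Dispatch on the header byte and return the original data."""
    header, payload = framed[:1], framed[1:]
    if header == HEADER_UNCOMPRESSED:
        return payload
    if header == HEADER_HUFFMAN:
        return decompress(payload)
    raise ValueError("unknown compression header byte")
```

With this scheme, new algorithms only need a previously unused header byte value, and the decompressor never has to guess which method produced the stream.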
Example of huffman-compressing 20000 random byte values in the range 0-24 (25 symbols; the actual byte values are insignificant):
My algorithms have been optimized for speed. That is not to say further optimizations are impossible. :-P
There is a bug in LibCompress right now:
if (#uncompressed > ressize) or (uncompressed:sub(1,1) == "/001") then
should be:
if (#uncompressed > ressize) or (uncompressed:sub(1,1) == "\001") then
The "/001" uses a forward slash, so it compares against a literal three-character string instead of the escape byte \001.
If LibCompress were to have several compression methods, your LZW would require a permanent prefix byte value for compressed data. Right now you use "\001" to indicate that data is compressed, but if you try to compress data that starts with a \001 byte, you always get the compressed data stream back, even if it is much larger than the uncompressed data:
Compressing 20001 bytes with values in the range 0-24 and an initial byte value of \001:
I cannot guarantee that the compressed huffman data does not contain \000 byte values. The huffman algorithm does not operate on byte values as such, but packs words of varying bit-length into bytes for storage. This can result in all sorts of byte values. It is up to a channel encoder to make sure all byte values can be transmitted over a given communication channel or be stored in SavedVariables files.
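Packing variable-bit-length code words into bytes, as described above, can be sketched like this. Again a Python illustration of the general technique, not the library's actual Lua code; the LSB-first bit order is an assumption for the example:

```python
def pack_bits(codes):
    """Pack (value, bit_length) pairs into bytes, LSB-first.
    Any byte value 0-255 can appear in the output, including \x00,
    which is why a separate channel encoder is needed for comms."""
    out = bytearray()
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently in acc
    for value, length in codes:
        acc |= value << nbits
        nbits += length
        # flush complete bytes as soon as they are available
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # final partial byte, zero-padded
    return bytes(out)
```

Flushing whole bytes eagerly also keeps the accumulator small, which matters in Lua, where packing too many bits into one number loses precision.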
I would guess encoding all 0 byte values would make the data grow by about 1-2% or so.
Perhaps we should add an option to the compress/decompress functions to guarantee comms safety, changing all occurrences of \000 and \001 to \001\255 and \001\001 respectively in the compress code, while converting in the reverse direction in decompress. This way, the user of the library doesn't need to worry about this implementation detail.
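The escaping idea above (\000 becomes \001\255, \001 becomes \001\001) can be sketched as follows. This is a Python mock-up of the scheme as proposed, not LibCompress code; note that the \001 bytes must be escaped before the \000 bytes, since the \000 replacement introduces new \001 bytes:

```python
ESC = b"\x01"

def escape(data: bytes) -> bytes:
    """Remove all \x00 bytes from the stream using the proposed scheme:
    \x01 -> \x01\x01 (escaped first), then \x00 -> \x01\xff."""
    return data.replace(ESC, b"\x01\x01").replace(b"\x00", b"\x01\xff")

def unescape(data: bytes) -> bytes:
    """Reverse the escaping: every \x01 starts a two-byte sequence."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x01:
            nxt = data[i + 1]
            out.append(0x00 if nxt == 0xFF else 0x01)
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)
```

Worst case (all-zero input) doubles the data, but for compressed output, where \000 and \001 should each appear roughly 1 time in 256, the overhead is around the 1-2% guessed below.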
Honestly, the only thing something like this would ever really be used for is communication, since memory shouldn't be an issue client-side. So since we are already messing with strings, you might as well finish the job and make them ready for transmission, or at least give the option to make them transmittable.
Hmm, just tried it and it didn't do anything to my string.
This string is the one I get after using AceSerializer on my table:
SavedVariables has a limit on size, and there's also a limit to how long a string can be.
I've encountered both problems previously, in extreme cases while developing addons. I've just changed the way an addon works so that instead of taking up so much SavedVariables space, it does a lot more computation in-game, leading to much higher CPU usage.
It wasn't compressed because the LZW compression algorithm couldn't compress it. The result of the algorithm was 166 characters longer. Most likely there wasn't enough long-chain repetition in the source text (repetition of short two- or three-character strings is not enough). The LZW algorithm I presented is much better suited to texts that use the same words often.
Also, the LZW algorithm uses a numeric encoding system which is inherently less space-efficient than a true dynamic-bit-width code (i.e. 9-, 10-, 11-bit codes, and so on). My encoding system is not nearly as efficient as it could be. Perhaps in the future, when I don't have as much other work to do, I will convert the encoding system to a more appropriate one.
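For reference, a dynamic-bit-width LZW coder widens its codes only as the dictionary grows. A small Python sketch of that idea (illustrative only, and `code_width` is a made-up helper, not anything in LibCompress; the floor of 9 bits assumes the usual 256 literals plus control codes):

```python
def code_width(dict_size):
    """Minimum bits needed to represent codes 0..dict_size-1, with a
    9-bit floor as in typical dynamic-width LZW (9, 10, 11, ... bits)."""
    return max(9, (dict_size - 1).bit_length())
```

So the first 512 dictionary entries cost only 9 bits per code, which is where the space advantage over a fixed numeric encoding comes from.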
See the attached file; it contains a table with 3 strings: the original serialized string, the Huffed string, and a deHuffed string. The decoded one almost matches the original, but all capital A's were changed to 's.
The reason was that I was putting too many bits into a variable (35 bits when there was only room for 32). This was a problem only in the decompressor, which has been reworked. The problem can happen in the compressor as well, but I now check for it (in the only place where it can happen, I think). If it does happen, the compressor returns the uncompressed string with a header indicating it is uncompressed (which can be fed to the decompressor).