For those who don't know, I have recently added LibCompress to the SVN.
LibCompress is an LZW-based compression algorithm. It is fairly fast (less than half a second to compress 20KB of data on my crappy computer at work, about a tenth of a second to decompress the result).
I have tried to make the algorithm as efficient as I possibly could. Compression is lossless, and the ratio depends entirely on the redundancy of the uncompressed data. As long as the input data does not contain "\000" characters, the output will not have any either, making it suitable for use with AceComm-3.0 and other such libraries.
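For illustration, usage could look roughly like this; the method names below are assumptions based on this discussion, not a fixed API:

-- hypothetical usage sketch; the Compress/Decompress method names are assumptions
local LibCompress = LibStub:GetLibrary("LibCompress")

local serialized = "some AceSerializer-3.0 output"      -- any string
local compressed = LibCompress:Compress(serialized)     -- hopefully a smaller string

-- ... send via AceComm-3.0 or store in SavedVariables ...

local restored = LibCompress:Decompress(compressed)
assert(restored == serialized)                          -- compression is lossless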
Hmmm, if I remember correctly, the addon data transfer limit is something like 14 KB per a certain time window, so this could be pretty handy for addons that need to transfer a lot of data.
Well, I have written a huffman compress/uncompress algorithm and I think we should merge the different algorithms into the same library.
My huffman algorithm encodes about twice as fast as your LZW algorithm, producing better compression on my test data. Decompression is, however, about the same speed as compression, i.e. somewhat slower than your LZW decompression.
Other algorithms may arise, so I suggest we implement a simple standard for compressed data, consisting of a 1-byte header that selects the compression algorithm; the data following the header byte is passed to the appropriate decompression algorithm.
For compression, we could ask for a specific compression algorithm, or have LibCompress try all of the compression algorithms and return whichever one actually manages to compress the data.
My huffman algorithm will in the worst case produce compressed data 1 byte larger than the uncompressed data. That 1 byte is the header byte. Currently I use byte value 0 to indicate uncompressed data and byte value 1 to indicate huffman-compressed data.
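A minimal sketch of that fallback, assuming a hypothetical CompressHuffman function and the byte values just mentioned:

-- minimal sketch of the header-byte fallback; CompressHuffman is hypothetical
local function Compress(data)
    local compressed = CompressHuffman(data)
    if compressed and #compressed < #data then
        return "\001" .. compressed   -- byte value 1: huffman compressed
    end
    return "\000" .. data             -- byte value 0: stored uncompressed (worst case: 1 byte larger)
end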
Example of huffman-compressing 20000 random byte values ranged 0-24 (25 symbols, actual byte values are insignificant):
encoding time: 0.032
decoding time: 0.046
Original size: 20000
Compressed size: 11824
Compression ratio: 0.5912
My algorithms have been optimized for speed. That is not to say further optimizations are impossible. :-P
There is a bug in LibCompress right now:
if (#uncompressed > ressize) or (uncompressed:sub(1,1) == "/001") then
should be:
if (#uncompressed > ressize) or (uncompressed:sub(1,1) == "\001") then
If LibCompress were to have several compression methods, your LZW would require a permanent prefix byte value for compressed data. Right now you use "\001" to indicate that data is compressed, but if you try to compress data that already starts with \001, you always get the compressed data stream back, even if it is much larger than the uncompressed data:
Compressing 20001 byte values 0-24 with an initial byte value of \001:
encoding time: 0.063
decoding time: 0.031
Original size: 20001
Compressed size: 26658
Compression ratio: 1.3328333583321
Always using a prefix byte will solve this problem, as you will always be able to flag compressed data properly (and be able to compress data regardless of input byte values).
A last problem: compressing the empty string "" causes an error.
If you prefer not to merge the algorithms, I'll have to figure out a good name (was thinking about LibCompress, but now it is taken. :-P )
http://pastey.net/87899
\000 - Never used (because communications hates \000 characters)
\001 - Uncompressed
\002 - LZW
\003 - Huffman
Also, if you can get your algorithm to guarantee no \000 characters in compressed output that would be really good for communication uses.
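A minimal sketch of how such a shared header byte could be used, assuming hypothetical CompressLZW/CompressHuffman/DecompressLZW/DecompressHuffman functions are defined elsewhere:

-- minimal sketch of header-byte selection and dispatch; raw (de)compressor names are hypothetical
local function CompressAuto(data)
    local lzw  = CompressLZW(data)
    local huff = CompressHuffman(data)
    local best, header = data, "\001"        -- default: store uncompressed
    if lzw  and #lzw  < #best then best, header = lzw,  "\002" end
    if huff and #huff < #best then best, header = huff, "\003" end
    return header .. best
end

local DECODERS = {
    ["\001"] = function(payload) return payload end,
    ["\002"] = function(payload) return DecompressLZW(payload) end,
    ["\003"] = function(payload) return DecompressHuffman(payload) end,
}

local function Decompress(data)
    local decoder = DECODERS[data:sub(1, 1)]
    if not decoder then return nil, "unknown compression header" end
    return decoder(data:sub(2))
end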
I cannot guarantee that the compressed huffman data does not contain \000 byte values. The huffman algorithm does not operate on byte values as such, but packs words of varying bit-length into bytes for storage. This can result in all sorts of byte values. It is up to a channel encoder to make sure all byte values can be transmitted over a given communication channel or be stored in SavedVariables files.
I would guess encoding all 0 byte values would make the data grow by about 1-2% or so.
Either way it is not too important.
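As an illustration of the bit-packing step (a generic sketch, not the actual implementation):

-- generic sketch of packing variable bit-length codes into bytes (LSB-first);
-- this is why the output can contain any byte value, including \000
local function PackBits(codes)   -- codes = { { value = ..., bits = ... }, ... }
    local out, acc, accBits = {}, 0, 0
    for _, code in ipairs(codes) do
        acc = acc + code.value * 2 ^ accBits    -- append the code above the bits already buffered
        accBits = accBits + code.bits
        while accBits >= 8 do                   -- emit every complete byte
            out[#out + 1] = string.char(acc % 256)
            acc = math.floor(acc / 256)
            accBits = accBits - 8
        end
    end
    if accBits > 0 then
        out[#out + 1] = string.char(acc % 256)  -- flush the remaining partial byte
    end
    return table.concat(out)
end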
Perhaps we should add an option to the compress/decompress functions to guarantee comms safety, changing all occurrences of \000 and \001 to \001\255 and \001\001 respectively in the compress code, while converting in the reverse direction in decompress. This way, the user of the library doesn't need to worry about this implementation detail.
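A minimal sketch of that escaping scheme (helper names are hypothetical; %z is how Lua 5.1 patterns match the NUL byte):

-- minimal sketch of the proposed \000/\001 escaping; helper names are hypothetical
local ESCAPE_MAP = { ["\000"] = "\001\255", ["\001"] = "\001\001" }

local function EscapeForComms(data)
    return (data:gsub("[%z\001]", ESCAPE_MAP))   -- result contains no \000 bytes
end

local function UnescapeForComms(data)
    return (data:gsub("\001([\001\255])", function(marker)
        return marker == "\255" and "\000" or "\001"
    end))
end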
I don't think this is what a compression algorithm should do. Some addons have their own channel encoding, and us messing with the data stream would not be optimal. Keep the things separate, I say.
I haven't any wowace SVN access yet, but may get it eventually.
Honestly, the only thing something like this would ever really be used for is communication, since memory shouldn't be an issue client-side. So, since we are already messing with strings, you might as well finish the job and make them ready for transmission, or at least give the option to make them transmittable.
Hmm, just tried it and it didn't do anything to my string.
This string is the one I get after using AceSerializer on my table:
This string is the return from the above string passed to the compression function:
Attached is my SV file I dumped the strings to.
SavedVariables has a limit on size, and there's also a limit to how long a string can be.
I've encountered both problems previously, in extreme cases while developing addons. I've just changed the way an addon works so that instead of taking up so much SavedVariables space, it does a lot more computation in-game, leading to much higher CPU usage.
This library could help in both those cases.
Note that my LZW compressor will only return a compressed value if the compressed form is actually smaller than the given string. This may be why your string isn't getting compressed.
jjsheets, any way to maybe return an extra value if the original was smaller than the 'compressed' version?
Also, any idea why that string wasn't compressed?
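One way that extra value could look (a hypothetical wrapper, not the library's current behaviour):

-- hypothetical wrapper returning a second value that says whether compression happened
local function CompressWithFlag(data)
    local compressed = CompressLZW(data)   -- hypothetical raw LZW compressor
    if compressed and #compressed < #data then
        return compressed, true            -- data was actually compressed
    end
    return data, false                     -- returned unchanged
end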
It wasn't compressed because the LZW compression algorithm couldn't compress it. The result of the algorithm was 166 characters longer. Most likely there wasn't enough long chain repetition in the source text (repetition of short two or three character strings is not enough). The LZW algorithm I presented is much better suited for texts that use the same words often.
Also, the LZW algorithm uses a numeric encoding system which is inherently less space efficient than a true dynamic bit-width code (i.e. 9-, 10-, 11-bit codes and so on). My encoding system is not nearly as efficient as it could be. Perhaps in the future, when I don't have as much other work to do, I will convert the encoding system to a more appropriate one.
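As a rough illustration of the kind of input LZW handles well versus poorly:

-- illustrative only: LZW thrives on long repeated substrings, not on near-random text
local repetitive = string.rep("playerName=Thrall;class=SHAMAN;", 200)   -- compresses well

local chunks = {}
for i = 1, 6000 do
    chunks[i] = string.char(math.random(32, 126))                       -- near-random printable bytes
end
local noisy = table.concat(chunks)                                      -- barely compresses, may even grow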
Works like a charm (most of the time).
See attached file; it contains a table with 3 strings: the original serialized string, the Hufed string and a deHufed string. The decoded one almost matches the original one, but all capital A's were changed to 's.
http://pastey.net/87960
The reason was that I was putting too many bits into a variable (35 bits when there was only room for 32). This was a problem in the decompressor only, which has been reworked. The problem can happen in the compressor as well, but now I am checking for it (in the only place where it can happen, I think). If it does happen, the compressor returns the uncompressed string with a header indicating it is uncompressed (which can be fed to the decompressor).
I haven't received any response about my SVN query, so I'll just make my own library and publish it on curse.com. May I include the LZW code and reuse the LibCompress name?
I'd much rather develop LibCompress on wowace (at least until curseforge gets stable).
I will be happy to commit any changes to LibCompress for you until you get your SVN access. Just send me a PM with the changed files or patches, and a commit message.
That won't be necessary. It seems I have received my SVN login and can now begin work on LibCompress.