Tuesday, November 12, 2013

Breaking a Compressed Huffman Stream Format

So one of the things I've been working on lately is this Dreamcast game. Most of the assets for video, audio, etc. are standard AFS; plenty of tools exist to rip those - not interesting.

What I'm really interested in are the textures and models - but there's a catch.

Most likely for performance reasons, the developer decided to compress all of these assets into a single 100 MB or so file in a custom format; the top of the file, however, is the first interesting bit:


Making the assumption right off the cuff that this is some kind of archive or VFS (given the requirement of storing multiple files), these values begin to make sense: 0x128 looks like an offset and 0x2A1E4B4 looks like a size, and combined they add up to 0x2A1E5DC.

Writing a quick Python program:
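A minimal sketch of what that program did might look like this (the header layout - a table of little-endian offset/size pairs sitting at the top of the file - is my guess reconstructed from the 0x128/0x2A1E4B4 values above, not a confirmed spec):

```python
import struct

def split_archive(data, count):
    """Split a blob into `count` sub-files, assuming the header is a
    table of little-endian (offset, size) pairs starting at byte 0.
    The exact layout is an assumption based on the hex dump above."""
    files = []
    for i in range(count):
        offset, size = struct.unpack_from("<II", data, i * 8)
        files.append(data[offset:offset + size])
    return files
```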


Yields about 37 smaller files; not very many for a game.

Some of the larger files (> 60K) have the same structure at the top, so it's clear that this format can be nested (i.e. multi-file archives inside of multi-file archives).
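Without a known magic number, one way to spot that nesting is to check whether a blob's first entry passes the same offset + size sanity arithmetic we used above - a heuristic sketch (the 8-byte entry layout is, again, an assumption):

```python
import struct

def looks_like_archive(blob):
    """Heuristic: treat the first 8 bytes as a little-endian
    (offset, size) pair and see whether they land inside the blob,
    mirroring the 0x128 + 0x2A1E4B4 = 0x2A1E5DC check above."""
    if len(blob) < 8:
        return False
    offset, size = struct.unpack_from("<II", blob, 0)
    return 0 < offset < len(blob) and offset + size <= len(blob)
```

Recursing on every extracted file that passes this check gets you to the bottom of the nesting.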

At the end of all this recursion, you get something like this:

For most files, the first 4 bytes are the same - but not all, which rules out a CRC. After digging into the game binary and almost keeling over from having to stop and look up how to read SH4 ASM, it's rather apparent that this is some form of LHA-based compression (Huffman encoding). What's more, it appears to have a 64K sliding window, which points to LH5. Of course, this file is rather useless without a header for any LHA extraction program.

Basically, LHA is huge in Japan and has been for a long time (the Japanese edition of Windows 7 actually COMES with LHA support). For those of you not familiar with it, the header looks like this:


We can already see a few things: the first byte is a size (the size of the header, not including itself), followed by a 1-byte checksum for the header itself (rolls over to null), a tag for the compression method used (-lh5-), the compressed size, the decompressed size, file metadata (including the filename), and finally the data itself.
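As a sketch, here's how those fields lay out in code for a standard level-0 LHA header (field offsets follow the common level-0 layout; the size byte counts everything after the first two bytes, and the data CRC16 sits right after the filename):

```python
import struct

def parse_lha_level0(buf):
    """Parse a standard level-0 LHA header from the start of `buf`.
    Layout: size, checksum, 5-byte method id, compressed size,
    original size, MS-DOS timestamp, attribute, level, filename,
    CRC16 of the uncompressed data."""
    hdr_size = buf[0]                    # excludes the first two bytes
    checksum = buf[1]                    # sum of the remaining header bytes & 0xFF
    method = buf[2:7].decode("ascii")    # e.g. "-lh5-"
    packed, original = struct.unpack_from("<II", buf, 7)
    # bytes 15-18: MS-DOS timestamp, 19: attribute, 20: header level
    name_len = buf[21]
    name = buf[22:22 + name_len].decode("ascii", "replace")
    crc16, = struct.unpack_from("<H", buf, 22 + name_len)
    return {"method": method, "packed": packed, "original": original,
            "name": name, "crc16": crc16, "data_start": hdr_size + 2}
```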

Now, we could simply remake this header and try to slap it on top of our data, but we also don't know where the compressed stream starts (being 1 byte off could completely screw the whole decompression). In this case, it's best to make a brute-force unpacker.

I've selected a C-extension Python module called lhafile to hack up and turn into a brute forcer (http://trac.neotitans.net/wiki/lhafile). It consists of a Python wrapper that basically figures out the compression type, file size, etc. and passes the stream to a C lib that does the actual decompression (of course).

In the _RealGetContent function, we'll just populate the structure that reading our header SHOULD have populated (if the file had one):
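Something in this spirit - note this is a stand-in sketch, not lhafile's actual code, and the attribute names are illustrative guesses at what its info record holds:

```python
class FakeLhaInfo:
    """Stand-in for the record _RealGetContent would normally build
    from a real header. Attribute names here are illustrative guesses
    at lhafile's info object, not its actual definition."""
    pass

def fake_info(stream_len, start_offset):
    info = FakeLhaInfo()
    info.compress_type = b"-lh5-"                   # forced: we know it from the binary
    info.compress_size = stream_len - start_offset  # everything after our guessed start
    info.file_size = 0xFFFFFFFF                     # unknown uncompressed size: oversize it
    info.CRC = None                                 # no stored CRC to check against
    info.file_offset = start_offset                 # where the brute forcer says data begins
    return info
```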



Other smaller changes include wiping out the check for the CRC16 not matching 0, and the check on the uncompressed size (considering we don't know it).

Now comes the tricky part: we can decompress the chunk, but we have no idea what byte it starts on. Therefore, we need to actually do the decompression multiple times while checking to ensure it didn't break. Thankfully, LHA has the ability to fault out if the bit pattern isn't right (which basically means we started on the wrong byte and it can't figure out how to decode the next byte in the sequence). We'll alter the original logic from raising an exception at this point to simply continuing on to the next start byte in the sequence.

Basically, our workflow is like this:
1. Read compressed data starting at byte n.
2. Try to decompress it.
3. If it succeeds, write the output to a file and go again from n+1; if not, go to the next iteration anyway.

The last step also involves us resetting the file pointer for the data stream back to the beginning byte, but those are just details. In the end, we want:

- To specify the number of successful decompressions we want.
- To specify how many times we're going to retry the file from the next offset.

Adding some arguments to the runtime and reading them with sys.argv will work nicely for that.
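Putting the loop together, the brute-force driver looks roughly like this (a sketch: `try_decompress` is a placeholder for the patched lhafile call, which raises when the bit pattern doesn't decode):

```python
def brute_force(data, try_decompress, max_successes, max_attempts):
    """Slide the assumed stream start forward one byte at a time.
    `try_decompress(buf)` stands in for the patched lhafile call;
    it must return the decoded bytes or raise on a bad bit pattern."""
    results = []
    for offset in range(min(max_attempts, len(data))):
        try:
            out = try_decompress(data[offset:])   # reset to this start byte
        except Exception:
            continue                              # wrong start byte: keep going
        if out:                                   # only keep non-empty decodes
            results.append((offset, out))
            if len(results) >= max_successes:
                break
    return results
```

The two knobs above map straight onto the command line: `int(sys.argv[2])` for the success count and `int(sys.argv[3])` for the retry limit, for example.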



Basically, we keep trying until we hit the maximum number of attempts we gave it to decompress the stream. We only keep files that are more than 0 bytes (meaning they actually decoded). I was surprised: there are only two or three that actually get decoded properly.


In the end, one of the smaller files I was working on contained this:

We have plaintext!!! The format surrounding it is odd, however. Maybe it didn't unpack properly? Maybe it's actually compressed further... it's likely.

More as it develops!
