Reverse engineering an unknown compression file format
This challenge was solved during the Santhacklaus2019 ctf and is one of the most complete chall I’ve ever encountered, as it requires to have a good understanding of: - c++ reverse engineering - scripting - file format / steganography - cryptography
First contact
The description of the challenge lets us know that the goal is to reverse engineer a custom file format (NSAR) used to compress and encrypt multiple files. We download the two files:
nsar
It is the program used to compress and encrypt multiple files to a .nsar
archive. We can use it pretty easily:
➜ ./nsar
Enter path to files (max 255 files):
> ./toast.pdf
> ./blackhole.gif
>
2 files selected.
Key for encryption (empty = no encryption): MYKEY42
Encryption
Output: out.nsar
➜ hexdump -C out.nsar | head
00000000 4e 53 41 52 00 01 01 02 00 00 00 00 00 00 00 00 |NSAR............|
00000010 62 6c 61 63 6b 68 6f 6c 65 2e 67 69 66 00 30 00 |blackhole.gif.0.|
00000020 00 00 74 6f 61 73 74 2e 70 64 66 00 96 76 69 00 |..toast.pdf..vi.|
00000030 1f 8b 08 00 00 00 00 00 00 03 00 1b 40 e4 bf 0a |............@...|
00000040 10 0d 7d 60 55 22 49 01 49 98 00 00 00 00 00 40 |..}`U"I.I......@|
00000050 00 00 22 33 00 7b 4a 00 77 36 00 74 5b 00 06 5a |.."3.{J.w6.t[..Z|
00000060 00 7f 48 00 1c 43 00 54 35 00 31 42 00 28 3e 00 |..H..C.T5.1B.(>.|
00000070 34 55 00 c4 54 00 ba 42 00 c4 55 00 a2 20 00 c4 |4U..T..B..U.. ..|
00000080 58 00 fa 20 00 e4 4c 00 f4 4e 35 89 55 58 8d 5e |X.. ..L..N5.UX.^|
00000090 5b 80 13 49 95 6f 47 66 12 28 8f 71 48 29 73 21 |[..I.oGf.(.qH)s!|
archive.nsar
An NSAR archive created with the above program by the creator of the challenge. We can see that a lot of files in various formats have been put in this archive. One of them surely contains the flag!
➜ hexdump -C archive.nsar | head -n 20
00000000 4e 53 41 52 00 01 01 1c 00 00 00 00 00 00 00 00 |NSAR............|
00000010 61 2e 6f 75 74 00 7f 01 00 00 62 62 62 2e 6d 70 |a.out.....bbb.mp|
00000020 34 00 2b 3e 00 00 63 64 2e 67 69 66 00 79 98 0f |4.+>..cd.gif.y..|
00000030 00 64 6c 70 6d 6a 63 63 2e 77 6d 76 00 52 76 30 |.dlpmjcc.wmv.Rv0|
00000040 00 65 2e 67 69 66 00 71 fe 63 00 66 6c 61 67 2e |.e.gif.q.c.flag.|
00000050 6f 64 74 00 01 b7 64 00 66 72 2e 67 69 66 00 00 |odt...d.fr.gif..|
00000060 30 65 00 68 2e 67 69 66 00 6a 6d 6b 00 68 64 75 |0e.h.gif.jmk.hdu|
00000070 2e 70 6e 67 00 a2 09 6d 00 6a 2e 67 69 66 00 56 |.png...m.j.gif.V|
00000080 54 6f 00 6d 61 72 69 6f 2e 67 69 66 00 ed 03 70 |To.mario.gif...p|
00000090 00 6e 6f 74 68 69 6e 67 2e 70 6e 67 00 7e 28 72 |.nothing.png.~(r|
000000a0 00 6f 6b 2e 6d 70 33 00 69 2f 72 00 6f 6b 62 72 |.ok.mp3.i/r.okbr|
000000b0 6f 2e 70 6e 67 00 e1 bf 75 00 70 61 70 61 6e 6f |o.png...u.papano|
000000c0 65 6c 2e 70 6e 67 00 ed 78 76 00 72 65 61 6c 5f |el.png..xv.real_|
000000d0 73 61 6e 74 61 2e 70 6e 67 00 5f cd 7b 00 73 2e |santa.png._.{.s.|
000000e0 67 69 66 00 26 10 7f 00 73 61 64 2e 67 69 66 00 |gif.&...sad.gif.|
000000f0 5a ab 80 00 73 61 6e 74 61 2e 62 6d 70 00 71 fd |Z...santa.bmp.q.|
00000100 e3 00 73 61 6e 74 61 2e 67 69 66 00 a7 bc fa 00 |..santa.gif.....|
00000110 73 61 6e 74 61 66 6c 61 67 2e 70 64 66 00 80 28 |santaflag.pdf..(|
00000120 12 01 73 61 6e 74 61 7a 69 63 2e 6d 70 33 00 f0 |..santazic.mp3..|
00000130 60 17 01 73 65 63 72 65 74 2e 6a 70 67 00 07 a4 |`..secret.jpg...|
Reversing the NSAR file format
To extract the files stored in the NSAR archive, we need to understand how this file format works. To do so, we start reversing the NSAR program. First let’s have an overview of the binary:
➜ file nsar
nsar: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=efd6544321a3657b6e164861d1b1cb319c1ad3d6, for GNU/Linux 3.2.0, not stripped
➜ rabin2 -I nsar
arch x86
baddr 0x0
binsz 61416
bintype elf
bits 64
canary false
class ELF64
compiler GCC: (Debian 9.2.1-19) 9.2.1 20191109
crypto false
endian little
havecode true
intrp /lib64/ld-linux-x86-64.so.2
laddr 0x0
lang c++
...
It’s an ELF 64 bits binary coded in C++ for the x86 architecture. It’s now time to launch IDA.
The main function of the program consists of the multiple input prompts we saw earlier. The most important line is the one that comes near the end:
create_nsar((__int64)&OUTPUT_FILENAME, (__int64)&ENCRYPTION_KEY, (__int64)&FILEPATHES_TO_COMPRESS);
create_nsar
starts by defining two double-ended queues, which are basically queues with two edges: you can get, push or pop stuff both at the front and at the end of these queues.
std::deque<std::fpos<__mbstate_t>,std::allocator<std::fpos<__mbstate_t>>>::deque(&QUEUE_DOUBLE_E_1);
std::deque<std::fpos<__mbstate_t>,std::allocator<std::fpos<__mbstate_t>>>::deque(&QUEUE_DOUBLE_E_2);
We can divide the rest of the function in three main parts…
Creation of the nsar header
v29 = 78; // 'N'
v30 = 83; // 'S'
v31 = 65; // 'A'
v32 = 82; // 'R'// non relevant code ellapsedstd::basic_ofstream<char,std::char_traits<char>>::basic_ofstream((__int64)&OSSTREAM, OUTPUT_FILENAME, v3);
std::ostream::write((std::ostream *)&OSSTREAM, &v29, 16LL);
Those few lines show us the beginning of the NSAR header: the bytes NSAR
are written, followed by 12 NULL bytes (because the write() is done on 16 bytes). Note that here, OSSTREAM
refers to the output file stream.
Then the program iterates over all input files:
while ( (unsigned __int8)std::operator!=(&beginning, &end) )
{
// 1
first_element = std::_Rb_tree_const_iterator<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>::operator*(&beginning);
extract_filename_from_path(&FILENAME, first_element);
string = (char *)std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>::c_str(&FILENAME);
len = std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>::length(&FILENAME);
// 2std::ostream::write((std::ostream *)&OSSTREAM, string, len + 1);
// 3
position = std::ostream::tellp((std::ostream *)&OSSTREAM);
std::deque<std::fpos<__mbstate_t>,std::allocator<std::fpos<__mbstate_t>>>::push_back(&QUEUE_DOUBLE_E_1, &position);
// 4
number_wtf = 1061109567;
std::ostream::write((std::ostream *)&OSSTREAM, (const char *)&number_wtf, 4LL);
v41 = v5;
std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>::~basic_string(&FILENAME);
std::_Rb_tree_const_iterator<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>::operator++(&beginning);
}
For each file:
- Get the filename
- Write this name in the output file + 1 NULL byte (
len + 1
) - Save the current position in the output buffer in the first deque
QUEUE_DOUBLE_E_1
- Write an integer (4 bytes) to the output file
We can know start to tell the format of the header:
NSAR + 0x0 * 12 + filename1 + 0x0 + 4bytesInteger + filename2 + 0x0 + 4bytesInteger + ...
After this loop, the output file stream is closed. We are now going to see how the body of the file is defined.
Creation of the nsar body
The output file is then re-opened with gzopen
, this tells us that in the rest of the function, we will interact with the output file as a GZ compressed file.
In this part, the program iterates once again over the input files:
while ( (unsigned __int8)std::operator!=(&beginning2, &end2) )
{
// 1
first_element_2 = std::_Rb_tree_const_iterator<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>::operator*(&beginning2);
std::basic_ifstream<char,std::char_traits<char>>::basic_ifstream(&FILENAME, first_element_2, 4LL);// set current FILENAME// 2std::fpos<__mbstate_t>::fpos(&pos_actuelle, long_position);
std::deque<std::fpos<__mbstate_t>,std::allocator<std::fpos<__mbstate_t>>>::push_back(
&QUEUE_DOUBLE_E_2,
&pos_actuelle);
if ( (unsigned __int8)std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>::empty(ENCRYPTION_KEY) != 1 )
{
// 3
file_content = std::basic_ifstream<char,std::char_traits<char>>::rdbuf(&FILENAME);
key = std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>::c_str(ENCRYPTION_KEY);
key_length = std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>::length(ENCRYPTION_KEY);
do
{
charr = std::basic_streambuf<char,std::char_traits<char>>::sgetc(file_content);
x = 0;
if ( charr )
{
x = charr ^ *(_BYTE *)(index % key_length + key);// <=> charr ^ key[index % len(key)]if ( !x )
x = charr;
}
else
{
x = 0;
}
// 4
nb_written_should_be_one = gzwrite(gzFile, &x, 1LL);
long_position += nb_written_should_be_one;
++index;
}
while ( (unsigned int)std::basic_streambuf<char,std::char_traits<char>>::snextc(file_content) != -1 ); // while we're not at the end of the input file
}
}
Some explanations:
- Open the current file
- Push current position in output file to the second deque
- Here is the code defining the encryption process of the input file. It is a repeated XOR encryption, with a few exceptions:
- If the byte of the input file we currently want to XOR is null, we do not XOR it
- If the result of the XOR is null, we do not XOR the current byte of the input file
- Write the encrypted byte with
gzwrite
What does gzwrite
do ?
At first I thought that it wouldn’t compress anything because we use gzwrite
to write one byte at the time. To check if my assumption was correct, I created a new NSAR archive containing only a toast.pdf
file. Thanks to what we’ve learnt so far, we can tell the length of the corresponding archive header: len("NSAR") + 12 + len("toast.pdf") + 1 + 4 = 30
➜ tail -c +31 out.nsar | hexdump -C | head
00000000 1f 8b 08 00 00 00 00 00 00 03 8c ba 63 90 2e 3c |............c..<|
00000010 d0 36 38 33 67 6c db e6 3d b6 6d db b6 6d db 3a |.683gl..=.m..m.:|
00000020 63 db b6 6d 7b ce d8 b6 3d fb bc fb bd 5b 5b b5 |c..m{...=....[[.|
We can see that the body of the NSAR file starts with 1f 8b 08
which are the first bytes of the gzip file format.
So it appear that gzwrite
is using a buffer to store all the written input and wait for gzclose
to compress it and write it to the output file.
Final header
At the end of the function, we see this loop which iterates over all input filenames:
while ( (unsigned __int8)std::operator!=(&beg, &endd) )
{
// 1
front_header = (__int64 *)std::deque<std::fpos<__mbstate_t>,std::allocator<std::fpos<__mbstate_t>>>::front(&QUEUE_DOUBLE_E_1);
std::ostream::seekp((__int64)&OSSTREAM, *front_header, front_header[1]);
std::deque<std::fpos<__mbstate_t>,std::allocator<std::fpos<__mbstate_t>>>::pop_front(&QUEUE_DOUBLE_E_1);
// 2
v13 = (const char *)std::deque<std::fpos<__mbstate_t>,std::allocator<std::fpos<__mbstate_t>>>::front(&QUEUE_DOUBLE_E_2);
std::ostream::write((std::ostream *)&OSSTREAM, v13, 4LL);
std::deque<std::fpos<__mbstate_t>,std::allocator<std::fpos<__mbstate_t>>>::pop_front(&QUEUE_DOUBLE_E_2);
std::_Rb_tree_const_iterator<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>::operator++(&beg);
}
- take the first value of the FIRST deque (which points to the four byte integer after a filename in NSAR header), go to it in the output file and pop the value from the deque.
- take the first value of the SECOND deque (which points to the beginning of the corresponding file content before compression) and stores it in the current position in the output buffer.
And tada! We now know how the NSAR file format works:
NSAR FILE FORMAT
*=======================HEADER PART=======================*
NSAR + 0x0 * 12 + filename1 + 0 + offset1 + filename2 + 0
+ offset2 + filename3 + 0 + offset3 + ...
*=======================BODY PART=========================*
---------------------GZIP COMPRESSED----------------------
contentOfFile1 + contentOfFile2 + contentOfFile3 + ...
----------------------------------------------------------
*=========================================================*
Where offset{n}
is the position of the content of the nth file content in the uncompressed body.
Creating un-NSAR script
Now let’s make a Python script that properly extracts and unciphers the files contained in the archive.
Reading NSAR header
We retrieve the list of the files in the archive with their offset:
def get_fileinfos(content: bytes) -> Tuple[str, int]:
"""
reads the first filename and corresponding offset found in content
"""
index = 0
filename = ""
while content[index] != 0:
filename += chr(content[index])
index += 1
index += 1
tmp = []
for _ in range(4):
tmp.append(content[index])
index += 1
offset = int(tmp[3] << 24 | tmp[2] << 16 | tmp[1] << 8 | tmp[0]) # little endian <3return filename, offset
We can now open the file and read all header:
with open("/tmp/archive.nsar", "rb") as f:
content = f.read()
c = content[16:] # skip FORMAT name (NSAR) and padding
files = {}
ex_filename = None
while True:
filename, offset = get_fileinfos(c)
if offset == 0: # end of headerbreak
files[filename] = {"offset": offset}
if ex_filename:
files[ex_filename]["length"] = files[filename].get("offset") - file[ex_filename].get("offset")
ex_filename = filename
c = c[len(filename) + 5:] # go to next file in header
We now have a dictionary files
containing all archived files, their offset and their length in uncompressed body.
Uncompress the body
To retrieve the file, we need to uncompress the gzip-compressed body. This can be done easily with the gzip module:
import gzip
with open("/tmp/tmp.txt", "wb") as f:
f.write(c) # write only the compressed bodywith gzip.open("/tmp/tmp.txt", "rb") as f:
c = f.read() # c now contains the uncompressed body
header_size = len(content) - len(c)
content = content[:header_size] + c
Getting the key
We know that the body is XORed repetively with a key given by the chall creator. And we also know that if a ^ b = c
, then a ^ c = b
Thanks to that property, we will be able to retrieve the key because we know a part of the plaintext that has been encrypted: the first bytes of the header of the png format for example.
To get the key, we will XOR the first bytes of the encrypted png files with the first bytes of png header. We will then be able to retrieve parts of the keys (as it is a repetitive xor) and finally get the complete key.
Here is a function to check the part of key used at the beginning of an encrypted png:
def guess_png_key(files: dict, content: bytes, filename: str) -> None:
offset = files[filename].get("offset")
print(f"\n[*] trying to guess key at offset {offset} with {filename}")
header = b"\x89PNG\x0d\x0a\x1a\x0a\x00\x00\x00\x0dIHDR\x00\x00\x00" # mostcommon first fixed bytes in png headers
guessed_key = ""
for nb in range(len(header)):
x = content[offset + nb] ^ header[nb]
# don't forget the conditions we saw in the cpp code:if x == 0:
guessed_key += chr(content[offset + nb])
else:
guessed_key += chr(x)
print(f"Possible part of key {guessed_key}")
guess_png_key(files, content, "okbro.png")
guess_png_key(files, content, "papanoel.png")
guess_png_key(files, content, "hdu.png")
guess_png_key(files, content, "yu.png")
guess_png_key(files, content, "nothing.png")
Which outputs:
[*] trying to guess key at offset 7716833 with okbro.png
Possible part of key Up3R_Sec_KEYT
[*] trying to guess key at offset 7764205 with papanoel.png
Possible part of key YTh1s_iS_sUp3P
[*] trying to guess key at offset 7145890 with hdu.png
Possible part of key sUp3R_Se7_KEY0
[*] trying to guess key at offset 23107299 with yu.png
Possible part of key cr37_KEYs_iS_^
[*] trying to guess key at offset 7481470 with nothing.png
Possible part of key My_sUp3Rcr37_X
We can know assemble the complete key used for encryption: Th1s_iS_My_sUp3R_Secr37_KEY
.
Extract files
Now, all we need is to uncipher the body and cut it in different files to retrieve the original input files:
import os
def decrypt_all(files: dict, content: bytes, key: str) -> None:
directory = f"_extracted_{sys.argv[1]}"
os.mkdir(directory)
j = 0
for file in files:
with open(f"{directory}/{file}", "wb") as f:
offset = files[file].get("offset")
length = files[file].get("length")
print(f"[*] inflating {directory}/{file}")
i = 0
while i < length:
x = content[offset + i] ^ ord(key[j % len(key)])
if x == 0:
x = content[offset + i]
if content[offset + i] == 0:
x = 0
f.write(bytes([x]))
i += 1
j += 1
Then:
➜ _extracted_archive.nsar ls
a.out e.gif hdu.png ok.mp3 s.gif santaflag.pdf ttt.gif
bbb.mp4 flag.odt j.gif okbro.png sad.gif santazic.mp3 twerk.gif
cd.gif fr.gif mario.gif papanoel.png santa.bmp secret.jpg udt.jpg
dlpmjcc.wmv h.gif nothing.png real_santa.png santa.gif tkt.avi yu.png
The flag can now be retrieved from santaflag.pdf
: