## Exploiting logical bug to solve NP-complete reversing puzzle | Diplodocus @ FCSC 2022

# Intro

Diplodocus is I think by far my favorite challenge of the 2022 edition of the FCSC. First it is a reverse challenge, my reference category, then it does not feature weird unknown CPU architecture, just plain x64 with the only layer of obfuscation being the heavy optimization of the compiler. Finally it is an algorithmic problem, and I love algorithms.

The complexity of this challenge relies on the underlying problem it implements. Diplodocus is the successor of Triceratops, a similar challenge from the 2021 edition of the FCSC where you were also asked to solve an NP-complete puzzle to get the flag.

The twist here is that a conceptual flaw of the program allows us to recover the flag without actually solving the intended puzzle, but more on that later.

## Challenge description

`reverse`

| `477 pts`

`10 solves`

`:star::star::star:`

```
Trouvez une entrée qui valide, et soumettez-la au service en ligne pour obtenir
le flag.
`nc challenges.france-cybersecurity-challenge.fr 2201`
SHA256(`diplodocus`) =
`9af3062da630d2b94ad3bfa0b5fd67328d2c6c7bbb79607d7d2fa28a67c7ff9c`.
```

Author: `\J`

## Given files

# Writeup

## Overview

As stated the program is simply an x64 stripped ELF.

```
$ file diplodocus
diplodocus: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV),
dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2,
BuildID[sha1]=c489721789271e74d301cda200feb877bd22d80a, for GNU/Linux 3.2.0,
stripped
```

If we try to execute the program, it reads a line from standard input and exits with status code 1.

```
$ ./diplodocus
input
$ echo $status
1
```

stracing and ltracing doesn’t apport much more information, the program indeed
only calls `read(2)`

.

## Main

It’s now time to open our favorite decompiler and investigate the main function.

After cleaning up the decompiler output and renaming the variables we get the following code:

```
int32_t main(int32_t argc, char** argv, char** envp)
{
char input[0x400];
ssize_t len = read(0, &input, 0x400);
if (len <= 0)
{
puts("Error.");
exit(1);
/* no return */
}
int32_t res = check(&input, len);
if (res == 0)
{
FILE* file = fopen("flag.txt", &r);
if (file == 0)
{
puts("Well done! You can submit your i…");
exit(1);
/* no return */
}
char flag[0x46];
if (fread(&flag, 1, 0x46, file) != 0)
{
__printf_chk(1, "Well done, the flag is %s\n", &flag);
}
fclose(file);
}
return res;
}
```

The code is really simple here, it reads the standard input and pass the input to a function that seems to perform a check.

If the check is sucessful, the program prints the flag, otherwise it exits with 1.

## Check

The Check function starts by setting up some local variables like a pointer to the start and to the end of the input and what seems to be a struct that I called context that may contain informations relevant to the ongoing puzzle. Most of this struct is set to 0:

```
uint64_t check(char* input, int64_t len)
{
char* cursor = input;
void* end = &cursor[len];
void* fsbase;
struct context context;
int64_t* context_ptr = &context;
for (int64_t i = 0x14; i != 0; i = (i - 1))
{
*(int64_t*)context_ptr = 0;
context_ptr = &context_ptr[1];
}
context.__offset0x8_q = 0x100ffffff;
*(int32_t*)context_ptr = 0;
...
}
```

## Instruction fetching and dispatch

My first observation of this main function made me think it was a really small virtual machine. We can see it goes through every character in the input and matches it against opcodes between 0 and 4 included. Every other values seems to end the main loop with a return value of 1, indicating an error.

Since I see that the return value is set by oring a field of the context (which I assumed at that time was a processor structure) I assume that this field holds some sort of error flags that will invalidate our puzzle.

We then have the main dispatcher, processing the correct instruction. One thing
that can be noted is the two `break`

statement in `case 0`

, this is obviously
my decompiler trolling me but what he is trying to say is that this case will
end the main loop as well. It thus probably is the only way to end the function
with a return value of 0. It must then be the instruction performing the final
check after executing all our instructions. We will take a look at it last
since other instructions will give us a better idea of the internal structure
of the context and are easier to reverse.

```
uint64_t check(char* input, int64_t len)
{
...
uint64_t res;
while (cursor < end)
{
char* next = &cursor[1];
if (*(int8_t*)cursor > 4)
{
res = ((uint64_t)(context.err | 1));
break;
}
switch (*(int8_t*)cursor)
{
case 0:
{
...
break;
break;
}
case 1:
{
...
break;
}
case 2:
{
...
break;
}
case 3:
{
...
break;
}
case 4:
{
...
break;
}
}
cursor = next;
}
if ((cursor >= end || (cursor < end && *(int8_t*)cursor <= 4)))
{
res = 1;
}
return res;
}
```

### Case 1

This instruction is really simple, it makes one field of the context cycle
between 0 and 3 included. Let’s simply remember the offset of this field
(`0xa0`

) for now to see where it is used.

```
case 1:
{
context.__offset0xa0_d = ((context.__offset0xa0_d + 1) & 3);
break;
}
```

### Case 2

Well we did not need to wait a long time to see where the field from case 1 is used. Right at the beginning of case 2, the program matches the value of the variable against all its possible values, it then creates a copy of two other variable of the context and update them depending on our matching.

The 2 variables are then bound checked against `0xb`

and used to index what
seems to be an array of `int32_t`

. We can therefore understand that the two
variables are indexes of this array and that the variable from case 1 is an
enum describing possible movements in this array. I therefore name them `move`

,
`i`

and `j`

.

Next we notice that we check that the `jth`

bit of the `ith`

row of the array,
if is not set, it sets the said bit and update the coordinates in the context.
If however it is already set, the program sets the error flag that we already
identified earlier.

Clearly this is some sort of 12 * 12 bitboard implementation, the first thing that came to my mind is chess since I’m familiar with bitboard based chess engines programming so it was at this exact moment that I understood this problem was probably a puzzle or some sort of board game.

So to rephrase all this information, this instruction changes our coordinates on a 12 * 12 board based on a preselected move from case 1, it places a marker at our new coordinates and errors if we already visited this tile.

I now know one rule of the game: I cannot step twice on the same tile. By lack
of inspiration, I name this bitboard `visited`

since it keeps track of all my
visited tiles.

Bellow is the cleaned up pseudo-C code from the decompiler with all variables renamed accordingly.

```
case 2:
{
enum moves move = context.move; //__offset0xa0
int32_t j = context.j;
int32_t i = context.i;
if (move == 0)
{
j = (j + 1);
}
else if (move == 1)
{
i = (i - 1);
}
else if (move == 2)
{
j = (j - 1);
}
else
{
op = op == 3;
i = (i + ((uint32_t)op));
}
int32_t row;
if ((j <= 0xb && i <= 0xb))
{
row = context.visited[i];
}
if (((j > 0xb || (j <= 0xb && i > 0xb)) || ((j <= 0xb && i <= 0xb) && (TEST_BITD(row, j)))))
{
context.err = (context.err | 1);
}
if (((j <= 0xb && i <= 0xb) && (!(TEST_BITD(row, j)))))
{
context.i = i;
context.j = j;
context.visited[i] = (row | ((int32_t)(1 << j))); // set bit
}
break;
}
```

And here is the moves enum to understand better how case 1 manipulates our moves.

```
enum moves : uint32_t
{
NEXT_COL = 0x0,
PREV_ROW = 0x1,
PREV_COL = 0x2,
NEXT_ROW = 0x3
};
```

### Case 3

Case 3 uses a variable from the context that we previously saw initialized to
`0xffffff`

at the very start of the check function without understanding it.

We can now see that it is in fact a queue of 24 bits that is popped in case 3 to place the said bit on a different bitboard than case 2 but with the same coordinates. If the bit queue is empty then we output an error.

So I understand here that we must place 24 bits on another board while simultaneously moving on the first one that tracks the tiles we already visited.

Seems easy enough, however, the instruction does not end here, it calls a
function passing the row of the bitboard as parameter, if the return value of
this function is more than `2`

we output an error.

```
case 3:
{
int32_t bits_queue = context.bits_queue; //offset0x8_q
int64_t i = ((int64_t)context.i);
int32_t bits_cleared;
bits_cleared = bits_queue == 0;
int32_t new_flags = (bits_cleared | context.err);
uint64_t row = ((uint64_t)(((bits_queue & 1) << ((int8_t)context.j)) | context.board[i])); // set bit
context.bits_queue = (bits_queue >> 1); // pop bit queue
context.board[i] = row; // save new board with new bit
int32_t sub_res;
sub_res = sub_1c60(row);
sub_res = sub_res > 2;
context.err = (new_flags | ((uint32_t)sub_res));
break;
}
```

#### Pop count

Here is the said function, so it seems to perform magic bitwise operation. Could be anything really, my intuition tells me it count the number of bits set on the line but it may as well be some weird bitboard magic, I mean you never know the stuff I found on chess programming while working on chess engines really blows my mind.

By googling the first constant `0x5555555555555555`

I confirm really fast my
intuition. It is indeed a population count function, the fifth result of the
search being, guess what? Chess programming wiki on population
count. I now really start
thinking that this is actually a chess problem.

```
uint64_t pop_count(int64_t row)
{
int64_t rdi = (row - ((row >> 1) & 0x5555555555555555));
int64_t rdi_3 = (((rdi >> 2) & 0x3333333333333333) + (rdi & 0x3333333333333333));
return (((((rdi_3 >> 4) + rdi_3) & 0xf0f0f0f0f0f0f0f) * 0x101010101010101) >> 0x38);
}
```

With this information we can now rename the variables and symbols from case 3 and we now understand that we place a bit from the bit queue on the board, we cannot allign more than 2 bits on the same line.

```
case 3:
{
int32_t bits_queue = context.bits_queue;
int64_t i = ((int64_t)context.i);
int32_t bits_cleared;
bits_cleared = bits_queue == 0;
int32_t new_flags = (bits_cleared | context.err);
uint64_t row = ((uint64_t)(((bits_queue & 1) << ((int8_t)context.j)) | context.board[i])); // set bit
context.bits_queue = (bits_queue >> 1); // pop bit queue
context.board[i] = row; // save new board with new bit
int32_t count;
count = pop_count(row);
count = count > 2;
context.err = (new_flags | ((uint32_t)count));
break;
}
```

### Case 4

Do not be misled, this instruction seems simple enough but is by far the hardest one of the challenge.

So first this instruction fetch 2 operand, these operands are later used to
access a third bitboard, so I name them accordingly `i`

and `j`

.

Then we check if our bit queue is empty, if it is not then we output an error. This mean that we can call this instruction only when we placed all the 24 points at our disposition.

Now comes the fun part, we call a function passing the coordinates and the whole context as parameter, this function will output a bit that will be placed on a third bitboard at the coordinates indexed by the operands.

So my decompiler really despise this function, I told you it took the whole
context as parameters but its not exactly true as it actually packs the whole
context in 10 `int128_t`

arguments using SIMD instructions. I cleaned up the
function call for your eyes so you do not have de keep track of the 12
parameters of the function.

```
case 4:
{
int32_t j = ((int32_t)cursor[1]);
int64_t i = ((int64_t)cursor[2]);
int32_t bits_set;
bits_set = context.bits_queue != 0;
context.err = (context.err | bits_set);
next = &cursor[3];
int32_t bit = sub_1570(j, i, context);
context.third_board[i] = (context.third_board[i] | (bit << j));
break;
}
```

#### \J dabbing on me

You can see below a decompilation of the function, looks fun right? And this is actually cleaned up so you don’t see the SIMD registers and weird struct packing. Now I really recommend you to not actually read this code but to get the main idea from the comments and my explanations. A much more readable python reimplementation is available right after for your sanity.

Most variables are not renamed correctly because I did not actually reverse and understood all this function. This is simply what it looked like in my decompiler at the moment I undestood enough to throw it by the window, modulo of course SIMD and stack fengshui 👀.

Let’s not care too much about weird constants and implementation details for now as I did not understood them at the time either.

The main idea of this is code is that it will run through all possible increments combinations that will iterate on the second board (the one from case 3, where we manually place our bits from the bit queue).

Basically, we check every possible way we can run through the board.

For each of these increment combination we see a weird loop performing modulos of the increments. This is actually the euclidean algorithm that computes the GCD of 2 numbers. This GCD is compared to 1. If it is not one then we skip this increments combination and go to the next loop iteration. We are therefore only concerned about running through the board using coprime increments, confusing I know.

We then iterate on the board using these increments and count the number of
bits set while doing so. If at any point we encounter more than 2 points during
the same passing of the board we set a result flag. The function will output
`1`

if and only if no result flag was set for any iteration. Meaning for any
traversal of the board, we did not encounter more than 2 bits.

Here I was really scared because I started to think that this was maybe `\J`

trolling me with again a cryptography problem modelised using bitboards are
something. And everyone who checks my results of the FCSC know how bad I am at
cryptography.

```
uint64_t sub_1570(int32_t j, int32_t i, struct context context)
{
int32_t res = 0;
int32_t *board = context.board;
int32_t res_flag = 0;
if ((j <= 0xb && i <= 0xb))
{
int32_t r12_1 = (i + 0x79); // ignore this
int32_t y = -0xb; // First increment
int32_t rbp_1 = (j + 0xb); // ignore this too
while (true)
{
int32_t y_cpy = (y + 1);
if (y != 0)
{
int32_t r8 = r12_1; // ignore for now
int32_t x = -0xb; // Second increment
int32_t neg = (y >> 0x1f); // Probably interesting
int32_t r11_2 = ((y_cpy * -0xb) + rbp_1); // but I can't seem
int32_t y_cpy_cpy = ((neg ^ y) - neg); // to care enough
while (true)
{
if (x != 0)
{
int32_t neg = (x >> 0x1f); // Abs value again or
int32_t x_cpy = ((neg ^ x) - neg); // something
int32_t y_cpy_cpy_cpy = y_cpy_cpy;
int32_t count;
while (true) // Euclidean algorithm
{
int32_t y_cpy_cpy;
int32_t _;
_ = HIGHW(((int64_t)y_cpy_cpy_cpy));
y_cpy_cpy = LOWW(((int64_t)y_cpy_cpy_cpy));
int32_t mod = (COMBINE(y_cpy_cpy, y_cpy_cpy) % x_cpy);
y_cpy_cpy_cpy = x_cpy;
count = mod;
if (mod == 0)
{
break;
}
x_cpy = mod;
}
if (x_cpy == 1) // Check if increments are coprime
{
int32_t row = r8; // Stuff we ignored but seems to
int32_t column = r11_2; // be the starting point
int32_t m = 0x17;
int32_t m_cpy;
do // Run through the board using coprime increments
{
// Check if we are inside the board
if ((column <= 0xb && row <= 0xb))
{
// Count the bits
count += (board[row] >> column) & 1;
}
column = (column + y);
row = (row + x);
m_cpy = m;
m = (m - 1);
} while (m_cpy != 1);
// STOP THE COUNT
int32_t too_much = count > 2;
res_flag = (res_flag | too_much);
}
if (x == 0xb) // End the while true loops and iterate
{
break;
}
}
x = (x + 1); // Update increment
r8 = (r8 - 0xb);
}
if (y_cpy == 0xc)
{
break;
}
}
y = y_cpy; // Update increment with the next one
}
res = res_flag == 0;
}
return ((uint64_t)res);
}
```

#### Alignement count

So now I am really unhappy, I was having fun finding out puzzle and placing
bits on boards but now `\J`

is throwing modular arithmetic at me.

I take a break crying in my bed after these findings before actually thinking.

Running through the array using coprime increments probably has a really interesting property that may be plotted visually.

So I reimplemented this function in python but instead of summing the bits I encountered, I mark the bits I would have summed to see the path that I take in the array.

Which gives us this much more readable code:

```
#!/usr/bin/env python3
import os
from math import gcd
def pretty_print(arr):
res = ""
for row in arr:
res += '|'
for el in row:
res += el + '|'
res += '\n'
return res
def count_alligned(i, j):
for y in range(-0xb, 0xc):
if y == 0:
continue
start_row = i + 0x79
start_column = (y + 1) * -0xb + j + 11
for x in range(-0xb, 0xc):
if x == 0:
start_row -= 0xb
continue
pgcd = gcd(y, x)
if pgcd == 1:
arr = [[' ' for _ in range(0xc)] for _ in range(0xc)]
row = start_row
column = start_column
for _ in range(0x17):
if column <= 0xb and row <= 0xb and column >= 0 and row >= 0:
print(row, column)
arr[row][column] = 'X'
column += y
row += x
print(pretty_print(arr))
start_row -= 0xb
count_alligned(5, 6)
```

With this output (only a sample):

```
10 8
5 6
0 4
| | | | |X| | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | |X| | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | |X| | | |
| | | | | | | | | | | | |
11 10
8 8
5 6
2 4
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | |X| | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | |X| | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | |X| | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | |X| |
7 10
6 8
5 6
4 4
3 2
2 0
| | | | | | | | | | | | |
| | | | | | | | | | | | |
|X| | | | | | | | | | | |
| | |X| | | | | | | | | |
| | | | |X| | | | | | | |
| | | | | | |X| | | | | |
| | | | | | | | |X| | | |
| | | | | | | | | | |X| |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
3 10
4 8
5 6
6 4
7 2
8 0
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | |X| |
| | | | | | | | |X| | | |
| | | | | | |X| | | | | |
| | | | |X| | | | | | | |
| | |X| | | | | | | | | |
|X| | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
2 8
5 6
8 4
11 2
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | |X| | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | |X| | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | |X| | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | |X| | | | | | | | | |
```

It is now obvious what this function is doing, it draws every straight line passing through the point given as parameter and checks if more than 2 points of the board are alligned.

If we remember the context were this function is used, it means that we will set a bit in the third bitboard if and only if no line passing through this point intersects more than 2 points.

### Case 0 (Final check)

So case 0 is the instruction that performs the final check to see if our board match the desired state. Again there is a lot of SIMD magic going on so I tried to clean it a little bit, you lose the actual result of the decompilation but the code is semantically equivalent.

First we perform a pop count on all lines of the `visited`

bitboard, comparing
the total sum to `0x90`

which is `12 * 12`

, the total number of tiles. We must
therefore go through every single tile of the board exactly once.

Then, we pack the third bitboard (the one that stores if 3 points are alligned)
into the `zmm0`

128 bit register because why not? And basically `shifting`

and
`bitwise anding`

it to check that every bit of the board are set. So its
basically the same check than before but for the third bitboard, meaning that
no line drawn from any point on the board must intersect with more than 2
points.

The last check simply performs a xor on every line of the board and outputs an error if it is different from 0. This means that every column of the board must contain an even number of points.

```
case 0:
{
int32_t count = 0;
for (int i = 0; i <= 0xb; ++i) // Refactorized SIMD instruction as loop
count += pop_count(context.visited[i]);
int32_t err = count != 0x90;
err = err | context.err;
context.err = err; // Set error
int128_t third_board_1 = context.third_board; // Trying to clean up SIMD
int128_t third_board_2 = *(((uint128_t *)&context.third_board) + 1);
int128_t third_board_3 = *(((uint128_t *)&context.third_board) + 2);
int128_t zmm0 = ((third_board_1 & third_board_2) & third_board_3);
zmm0 = (zmm0 & _mm_bsrli_si128(zmm0, 8)); // More SIMD magic
zmm0 = (zmm0 & _mm_bsrli_si128(zmm0, 4));
int32_t all_set = (zmm0 & 0xfff) != 0xfff;// Check that context.third_board
err = all_set | err; // is all set to 1
context.err = err;
uint32_t flag = 1; // Set error flag
if (context.bit_queue == 0) // Check that we emptied the bit queue
{
int32_t x = 0;
for (int i = 0; i <= 0xb; ++i) // SIMD refactor
x ^= context.board[i];
flag = x;
}
res = ((uint64_t)(err | flag));
break;
break;
}
```

Great we have everything ready to go! We simply need to go trough the board and place 24 points on it and make sure there are never 3 points alligned.

So it turns out this problem is NP-complete (I learned after solving the challenge that is called the Not-three-in-line problem), I might have a hard time solving it manually. I might start to implement a SAT solver or som…

But wait, did you spot something sus with this implementation ?

## Bypassing the puzzle

There are actually two distinct bugs in this implementation of the puzzle.

First, the program does not check that you place a point when there is already one. It does check that you do not go twice on the same tile but you can still place multiple points without moving.

The second, the one I exploited here, is that the weird function we
reimplemented actually only count alligned points on all straight lines
**except** for lines and columns.

The special case of the line is checked independently in case 3 when inserting the point if you remember well.

However the only columns check is in case 0 where it simply checks that there is an even number of point per column.

Nothing forbids us to put 12 points on the first and last column, which validates all our constraints.

```
|X| | | | | | | | | | |X|
|X| | | | | | | | | | |X|
|X| | | | | | | | | | |X|
|X| | | | | | | | | | |X|
|X| | | | | | | | | | |X|
|X| | | | | | | | | | |X|
|X| | | | | | | | | | |X|
|X| | | | | | | | | | |X|
|X| | | | | | | | | | |X|
|X| | | | | | | | | | |X|
|X| | | | | | | | | | |X|
|X| | | | | | | | | | |X|
```

## Solve

We now simply need to write the input that will give instructions to the program to write this bugged board.

We first fill the first column, by going downward.

Then we go through the 10 next columns without placing any bit and we fill the last one.

Finally we need to fill the third bitboard by manually performing the alignment check for every point in the bitboard so we call case 4 for every coordinates.

We do not forget to put case 0 at the end to validate everything went fine.

```
#!/usr/bin/env python3
import os
def check_align(i, j):
return b'\x04' + j.to_bytes(1, 'little') + i.to_bytes(1, 'little')
def cycle():
return b'\x01'
def done():
return b'\x00'
def move():
return b'\x02'
def place():
return b'\x03'
def fill_column(skip = True):
res = b''
for _ in range(0xb):
if not skip:
res += place()
res += move()
if not skip:
res += place()
return res
def fill():
res = b''
res += cycle() * 3
res += fill_column(False)
for _ in range(5):
res += cycle() + move() + cycle()
res += fill_column()
res += cycle() * 3 + move() + cycle() * 3
res += fill_column()
res += cycle() + move() + cycle()
res += fill_column(False)
for i in range(0xc):
for j in range(0xc):
res += check_align(i, j)
return res + done()
res = fill()
os.write(1, res)
```

```
$ ./solve.py | ./diplodocus
Well done! You can submit your input to the remote service to grab the flag!
```

```
$ ./solve.py | nc challenges.france-cybersecurity-challenge.fr 2201
Well done, the flag is FCSC{bf809d2614501166a890740116103410a69ede950b57a9186bf49eb734eaa1a1}
```