Usubac - backend

Published on 2020-06-06 by Darius Mercadier

Usubac’s backend is responsible for optimizing the Usuba0 code and ultimately generating C code. Masking is also done in the backend, but will be presented in a later post.

Generating C code allows us to partially rely on the C compilers’ optimizers to improve the performance of the generated ciphers. However, that alone would not be sufficient to achieve performance similar to carefully hand-tuned code.

We divide our optimizations into two categories. The simple ones (common subexpression elimination, inlining, unrolling…) are already done by most C compilers, but we still perform them in Usuba, mainly to improve the effectiveness of the more advanced ones, but also to avoid relying too much on the C compilers’ optimization heuristics, which may not be tailored to cryptographic code. The more advanced optimizations include two scheduling algorithms and an interleaving pass, which are presented in separate posts.

Those advanced optimizations are only possible thanks to the knowledge we have in Usuba regarding the code we are dealing with (ciphers), which C compilers do not have. One of our scheduling algorithms is thus tailored to reduce spilling in the linear layers of bitsliced ciphers, while the other improves instruction level parallelism by mixing the linear layers and S-boxes of msliced ciphers.

Usuba’s dataflow programming model is also key for our interleaving optimization: since Usuba manipulates streams rather than scalars, we are able to increase instruction level parallelism by processing several elements of the input stream simultaneously.

Autotuning

The impact of some optimizations is hard to predict. Inlining reduces the overhead of calling functions, but increases register pressure and produces code that does not fully exploit the μop cache (DSB). Our interleaving algorithm improves instruction level parallelism, at the expense of register pressure. Our scheduling algorithm for bitsliced code tries to reduce register pressure, but sometimes fails to offer any speedup. Our scheduling algorithm for msliced code increases instruction level parallelism, but this sometimes either increases register pressure or simply produces code that the C compiler is less keen to optimize.

Since our goal is to produce the most efficient code possible, it is counter-productive for some optimizations to reduce performance. In order to overcome those “failed” optimizations, Usubac’s autotuner benchmarks heuristic optimizations (inlining, scheduling, interleaving), and applies only those that improve performance.

Autotuning is particularly potent in Usuba because we are in a setting where:

  • The compilation time is not as important as with traditional C compilers. The generated ciphers will likely be run for a long time, and need to be optimized for performance in order not to bottleneck any application. Furthermore, time-costly optimizations can be disabled while prototyping in order to speed up debugging, and enabled again to generate the final optimized C code.

  • The control flow is independent of the inputs, as a direct consequence of using bitslicing and mslicing. While benchmarking a generic C program requires a representative workload (at the risk of missing a frequent case), every run of a Usuba program will have the same performance regardless of its inputs, by design.

However, rather than considering nodes (resp. loops) one by one for inlining (resp. unrolling), Usubac’s autotuner evaluates the impact of inlining all nodes and unrolling all loops. While this may lead to sub-optimal performance, the space of all possible combinations of inlining, unrolling, scheduling and interleaving may be too large to explore in reasonable time. This is a known issue of autotuning, which is solved for instance in Halide [1] by using higher-level heuristics to guide the search of the autotuner.

We leave it for future work to improve Usubac’s autotuner so that it can evaluate a larger space of optimizations. A way to achieve this could be to use static analysis to prevent some branches from being explored by the autotuner. For instance, a node with fewer than 2 instructions will always be more efficient inlined (removing the overhead of calling a function). Similarly, instructions per cycle (IPC) can be statically estimated to guide optimizations: interleaving, for instance, has no way to improve the performance of a code whose IPC is already 4.

At the moment, Usubac’s autotuner runs its benchmarks on the machine used to compile the Usuba program, and the tuning resulting from those benchmarks may not be optimal for another architecture. In order to cross-compile for another architecture, the autotuner should either be disabled, or modified to run its benchmarks (remotely) on another machine.
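
To illustrate the principle (a minimal sketch: Usubac’s actual autotuner compiles and benchmarks entire ciphers through its own pipeline, and the function names below are hypothetical), selecting an optimization boils down to timing both variants and keeping the faster one:

#include <stdint.h>
#include <time.h>

/* Two compiled variants of the same cipher: with and without a
   candidate optimization (hypothetical functions for this sketch). */
void cipher_with_opt(void);
void cipher_without_opt(void);

/* Return the time (in nanoseconds) taken by `runs` calls to `f`. */
static uint64_t bench(void (*f)(void), int runs) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < runs; i++) f();
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (uint64_t)(end.tv_sec - start.tv_sec) * 1000000000u
         + (uint64_t)(end.tv_nsec - start.tv_nsec);
}

/* Apply the optimization only if it actually pays off. Since the
   control flow of sliced code is input-independent, any input
   yields representative timings. */
int keep_optimization(void) {
    return bench(cipher_with_opt, 100000) < bench(cipher_without_opt, 100000);
}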

Common Subexpression Elimination, Copy Propagation, Constant Folding

Common subexpression elimination (CSE) is a classical optimization that aims at preventing identical expressions from being computed multiple times. When an expression that has already been computed is encountered again, it is replaced by the previously computed value. For instance,

x = a + b;
y = a + b;

would be transformed by CSE into

x = a + b;
y = x;

Copy propagation is also a traditional optimization: it removes assignments of one variable to another by propagating the original variable to subsequent uses. For instance,

x = a + b;
y = x;
return y;

would be transformed by CP into

x = a + b;
return x;

Finally, constant folding consists in computing constant expressions at compile time. The expressions that Usuba simplifies using constant folding are either arithmetic or bitwise operations between constants (e.g. replacing x = 20 + 22 by x = 42), or bitwise operations one of whose operands is 0 or 0xffffffff (e.g. replacing x = a | 0 by x = a).
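
Following the same presentation as for CSE and copy propagation,

x = 20 + 22;
y = a | 0;

would be transformed by constant folding into

x = 42;
y = a;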

CSE, copy propagation and constant folding are already done by C compilers. However, performing them in Usubac has two main benefits:

  • It produces smaller C code, often at very little cost to readability. A non-negligible part of the copies removed by copy propagation comes from temporary variables introduced by the compiler itself; removing such assignments actually improves readability. This matters particularly for bitsliced code, which can contain tens of thousands of lines of C code after those optimizations, and would contain hundreds of thousands of lines without them: on average, those optimizations reduce the number of C instructions of the bitsliced ciphers generated by Usuba by 67%. For instance, bitsliced ACE is about 200,000 instructions without those optimizations but only 50,000 with them.

  • It makes the scheduling optimizations more potent. Needless assignments and redundant expressions increase the number of live variables, which may throw off the schedulers.

Loop Unrolling

Unrolling is actually done in the frontend rather than in the backend. However, it is considered an optimization in most compilers, and heavily impacts performance, which is why we discuss it now.

Normalization

In two cases, unrolling is necessary to normalize Usuba code down to Usuba0. The first case corresponds to shifts and rotations on tuples that depend on loop variables. For instance,

forall i in [1, 2] {
    (x0,x1,x2) := (x0,x1,x2) <<< i;
}

Since rotations on tuples are resolved at compile time, this requires the loop to be unrolled into

(x0,x1,x2) := (x0,x1,x2) <<< 1;
(x0,x1,x2) := (x0,x1,x2) <<< 2;

Which is then simplified to

(x0,x1,x2) := (x1,x2,x0);
(x0,x1,x2) := (x2,x0,x1);

Which will be optimized away by copy propagation in the backend.

Unrolling is also needed to normalize calls to nodes from arrays of nodes within loops down to Usuba0. This is for instance the case with Serpent, which uses a different S-box for each round, and whose main loop is thus:

forall i in [0,30] {
    state[i+1] = linear_layer(sbox<i%8>(state[i] ^ keys[i]))
}

Which, after unrolling, becomes:

state[1] = linear_layer(sbox0(state[0] ^ keys[0]))
state[2] = linear_layer(sbox1(state[1] ^ keys[1]))
state[3] = linear_layer(sbox2(state[2] ^ keys[2]))
...

Note that we chose to exclude both shifts and arrays of nodes from Usuba0 because they would generate sub-optimal C code, which would rely on the C compilers’ willingness to optimize away those constructs. For instance, we could have introduced conditionals in Usuba in order to normalize the first example (tuple rotation) to

forall i in [1,2] {
    if (i == 1) {
        (x0,x1,x2) := (x1,x2,x0);
    } elsif (i == 2) {
        (x0,x1,x2) := (x2,x0,x1);
    }
}

And the second example (array of nodes) to

forall i in [0,30] {
    if (i % 8 == 0) {
        state[i+1] = linear_layer(sbox0(state[i] ^ keys[i]))
    } elsif (i % 8 == 1) {
        state[i+1] = linear_layer(sbox1(state[i] ^ keys[i]))
    } elsif ...
    }
}

This Usuba0 code would then have been compiled to C loops. However, to be efficient, it would have relied on the C compiler unrolling the loop in order to remove the conditionals and optimize (e.g. with copy propagation) the resulting code. In practice, C compilers avoid unrolling large loops, and performing the unrolling in Usuba leads to better and more predictable performance.

Optimization

In practice, Usubac automatically unrolls all loops by default. Experimentally, this produces the most efficient code. The user can still use the flag -no-unroll to disable non-essential unrolling (i.e. unrolling that is not required by the normalization). In the following, we explain the reasoning behind the aggressive unrolling performed by Usubac.

Loop unrolling is clearly beneficial for bitsliced ciphers, as almost all ciphers contain permutations, shifts or rotations of some kind to implement their linear layers. After unrolling (and only in bitslice mode), those operations can be optimized away at compile time by copy propagation.

For msliced code, the rationale for unrolling is more subtle. Very small loops are always more efficient when unrolled, since the overhead of looping would hurt performance.

Furthermore, unrolling small and medium-sized loops is often beneficial as well, because it allows our scheduling algorithm to be more efficient. Most loops contain dependencies from one iteration to the next, which may limit their performance; they thus benefit from being unrolled and interleaved (by the scheduler) with other parts of the cipher. The scheduling post shows for instance the example of ACE, which is bottlenecked by a loop that our scheduling algorithm is able to optimize by interleaving 3 loops (provided that they are unrolled).
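
As a simplified illustration (hypothetical code, not taken from an actual cipher; ROTL stands for a rotation macro), consider a loop whose iterations form a dependency chain:

for (int i = 0; i < 3; i++) {
    x = ROTL(x ^ k[i], 7);   /* each iteration needs the previous x */
}

Unrolling does not break the chain itself, but it exposes the three instructions to the scheduler, which can then interleave them with independent computations from other parts of the cipher, keeping the execution units busy.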

We may want to keep large loops in the final code in order to reduce code size and maximize the usage of the μop cache (DSB). For instance, in Chacha20, unrolling the main loop (i.e. the one that calls the round function) does not offer any performance improvement (nor does it decrease performance, for that matter). However, it is hard to choose an upper bound above which unrolling should not be done. For instance, Pyjamask contains a matrix multiplication in a loop:

forall i in [0, 3] {
    output[i] = mat_mult(M[i], input[i]);
}

After inlining, mat_mult becomes 160 instructions, which is more than the number of instructions in Rectangle’s round (15), Ascon’s (32) or Chacha20’s (96). However, those instructions are heavily bottlenecked by data dependencies, and unrolling the loop in Usuba (Clang chooses not to unroll it on its own) speeds up Pyjamask by a factor of 1.68.

In practice, we thus chose to aggressively unroll all loops in Usuba, regardless of their contents. We never observed any performance regression because of our unrolling.

Inlining

The decision to inline nodes is partly justified by the usual reasoning applied by C compilers: a function call implies a significant overhead which, for very frequently called functions (such as S-boxes, in our case), outweighs the increase in code size caused by inlining. However, Usuba’s msliced code scheduling algorithm tries to interleave nodes to increase instruction level parallelism, and thus requires them to be inlined. We therefore perform some inlining in Usuba. The following table shows the speedups gained by inlining all nodes of some msliced ciphers, compared to inlining none:

Inlining speedup (mslicing):

                  clang          gcc
Cipher          x86    AVX2    x86    AVX2
ACE             1.54   1.33    1.64   1.01
AES             -      1.01    -      1.43
Ascon           1.20   1.01    1.89   1.15
Chacha20        1.25   1.11    1.23   1.20
Clyde           1.16   1.02    1.16   1.22
Gift            1.69   0.93    1.37   1.05
Gimli           0.97   0.99    1.23   1.33
Pyjamask        1.35   0.99    1.08   1.11
Rectangle (H)   -      0.96    -      0.97
Rectangle (V)   1.00   0.99    0.97   0.96
Serpent         1.01   0.99    1.27   1.27
Xoodoo          1.25   0.98    1.61   1.39

The impact of inlining depends on which C compiler is used and on the targeted architecture. For instance, inlining every node of Xoodoo speeds it up by a factor of 1.61 on general purpose x86 registers when compiling with gcc, but slows it down by a factor of 0.98 on AVX2 registers when compiling with Clang. Overall, inlining every node is generally beneficial for performance, providing speedups of up to 1.89 (Ascon on general purpose x86 registers with gcc), but can sometimes be detrimental and reduce performance by a few percent.

On AVX2 registers, when compiling with Clang, inlining tends to be slightly detrimental, as can be seen in the lower half of the Clang/AVX2 column. Those ciphers are the ones that benefit the least from our scheduling optimization, as shown by the table in the mslicing section of the scheduling post. One of the reasons for this performance regression is that fully inlined AVX code does not use the μop cache (DSB), but falls back to the MITE. On Gift compiled with Clang, for instance, when all nodes are inlined, almost no μops are issued by the DSB, while when no node is inlined, 85% of the μops come from the DSB. This translates directly into a reduced instructions-per-cycle count: the inlined version runs at 2.58 instructions per cycle, while the non-inlined version runs at 3.25. This amounts to a mere 0.93 slowdown, however, because the fully inlined code still contains 15% fewer instructions. General purpose registers are less impacted than AVX because their instructions are smaller, and the MITE can thus decode more of them each cycle.

The gains of inlining are not only explained by the scheduling that follows. For instance, scheduling improves the performance of Gift, Clyde, Xoodoo and Chacha20 on general purpose registers by merely x1.01, x1.02, x1.03 and x1.05, yet fully inlining those ciphers speeds them up by x1.69, x1.16, x1.25 and x1.25 (with Clang). In those cases, both Clang and gcc chose not to be too aggressive with inlining, probably in order not to increase code size too much, but this came at the expense of performance.

Bitslicing, however, definitely confuses the inlining heuristics of C compilers. A bitsliced node compiles to a C function taking hundreds of variables as inputs and outputs. For instance, the round function of DES takes 120 arguments once bitsliced. Calling such a function requires the caller to push hundreds of variables onto the stack while the callee has to go through the stack to retrieve them, leading to a significant execution overhead as well as a growth in code size. Similarly, a permutation may be compiled into a function that takes hundreds of arguments and just does assignments, while once inlined, it is virtually optimized away by copy propagation. C compilers avoid inlining such functions because their code is quite large, missing the fact that they would be optimized away.
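
As a deliberately tiny illustration (a hypothetical 2-bit permutation; real bitsliced permutations involve hundreds of variables), such a node compiles to a function that merely moves its inputs to its outputs:

void perm__ (/*inputs*/ DATATYPE a__, DATATYPE b__,
             /*outputs*/ DATATYPE* r0__, DATATYPE* r1__) {
  // Pure renaming: no computation at all
  *r0__ = b__;
  *r1__ = a__;
}

Left as a call, this function still costs argument passing and memory traffic; once inlined, copy propagation removes it entirely.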

The following table shows the performance impact of inlining in bitsliced ciphers:

Inlining speedup (bitslicing):

                  clang          gcc
Cipher          x86    AVX2    x86    AVX2
ACE             1.16   1.54    1.75   3.57
AES             1.28   1.64    1.27   1.43
Ascon           1.20   2.50    1.45   2.56
Clyde           1.08   1.79    1.02   1.35
DES             1.41   1.96    1.30   1.72
Gift            3.70   5.88    3.45   8.33
Gimli           1.41   2.00    1.79   3.03
Photon          1.18   1.75    2.00   2.44
Present         1.23   1.08    1.05   0.97
Pyjamask        5.26   1.20    8.33   8.33
Rectangle       1.72   2.33    1.59   2.44
Skinny          2.63   4.00    2.78   4.76
Spongent        1.52   3.12    1.49   3.03
Subterranean    2.00   3.03    1.96   2.86
Xoodoo          1.37   2.33    1.47   2.08

Inlining improves performance in almost all cases (the only regression being Present on AVX2 with gcc), reaching an impressive speedup of more than 8 for Pyjamask on general purpose registers with gcc. While some of those improvements are explained by the scheduling opportunities enabled by inlining, most of them are due to the overhead saved by not calling functions, and to copy propagation being able to remove unnecessary assignments. One takeaway from those benchmarks is that C compilers’ inlining heuristics are not suited for Usuba-generated bitsliced code.

Code generation

Compiling Usuba0 to C is very straightforward. Nodes are translated to function definitions, and node calls to function calls. All expressions in Usuba0 have a natural equivalent in C, with the exception of Usuba’s Shuffle, which can only be compiled for SIMD architectures and uses the available intrinsics (_mm256_shuffle_epi32 or _mm256_shuffle_epi8 on AVX2, for instance).

The generated C code relies on macros rather than inline operators. For instance, compiling the following Usuba0 node to C:

node sbox (i0:u32, i1:u32)
     returns (r0:u32, r1:u32)
vars t1 : u32
let
    t1 = ~i0;
    r0 = t1 & i1;
    r1 = t1 | i1;
tel

produces

void sbox__ (/*inputs*/ DATATYPE i0__,DATATYPE i1__, 
             /*outputs*/ DATATYPE* r0__,DATATYPE* r1__) {
  // Variables declaration
  DATATYPE t1__;

  // Instructions (body)
  t1__ = NOT(i0__);
  *r0__ = AND(t1__,i1__);
  *r1__ = OR(t1__,i1__);
}

Where DATATYPE is unsigned int on 32-bit registers, __m128i on SSE, __m256i on AVX2, etc., and NOT, AND and OR are defined to use the architecture’s instructions. Using macros allows us to change the architecture of the generated code by simply changing a header. The new header must provide the same instructions, which means for instance that a code compiled for AVX2 and using Shuffles cannot be run on general purpose registers by simply changing the header, since no shuffle would be available.
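
As a minimal sketch of what such headers could look like (the macro names are the ones used above; the exact contents of Usuba’s headers may differ):

/* Hypothetical AVX2 header */
#include <immintrin.h>
#define DATATYPE __m256i
#define NOT(a)   _mm256_xor_si256(a, _mm256_set1_epi32(-1))
#define AND(a,b) _mm256_and_si256(a, b)
#define OR(a,b)  _mm256_or_si256(a, b)

/* Hypothetical header for 32-bit general purpose registers */
#define DATATYPE unsigned int
#define NOT(a)   (~(a))
#define AND(a,b) ((a) & (b))
#define OR(a,b)  ((a) | (b))

Compiling sbox__ against the first header yields AVX2 code; against the second, plain 32-bit code, without touching the generated C.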

Usubac performs no architecture-specific optimizations, beyond its scheduling and interleaving which target superscalar architectures. Put otherwise, we do not compile a code any differently for SSE or AVX2, except that Usubac’s automatic benchmarking may select different optimizations for each architecture. For general purpose registers, SSE, AVX and AVX2, the instructions used for cryptographic primitives are fairly similar, and we felt that there was no need to optimize differently for each architecture. In most cases where architecture-specific instructions can be used to speed up computation, Clang and gcc are able to detect it and perform the optimization for us.

The AVX512 instruction set is however much richer, and opens the door to more optimizations. For instance, it offers the instruction vpternlogd, which can compute any Boolean function of 3 inputs. It can be useful to speed up S-boxes in particular. For instance,

t1 = a ^ b;
t2 = t1 & c;

Can be written with a single vpternlogd as

t2 = _mm512_ternarylogic_epi64(a,b,c,0b00101000);

Thus requiring one instruction rather than two. Clang is able to perform this optimization automatically in some cases, but we leave it for future work to evaluate whether Usubac could improve on Clang in that respect.
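
For reference, the immediate encodes the truth table of the desired function: bit ((a << 2) | (b << 1) | c) of the immediate holds f(a, b, c). A quick way to derive it (illustrative snippet):

/* Compute the vpternlogd immediate for f(a,b,c) = (a ^ b) & c. */
unsigned imm = 0;
for (int a = 0; a <= 1; a++)
  for (int b = 0; b <= 1; b++)
    for (int c = 0; c <= 1; c++)
      imm |= (unsigned)((a ^ b) & c) << ((a << 2) | (b << 1) | c);
/* imm == 0b00101000: the constant used above. */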


References

[1] J. Ragan-Kelley et al., Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, PLDI, 2013.