Some future works were already discussed in previous posts:
implementing peephole optimizations for AVX512 (e.g. using
vpternlog) is discussed in the Backend post, and improving the performances on
embedded devices (e.g. by targeting assembly rather than C) is
discussed in the Tornado post. We propose in this post several additional
ideas to improve Usuba.
mslicing on general purpose registers
In Section Monomorphization of the Evaluation post, we mentioned that the lack of
x86-64 instruction to shift 4 16-bit works in a single 64-bit register
prevents us from parallelizing Rectangle’s vsliced implementation on
general purpose registers.
In practice though, the
_pdep_u64 intrinsic can be used to
interleave 4 independent 64-bit inputs of Rectangle in 4 64-bit
registers. This instruction takes a register
a and an integer
as parameters, and dispatches the bits of
a to a new registers
following the pattern specified in
mask. For instance, using
_pdep_u64 with the mask
0x8888888888888888 would dispatch a 16-bit
input of Rectangle into a 64-bit register as follows:
A second input of Rectangle could be dispatched into the next bits
using the mask
Repeating the process with two more inputs using the masks
0x1111111111111111, and then combining the
results (using a simple
or) would produce a 64-bit register
containing 4 interleaved 16-bit inputs of Rectangle. Since Rectangle’s
whole input is 64-bit, this process needs to be repeated 4
times. Then, regular shift and rotate instructions can be used to
independently rotate each input. For instance, a left-rotation by 2
can now be done by a simple left-rotation by 8:
In essence, this technique relies on a data-layout inspired by
hslicing (bits of the input are split along the horizontal direction
of the registers), while offering access to a limit set of vslice
instructions: bitwise instructions and shifts/rotates can be used, but
arithmetic instructions cannot. We leave for future work the
investigation of how this mode of slicing could be incorporated within
Usuba in order to increase performance when vector extensions are not
Recall the main computing nodes of Chacha20:
node QR (a:u32, b:u32, c:u32, d:u32)
returns (aR:u32, bR:u32, cR:u32, dR:u32)
a := a + b;
d := (d ^ a) <<< 16;
c := c + d;
b := (b ^ c) <<< 12;
aR = a + b;
dR = (d ^ aR) <<< 8;
cR = c + dR;
bR = (b ^ cR) <<< 7;
node DR (state:u32x16) returns (stateR:u32x16)
state[0,4,8,12] := QR(state[0,4,8,12]);
state[1,5,9,13] := QR(state[1,5,9,13]);
state[2,6,10,14] := QR(state[2,6,10,14]);
state[3,7,11,15] := QR(state[3,7,11,15]);
stateR[0,5,10,15] = QR(state[0,5,10,15]);
stateR[1,6,11,12] = QR(state[1,6,11,12]);
stateR[2,7,8,13] = QR(state[2,7,8,13]);
stateR[3,4,9,14] = QR(state[3,4,9,14]);
Within the node
DR, the node
QR is called 4 times on independent
inputs, and then 4 times again on independent inputs. Visually, we can
represent the first 4 calls to
In vslicing, each 32-bit work of the state would be mapped to a SIMD
register, and those registers would be filled with independent
inputs. For instance, on SSE, 4 states would be processed in parallel:
The last 4 calls to
QR in the second half of
DR would then be:
Using pure vslicing may be sub-optimal because 16 registers are
required, which leaves no register for temporary variables on SSE or
AVX. In practice, at least one register of the state will have to be
spilled to memory.
Rather than parallelizing Chacha20 by filling SSE or AVX register with
independent inputs, it is possible to do the parallelization of a
single instance on SSE registers (respectively, two instances on
AVX2). Since the first 4 calls to
QR operate on independent values,
we can pack the 16 32-bit words of the state within 4 SSE (or AVX)
registers, and a single call to
QR will compute it four times on a
Three rotations are then needed to reorganize the data for the final 4
On SSE or AVX registers, those rotations would be done using Shuffle
instructions. Finally, a single call to
QR computes the last 4
Compared to vslicing, this technique requires less independent inputs:
on SSE (resp. AVX), it requires only 1 (resp. 2) independent inputs to
reach the maximal throughput, while in vslicing, it requires 4
(resp. 8). Also, the latency is divided by 4 compared to vslicing,
which will speed up applications that require encrypting less than 4
The downside is that it requires additional shuffles to reorganize the
data withing each round and at the end of each round, which incurs an
overhead compared to vslicing. Furthermore, recall from the
Scheduling post that Chacha20’s quarter round (
QR) is bound by
data-dependencies. In vslicing, Usuba’s scheduling algorithm is able
to interleave 4 quarter rounds, thus removing any stalls related to
data dependencies. This is not possible with this mixed sliced form,
since only two calls to
QR remain, both of which cannot be computed
However, a lower register pressure allows such an implementation of
Chacha20 to be implemented without any spilling, which improves
performance. Furthermore, because this implementation only uses 5
registers (4 for the state + 1 temporary), it could be interleaved 3
times without introducing any spilling. This interleaving would remove
the data hazards from
QR. The performances of this spill-free
3-interleaved implementation (despite containing additional shuffles)
would need to be compared to the classical vsliced one. As shown in
the fastest implementation of Gimli on AVX2 uses this technique, and
is faster than Usuba’s vsliced implementation.
We can understand this technique as an intermediary stage between
vertical and horizontal slicing. One way to incorpore it to Usuba
would be to represent atomic types with a word size m and a
vector size V (
umV), instead of
a direction D and a word size m
uDm). The word size would correspond
to the size of the packed elements within SIMD registers, and the
vector size would represent how many elements of a given input are
packed within the same input. For instance, Chacha20’s new
implementation would manipulate a state of 4 values of type
Using this representation, a vslice type
uVm would correspond to
um1, while an hslice type
uHm would correspond to
u1m, and a bitslice type
uD1 would unsurprisingly correspond to
umV would be valid only if the
targeted architecture offered SIMD vectors able to contain V words
of size m. Arithmetic instructions would be possible between two
umV values (provided that the
architecture offers instruction to perform m-bit arithmetics) and
shuffles would be allowed if the architectures offers instructions to
shuffle m-bit words.
Mode of operation
One of the commonly used mode of operation is counter mode (CTR). In
this mode (as illustrated below), a counter is encrypted by the
primitive (rather than encrypting the plaintext directly), and the
output of the primitive is
xored with a block of plaintext to
produce the ciphertext. The counter is incremented by one for each
subsequent block of the plaintext to encrypt.
For 256 consecutives block to encrypt, only the last byte of the
counter changes, and the other remain contant. Hongjun Wu and later
Bernstein and Schwabe  observed that the last byte of AES’s input
only impacts 4 bytes during the first round:
The results of the first round of
ShiftRows on the first 15 bytes, as well as
MixColumns on 12 bytes
can thus be cached and reused for 256 encryptions. Concretely, the
first round of AES only require computing
on a single byte (
ShiftRows does not require computation), and
MixColumns on 4 bytes instead of 16. Similarly, 12 of the 16
SubBytes of the second round can be
cached. Additionally, Park and Lee  showed that some values can be
pre-computed to speed up computation even further (while keeping the
constant-time property of sliced implementations). Only 4 bytes of the
output of the first round depends on the last byte, which can take 256
values. Thus, it is possible to precompte all 256 possible values for
those 4 bytes (which can be stored a 1KB table (4 * 256 = 1KB)), and
reuse them until incrementing the counter past the 6th
least significant byte of the counter (b10 on the
figure above, which is an input of the same
b15), that is, once every 1 trillion block.
To integrate such optimizations in Usuba, we will have to broaden the
scope of Usuba to support modes of operations: Usuba currently does
not offer support statful computations and forces loops’ bounds to be
known at compile-time. We would then need to extend the optimizer to
exploit the specificities of Usuba’s new programming model.
Systematic Evaluation on Diverse Vectorized Architectures
We focused our evaluation of vectorization to Intel architectures,
because of their wide availability. Other architectures, such as
AltiVec on PowerPC and Neon on ARM, offer similar instructions as SSE
and AVX. In particular, both AltiVec and Neon provide 128-bit
registers supporting 8/16/32/64-bit arithmetic and shift instructions
(used in vslicing), shuffles (used in hslicing) and 128-bit bitwise
instructions (used in all slicing types). As a proof-of-concept, we
thus implemented AltiVec and Neon backends in Usubac.
The figure below shows the speedup offered by vector extensions on a
bitslice DES (including the transposition) on several architectures:
SKL is our Intel Xeon W-2155 (with 64-bit general purpose registers,
128-bit SSE, 256-bit AVX2, 512-bit AVX512), PowerPC is a PPC 970MP
(with 64-bit general purpose registers and 128-bit AltiVec), and ARMv7
is an ARMv7 Raspberry Pi3 (with 32-bit general purpose registers and
128-bit Neon). The performances have been normalized on each
architecture to evaluate the speedups of vector extensions rather than
the raw performances.
The speedups offered by SIMD extensions vary from one architecture to
the other. PowerPC’s AltiVec offer 32-registers and 3-operand
instructions, and thus expectedly perform better than Intel’s SSE
comparatively to a 64-bit baseline. ARM’s Neon extensions only offer
16 registers on the other hand, resulting in a similar speedup as
Intel’s AVX2 (note that ARMv7 has 32-bit registers whereas Skylake has
This benchmark only deals with bitslicing, and thus does not exploit
arithmetic instructions, shuffles, or other architecture-specific SIMD
instructions. We leave to future work to extend this preliminary
evaluation to other ciphers and slicing modes.
Bugs in implementations of cryptographic primitives can have
devastating consequences [9,10,11,12]. To alleviate the risk of such
human errors, several recent projects aiming at generating
cryptographic codes have adopted approaches ensuring the functionnal
correctness of the code they produce [6,7,8].
The figure below illustrates how end-to-end functional correctness
could be added to the Usubac compiler.
We propose to prove the correctness of the front-end by formaly
proving that each each normalization pass preserves the semantics of
the input program. Several work already showed how to prove the
correctness of normalization of dataflow languages [3,13,14].
For the backend (optimizations), using translation validation 
would certainly be more practical. The principle of translation
validation is to check wether the program obtained after optimization
has the same behavior as the program before optimization. This does
not guarantee that the compiler is correct for all possible input
programs, but it does guarantee that the compilation of a given
program is correct. Translation validation leads to a more easily
extensible compiler than proving the correctness of each pass, since
adding a new pass of optimization does not impact the verification and
does not require a proof of correctness for this pass.
We implemented a proof-of-concept translation validation for the
optimizations of the Usubac compiler. The principle is simple: we
extract two SAT formulas from the pipeline, one before the
optimizations, and one after. We then feed them to the Boolector SMT
solver , which checks wether they are equivalent (or, more
precisely, wether an input exists such that feeding it to both
formulas produces two distinct outputs). Extraction of a SAT formula
from an Usuba0 program is straightforward thanks to our dataflow
semantics: an Usuba0 program is simply a set of equations.
We used this translation validation approach to verify the correctness
of our optimizations for Rectangle (bitslice and vslice), DES
(bitslice), AES (bitslice), Chacha20 (vslice) and Serpent (vslice). In
all cases, Boolector can able to confirm the equivalence of the
programs pre and post-optimizations in less than a second, thus
showing that this approach is practical (i.e. it can verify
equivalence of Usuba0 programs in reasonable time) and requires little
investement to be implemented.
For additional confidence in the translation validation approach, a
certified SAT solver, such as SMTCoq , could be used.
Finally, translation from Usuba0 to imperative code can be formally
verified using existing techniques of the dataflow community . We
could either target CompCert [15,16], a certified C compiler that
produces x86 assembly code. Or, if we decide to generate assembly code
ourselves (to improve performances, as motivated in the Tornado
post), we can
generate Jasmin assembly  and benefit from its mechanized
semantics. In both cases (CompCert and Jasmin), some work might be
required to support SIMD instruction sets.
A lot of implementations of cryptographic algorithms on graphics
processing units (GPU) have been proposed in the last 15 years
[18,19,20,21,22], including some implementation of bitslice DES
[27,28,29] and bitslice AES [23,24,25,26,27]. Throughputs of more than
1 Tbits per seconds are reported for AES in . The applications of
such high-throughput implementations include, among others, password
cracking [30,31], random number generation [32,33] or disk encryption
Both bitslice [25,26,27] and mslice [23,24] implementations of AES
have been demonstrated on GPUs. Additionally, Nishikawa et al. 
showed that GPUs offer a large design space: computation is broken
down into many threads, registers are a shared resource between
threads, a limited amount of thread can be executed
simultaneously… Nishikawa et al.  thus proposed an evaluation
of the impact of some of those parameters on an msliced AES. Usuba
could provide an oportunity to systematically explore this design
space across a wide range of ciphers.
 D. J. Bernstein, P. Schwabe, New AES Software Speed Records, INDOCRYPT, 2008.
 J. H. Park, D. H. Lee, FACE: Fast AES CTR mode Encryption Techniques based on the Reuse of Repetitive Data, TCHES, 2018.
 T. Bourke et al., A Formally Verified Compiler for Lustre, PLDI, 2017.
 A. Pnueli et al., Translation validation, TACAS, 1998.
 A. Niemetz and al., Boolector 2.0 system description, Journal on Satisfiability, Boolean Modeling and Computation, 2015.
 J. K. Zinzindohoué et al., HACL*: A verified modern cryptographic library, ACM Conference on Computer and Communications Security, 2017.
 J. B. Almeida et al., Jasmin: High-Assurance and High-Speed Cryptography, CCS, 17.
 B. Bond et al., Vale: Verifying High-Performance Cryptographic Assembly Code, USENIX Security Symposium, 2017.
 B. B. Brumley et al., Practical realisation and elimination of an ECC-related software bug attack, CT-RSA, 2012.
 S. Gueron, V. Krasnov, The fragility of AES-GCM authentication algorithm, 2013.
 H. Boeck, Wrong results with Poly1305 functions, 2016.
 D. Benjamin, poly1305-x86.pl produces incorrect output, 2016.
 C. Auger et al., A Formalization and Proof of a Modular Lustre Compiler, 2012.
 C. Auger, Compilation certifiée de SCADE/LUSTRE, 2013.
 S. Blazy et al., Formal verification of a C compiler front-end, FM, 2006.
 X. Leroy, Formal verification of a realistic compiler, Communications of the ACM, 2009.
 M. Armand et al., A modular integration of SAT/SMT solvers to Coq through proof witnesses, CPP, 2011.
 S. A. Manavski et al., CUDA compatible GPU as an efficient hardware accelerator for AES cryptography, ICSPC, 2007.
 R. Szerwinski, T. Gïneysu, Exploiting the Power of GPUs for Asymmetric Cryptography, CHES, 2008.
 D. A. Osvik et al., Fast Software AES Encryption, FSE, 2010.
 O. Harrison, J. Waldron, Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware, AFRICACRYPT, 2009.
 D. Cook et al., CryptoGraphics: Secret key cryptography using graphics cards, CT-RSA, 2005.
 R. K. Kim et al., Bitsliced High-Performance AES-ECB on GPUs, The New Codebreakers, 2016.
 N. Nishikawa et al., Implementation of Bitsliced AES Encryption on CUDA-Enabled GPU, NSS, 2017.
 O. Hajihassani et al., Fast AES Implementation: A High-throughput Bitsliced Approach, IEEE Transactions on Parallel and Distributed Systems, 2019.
 B. Peccerillo et al., Parallel bitsliced AES through PHAST: a single-source high-performance library for multi-cores and GPUs, Journal of Cryptographic Engineering, 2019.
 J. Yang, J. Goodman, Symmetric Key Cryptography on Modern Graphics Hardware, ASIACRYPT, 2007.
 N. Dennier et al., Improved Software Implementation of DES Using CUDA and OpenCL, 2011.
 G. Agosta et al., Record setting software implementation of DES using CUDA, 2010.
 L. Sprengers, GPU-based password cracking, 2011.
 M. Bakker, R. Van Der Jagt, GPU-based password cracking, 2010.
 S.K. Monfared et al., High-performance Cryptographically Secure Pseudo-random Number Generation via Bitslicing, 2019.
 W.K. Lee et al., Fast implementation of block ciphers and PRNGs in Maxwell GPU architecture, Cluster Computing, 2016.
 G. Agosta et al., Fast disk encryption through GPGPU acceleration, PDCAT, 2009.
 W. Sun et al., GPUstore Harnessing GPU Computing for Storage Systems in the OS Kernel, SYSTOR, 2012.