## Efficient bit permutation using delta swaps

A common operation in cryptographic block cipher algorithms is to shuffle the input bits in some specified way. While such bit permutations are easy to do in hardware, they tend to be difficult to implement efficiently in software. The Initial Permutation (IP) in the DES block cipher is a good case in point. It is defined as the following rearrangement of the 64 input bits:

A naive implementation would loop through each of the 64 input bits one by one and place them at the correct position in the output. Needless to say, this would be rather slow. A pre-computed lookup table for a 64-bit permutation would be too huge to even consider, while multiple piece-wise lookup tables would most likely not offer any performance benefit over a naive loop. However, there is a clever solution using only 30 bitwise operations (15 XORs, 10 shifts, 5 ANDs). The key idea is to repeatedly swap non-overlapping groups of bits with an operation called a delta swap:

In a nutshell, the delta swap operation can simultaneously swap multiple bit pairs where the two bits in each pair are separated by the same distance from each other. Let’s examine how this works. (Notation: $\oplus$ means XOR, $\ll$ means bitwise shift left, $\gg$ means bitwise shift right)

We start with the input block represented as a sequence of 64 bits:

$a=a_{1}a_{2}...a_{64}$

Suppose we want to swap the bit at position $i$ with the bit at position $j$, where $j > i$. The distance between the two bits is

$\delta=j-i$

Now notice that if we construct a temporary value $b=b_{1}b_{2}...b_{64}$ such that

$b_{n} = \left\{ \begin{array}{lr}a_{i} \oplus a_{j} & : n \in \{i,j\} \\ 0 & : n \notin \{i,j\} \end{array} \right.$

then $a \oplus b$ swaps the bits $a_{i}$ and $a_{j}$ as required, because of the following properties of XOR:

$a_{i} \oplus (a_{i} \oplus a_{j})=a_{j} \\ a_{j} \oplus (a_{i} \oplus a_{j})=a_{i}$

$a_{i} \oplus 0=a_{i}$

This is exactly how a delta swap works. The temporary value $b$ is constructed by shifting the input $a$ to the right $\delta$ bits, then XORing with the unshifted version of $a$. This places $a_{i} \oplus a_{j}$ in $b_{j}$. The other bits of $b$ are then zeroed out by ANDing with a mask that only has the bit at position $j$ turned on. Finally, the bit swap in $a$ is accomplished by XORing $a$ with an unshifted $b$ and then with $b$ shifted left $\delta$ bits. This is illustrated diagrammatically below.

In order to swap multiple bit pairs simultaneously, we need only turn on the appropriate bits in the AND mask. Specifically, for each 1 bit in the mask, the corresponding bit in the input will be swapped with the input bit that is $\delta$ positions to its left.

Now that we’re armed with the delta swap technique, let’s return to the original problem of implementing the DES Initial Permutation efficiently. Notice that if we arrange the 64-bit input block in the DES algorithm as an 8 x 8 bit matrix, then the Initial Permutation can be visualized as a flip of the matrix about its diagonal (the one that extends from bit 57 to bit 08), followed by the shuffling of its rows. We can accomplish this with a sequence of five delta swaps. The first three flip the matrix about its diagonal, and the last two shuffle the rows.

The DES Final Permutation (FP), being the inverse of the Initial Permutation, can be implemented by performing the same delta swaps in reverse order. The reason why the delta swap technique is so useful is that it works for any permutation of bits. In fact, it can be shown that for 64-bit permutations the required number of delta swaps will never exceed 11.