Hardware Blowfish and
Mixing Ciphers Compared

Ciphering on Chip: Performance vs. Cost

Terry Ritter

The goal of this exercise is to get some idea of the chip area required to encipher the same block under two different block cipher technologies, and to see what each implies in terms of throughput.

Here we sketch out preliminary chip gate counts and resulting performance for several hardware configurations.

How Much Chip Do We Need?

To get a general idea about how much of a chip these designs would consume, we can look at a few chip examples in current production:

Static-memory devices typically lag dynamic memory sizing by somewhat more than a generation. Current Motorola SRAM chips include a range of 4mb (4 megabit) designs.
Individual customizable gates are larger than static RAM cells. As data points, Altera has a 250k gate FPGA, and the average IBM ASIC is said to have been about 260k gates in 1996.

One might thus conclude that "a gate" is as large as 16 static RAM cells, on average, which seems a little high. But the actual value does not matter much here, because these designs are heavily dominated by RAM anyway.

A Mixing Cipher

The mixing ciphers described here have a fencing layer (a row of substitution tables), Balanced Block Mixing, fencing, mixing and fencing. These are conceptually very simple ciphers.

In a mixing cipher, the tables constitute primary keying and are set up by an external subsystem prior to ciphering.

A Balanced Block Mixer

The Mixing cipher uses Balanced Block Mixer or BBM components which realize:

     a = 3x + 2y (mod 2, mod p)
     b = 2x + 3y (mod 2, mod p).

This can be implemented as:

q := XOR( x, y );
a delayless shift of one bit left;
a conditional r := q or r := XOR( q, p ); then
a := XOR( r, x ) and b := XOR( r, y ).

This is 32 XOR gates per byte-wide mixing, with an assumed 3 level gate delay.

These byte-wide mixings are applied across the 8 bytes to be mixed, 4 pairs at a time, in FFT-like patterns. Note that each of the BBM operations is separate and independent hardware, functioning in parallel with other hardware. After 3 mixing sub-layers, every one of the 8 input bytes is equally represented in each of the 8 output bytes.

64-bit Blowfish Realizations

Consider the 8 byte block cipher Blowfish: Here we have four 8-in 32-out tables, for a total of 4K (BYTES) or 32kb (bits) of RAM. The combining is 32 bits wide, and seems to have 4 operations, for 128 gates, plus 32 bits more in keying, for a total of 160 gates. This is for one round.

Note that the 32k RAM elements in one round dominates the 160-gate logic requirements by a factor of 200+, so we can simply ignore gate area with little error. And the 32k bits of static RAM in one round represents only 1/128th of a current mass-production 4Mb static RAM chip.

There are 16 rounds, and each uses exactly the same tables.

524kb Blowfish

For extreme speed, we can give each round its own tables, for a total of 64KB or 524kb (bits). This is still only 1/8th (about 12 percent) of a 4mb static RAM chip.

Since we can partition at will and pipeline to make our clock rate, the 524kb Blowfish realization has a throughput of 8 bytes per clock, with a latency of 16 look-ups and 16 4-gate-level combinings.

32kb Blowfish

A smaller Blowfish might have a single set of tables, representing 32kb of RAM, re-used for each of 16 rounds. 32kb is 1/128th (under 1 percent) of a 4mb static RAM chip.

The 32kb Blowfish has a throughput of 8 bytes every 16 clocks, with a latency of 16 look-ups and 16 4-gate-level combinings.

64-bit Mixing Realizations

A fully-expressed 8-byte-wide Mixing cipher needs 24 8-in 8-out tables, for a total of 6K (BYTES) or 48kb (bits). We also need 4 BBM's in each mixing sublayer, times 3 sublayers, times 2 overall mixings, for a total of 24 BBM's. At 32 gates each this is 3/4k gates.

Again note that the 48k RAM elements in the cipher dominate the 768-gate logic requirements by a factor of 64, so we can generally ignore gate area. In practice, the BBM design and layout probably would be optimized and not simply constructed from gates, so we can expect this to be tighter than average anyway.

48kb Mixing

For best speed, we simply realize each separate table, for 48kb of RAM. 48kb is about 1/85th (under 2 percent) of a 4mb static RAM chip.

The 48kb Mixing cipher has a throughput of 8 bytes per clock with a latency of 3 look-ups and 6 3-gate-level BBM mixings.

16kb Mixing

The next reasonable mixing system uses a single fencing row and one full mixing layer, and re-uses those. So we need 16kb for the tables, and only one mixing hardware. 16kb is about 1/256th (less that 1/2 of 1 percent) of a 4mb static RAM chip.

The 16kb Mixing cipher has a throughput of 8 bytes per 3 clocks, with a latency of 3 look-ups and 6 3-gate-level BBM mixings.

Comparisons

Compared to 524kb Blowfish, the 48kb Mixing cipher is about 1/10th the size, has an equivalent throughput, and perhaps 1/2 the latency.

Compared to 32kb Blowfish, the 16kb Mixing cipher is about 1/2 the size, has over 5 times the throughput, and perhaps 1/4 the latency.

Terry Ritter, his current address, and his top page.

Last updated: 1997-03-27

Hardware Blowfish and Mixing Ciphers Compared