Improving AES-GCM Performance in NSS

AES-GCM is a NIST standardised authenticated encryption algorithm (FIPS 800-38D). Since its standardisation in 2008 its usage increased to a point where it is the prevalent encryption used with TLS. With 85% it is by far the most widely used cipher.

Firefox 53 TLS cipher telemetry

Unfortunately the AES-GCM implementation used in Firefox (provided by NSS) does not take advantage of full hardware acceleration; it uses a slower software-only implementation on Mac, Linux 32-bit, or any device that doesn't have the AVX, PCLMUL, and AES-NI hardware instructions. Looking at these numbers about only 30% of Firefox users get full hardware acceleration.

Before jumping on the obvious issue of Firefox’s AES-GCM code being comparatively slow – as well as not being resistant to side channel analysis – I wanted to see what the actual impact on Firefox users is. Downloading a file on a mid-2015 MacBook Pro Retina with Firefox incurs CPU usage of 50% with 17% of Firefox's CPU usage is in ssl3_AESGCM. In comparison, Chrome only creates 15% CPU usage with the same download. On a Windows laptop with an AMD C-70 (no AES-NI) Firefox CPU usage is 60% and the download speed is capped at 3.5MB/s. Chrome on the same machine needs 50% of the CPU and downloads with 10MB/s. This doesn't seem to be only an academic issue: Particularly for battery-operated devices, the energy consumption difference would be noticeable.

Improving GCM performance

Speeding up the GCM multiplication function is the first obvious step to improve AES-GCM performance. A bug was opened on integration of the original AES-GCM code to provide an alternative to the textbook implementation of gcm_HashMult. This code is not only slow but has timing side channels. Here an excerpt from the binary multiplication algorithm.

for (ib = 1; ib < b_used; ib++) {
    b_i = *pb++;

    /* Inner product:  Digits of a */
    if (b_i)
        s_bmul_d_add(MP_DIGITS(a), a_used, b_i, MP_DIGITS(c) + ib);
        MP_DIGIT(c, ib + a_used) = b_i;

We can improve on two fronts here. First NSS should use PCLMUL to speed up the ghash multiplication if possible. Second if PCLMUL is not available, NSS should use a fast constant-time implementation.

Bug 868948 has several attempts of speeding up the software implementation without introducing timing side-channels. Unfortunately the fastest code that was proposed uses table lookups and is therefore not constant-time. Thanks to Thomas Pornin I found a nicer way to implement the binary multiplication in a way that doesn't leak any timing information and is still faster than any other code I've seen. (Check out Thomas' excellent write-up for details.)

If PCLMUL is available on the CPU, using it is of course the way to go. All modern compilers support intrinsics, which allows us to write "inline assembly" in C that runs on all platforms without having to write assembly. A hardware accelerated implementation of the ghash multiplication can be easily implemented with _mm_clmulepi64_si128.

On Mac and Linux the new 32-bit and 64-bit software ghash functions (faster and constant-time) are used on the respective platforms if PCLMUL or AVX is not available. Since Windows doesn't support 128-bit integers (outside of registers) NSS falls back to the slower 32-bit ghash code (which is still more than 25% faster).

Improving AES performance

To speed up AES NSS requires hardware acceleration on Mac as well as on Linux 32-bit and any machine that doesn't support AVX (or has it disabled). When NSS can't use the specialised AES code it falls back to a table-based implementation that is again not constant-time (in addition to being slow). Implementing AES with intrinsics is a breeze.

m = _mm_xor_si128(m, cx->keySchedule[0]);
for (i = 1; i < cx->Nr; ++i) {
    m = _mm_aesenc_si128(m, cx->keySchedule[i]);
m = _mm_aesenclast_si128(m, cx->keySchedule[cx->Nr]);
_mm_storeu_si128((__m128i *)output, m);

Key expansion is a little bit more involved (for 192 and 256 bit). But is written in about 100 lines as well.

Mac sees the biggest improvement here. Previously, only Windows and 64-bit Linux used AES-NI, and now all desktop x86 and x64 platforms use it when available.

Looking at the numbers

To measure the performance gain of the new AES-GCM code I encrypt a 479MB file with a 128-bit key (the most widely used key size for AES-GCM). Note that these numbers are supposed to show a trend and heavily depend on the used machine and system load at the time.

Linux measurements are done on an Intel Core i7-4790, Windows measurements on a Surface Pro 2 with an Intel Core i5-4300U, and Mac (Core i7-4980HQ).

Linux 64-bit encrypt Linux 32-bit encrypt

On 64-bit Linux performance of any machine without AES, PCLMUL, or AVX instructions AES-GCM 128 is at least twice as fast now. If AES and PCLMUL is available, the new code only needs 33% of the time the old code took.
The speed-up for 32-bit Linux is obviously more significant as it didn't have any hardware accelerated code before. With full hardware acceleration the new code is more than 5 times faster than before. In the worst case, when PCLMUL is not available, the speed-up is still more than 50%.

The story is similar on Windows (though NSS had fast code for 32-bit there before).
Windows 64-bit encrypt Windows 32-bit encrypt

Performance improvements on Mac (64-bit only) range from 60% in the best case to 44% when AES-NI or PCLMUL is not available.

The numbers in Firefox

NSS 3.32 (Firefox 56) will ship with the new AES-GCM code. It will provide significantly reduced CPU usage for most TLS connections or higher download rates. NSS 3.32 is more intelligent in detecting the CPU's capabilities and using hardware acceleration whenever possible. Assuming that all intrinsics and mathematical operations (other than division) are constant-time on the CPU, the new code doesn't have any timing side-channels.

On the basic laptop with the AMD C-70 download rates increased from ~3MB/s to ~6MB/s. (Keep in mind that this is the software only 32-bit implementation.)

The most immediate effect can be seen on Mac. AES_Decrypt as part of a TLS connection with NSS 3.31 used 9% CPU while in NSS 3.32 it uses only 4%.
To see general performance improvements we can look at the case where AVX is not available (which is the case for about 2/3 of the Firefox population). Assuming that at least AES-NI and PCLMUL is supported by the CPU we see the CPU usage drop from 15% to 3% (measured on Linux).