On constant time division

Writing constant time code is hard. We all know that. But I'm always amazed again on how difficult it is. In preparation for making NSS more constant time I looked into certain CPU instructions that are known to be not constant time.
So I wrote a little thing to measure the time (CPU cycles) needed for division.

div     rcx                     ; eax is now a/b

The CPU I'm using in this post is an Intel i7-4790 (haswell). Running the same measurements on other processors architectures will most likely yield different results. I'm describing two experiments I did. First, I look at bits in the divisor and how they influence the timing of the division. And then I look at the dividend and its influence.

The Divisor

There are two interesting things we can look at here. The very small divisors as well as the pattern that we get in larger divisors.

< 129

This graph plots the number of cycles needed for the division of numbers < 130. It's easy to see that the number of cycles needed is in general around 300. However, there are a couple of "outliers" that need significantly less.

128    123
 64    135
 32    132
 16    132
  8    123
  4    123
  2    123
  1    126

You might have spotted the pattern already. The CPU probably doesn't want to do this expensive division and simply shifts the dividend to get the result. This trend continues later on but is more difficult to observe. (And there's a more interesting pattern there.)

The larger pattern

The graph shows a striking pattern and makes it obvious that division is not constant-time. Let's try to understand what's happening here.

The first part of the pattern can be explained by changing bit-lengths in the divisor. At 8191 and 16383 for example a streak of fast division ends. This is probably because the 8192 and 16384 require an additional bit each.

However, I'm clueless on how to explain the rest of the pattern.

(I'll update this post when I figured out what's going on.)