Our processors benefit from “SIMD” instructions. These instructions can operate on several values at once, thus greatly accelerating some algorithms. Earlier, I reported that you can multiply the speed of common (fast) random number generators such as PCG and xorshift128+ by a factor of three or four by vectorizing them using SIMD instructions.

A reader challenged me: is this actually useful in practice?

There are problems where the generation of random number is critical to the performance. That is the case in many simulations, for example. The simplest and best known one is probably the random shuffling of arrays. The standard algorithm is quite simple:

for (i = size; i > 1; i--) { var j = random_number(i); switch_values(array, i, j); }

If you are interested, O’Neill wrote a whole blog post of this specific problem.

So can I accelerate the shuffling of an array using SIMD instructions?

So I threw together a vectorized shuffle and a regular (scalar) shuffle, both of them using O’Neill’s PCG32. The net result? I almost double the speed using SIMD instructions when the array is in cache:

SIMD shuffle | 3.5 cycles per entry |

scalar shuffle | 6.6 cycles per entry |

My code is available. I do not do anything sophisticated so I expect it is possible to do a lot better. My sole goal was to demonstrate that SIMD random number generators are practical.

I was optimizing one of my Monte Carlo simulation programs not so long ago. The program in question was used for simulation of movement of large number of particles under different conditions, but always under assumption that no collisions can occur (large mean free path). Each particle’s movement was therefore totally independent from movement of any other particle. The simulation was already executed in several cores, but I also added another optimization where I used AVX/AVX2 intrinsics. Previously I was simulating the movement of one particle at a time in every thread, but after the optimization I started simulating the movement of four particles simultaneously in every thread. On each step I had to do interpolations, compute results of various equations and so on and optimisation with intrinsics was totally straightforward and it gave a nice (and expected) performance boost. On every step I also had to generate random numbers and I made myself a simple vectorized implementation of xoroshiro128+ and this sped up the simulation nicely, because I could generate 4 pseudorandom numbers at once, i.e. one for every particle. So, to conclude – vectorizing the PRNG turned out to be a good decision in my case.

I’d use the RDRAND machine instruction for everything except Intel messed up their early hardware implementation of it and I get an illegal instruction exception if I try to use it on my laptop.

Actually my main use is evolutionary algorithms and there was a paper that showed that pseudo-random number generator are just as good as “true” random. Also there are on-line sources of “quantum noise” random numbers.

The main reason pseudo-random is okay apparently is that the random numbers just need to be non-correlated with the problem and that is a low hurdle.