Trains are heavy and have a lot of inertia. So, so they take forever to start up and slow down.
Modern processors are complicated and have long pipelines. SoThis means they take forever to "warm up" and "slow down".
So howHow would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...
Most applications have well-behaved branches. SoTherefore, modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.
So whatWhat can be done?
GCC 4.6.1 with
-O3
or-ftree-vectorize
on x64 is able to generate a conditional move. So, so there is no difference between the sorted and unsorted data - both are fast.(Or somewhat fast: for the already-sorted case,
cmov
can be slower especially if GCC puts it on the critical path instead of justadd
, especially on Intel before Broadwell wherecmov
has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)VC++ 2010 is unable to generate conditional moves for this branch even under
/Ox
.Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So notNot only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).