The reason why performance improves drastically when the data is sorted is that the branch prediction penalty is removed, as explained beautifully in Mysticial's answer.
On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):
// Branch - Random
seconds = 8.885
// Branch - Sorted
seconds = 1.528
// Branchless - Random
seconds = 3.716
// Branchless - Sorted
seconds = 3.71
// Branch - Random
seconds = 11.302
// Branch - Sorted
seconds = 1.830
// Branchless - Random
seconds = 2.736
// Branchless - Sorted
seconds = 2.737
In a typical x86
processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages. So we do not have to wait for one instruction to finish to start a new one. This is called pipelining.