No doubt some of us would be interested in ways of identifying code that is problematic for the CPU's branch-predictor. The Valgrind tool `cachegrind` has a branch-predictor simulator, enabled by using the `--branch-sim=yes` flag. Running it over the examples in this question, with the number of outer loops reduced to 10000 and compiled with `g++`, gives these results: **Sorted:** ==32551== Branches: 656,645,130 ( 656,609,208 cond + 35,922 ind) ==32551== Mispredicts: 169,556 ( 169,095 cond + 461 ind) ==32551== Mispred rate: 0.0% ( 0.0% + 1.2% ) **Unsorted:** ==32555== Branches: 655,996,082 ( 655,960,160 cond + 35,922 ind) ==32555== Mispredicts: 164,073,152 ( 164,072,692 cond + 460 ind) ==32555== Mispred rate: 25.0% ( 25.0% + 1.2% ) Drilling down into the line-by-line output produced by `cg_annotate` we see for the loop in question: **Sorted:** Bc Bcm Bi Bim 10,001 4 0 0 for (unsigned i = 0; i < 10000; ++i) . . . . { . . . . // primary loop 327,690,000 10,016 0 0 for (unsigned c = 0; c < arraySize; ++c) . . . . { 327,680,000 10,006 0 0 if (data[c] >= 128) 0 0 0 0 sum += data[c]; . . . . } . . . . } **Unsorted:** Bc Bcm Bi Bim 10,001 4 0 0 for (unsigned i = 0; i < 10000; ++i) . . . . { . . . . // primary loop 327,690,000 10,038 0 0 for (unsigned c = 0; c < arraySize; ++c) . . . . { 327,680,000 164,050,007 0 0 if (data[c] >= 128) 0 0 0 0 sum += data[c]; . . . . } . . . . } This lets you easily identify the problematic line - in the unsorted version the `if (data[c] >= 128)` line is causing 164,050,007 mispredicted conditional branches (`Bcm`) under cachegrind's branch-predictor model, whereas it's only causing 10,006 in the sorted version. ---------- Alternatively, on Linux you can use the performance counters subsystem to accomplish the same task, but with native performance using CPU counters. perf stat ./sumtest_sorted **Sorted:** Performance counter stats for './sumtest_sorted': 11808.095776 task-clock # 0.998 CPUs utilized 1,062 context-switches # 0.090 K/sec 14 CPU-migrations # 0.001 K/sec 337 page-faults # 0.029 K/sec 26,487,882,764 cycles # 2.243 GHz 41,025,654,322 instructions # 1.55 insns per cycle 6,558,871,379 branches # 555.455 M/sec 567,204 branch-misses # 0.01% of all branches 11.827228330 seconds time elapsed **Unsorted:** Performance counter stats for './sumtest_unsorted': 28877.954344 task-clock # 0.998 CPUs utilized 2,584 context-switches # 0.089 K/sec 18 CPU-migrations # 0.001 K/sec 335 page-faults # 0.012 K/sec 65,076,127,595 cycles # 2.253 GHz 41,032,528,741 instructions # 0.63 insns per cycle 6,560,579,013 branches # 227.183 M/sec 1,646,394,749 branch-misses # 25.10% of all branches 28.935500947 seconds time elapsed It can also do source code annotation with dissassembly. perf record -e branch-misses ./sumtest_unsorted perf annotate -d sumtest_unsorted <div></div> Percent | Source code & Disassembly of sumtest_unsorted ------------------------------------------------ ... : sum += data[c]; 0.00 : 400a1a: mov -0x14(%rbp),%eax 39.97 : 400a1d: mov %eax,%eax 5.31 : 400a1f: mov -0x20040(%rbp,%rax,4),%eax 4.60 : 400a26: cltq 0.00 : 400a28: add %rax,-0x30(%rbp) ... See [the performance tutorial][1] for more details. [1]: https://perf.wiki.kernel.org/index.php/Tutorial