Revision 7dac4ae2-9d3a-4c1c-91ce-f8c800895fd7

No doubt some of us would be interested in ways of identifying code that is problematic for the CPU's branch-predictor. The Valgrind tool `cachegrind` has a branch-predictor simulator, enabled by using the `--branch-sim=yes` flag. Running it over the examples in this question, with the number of outer loops reduced to 10000 and compiled with `g++`, gives these results:

**Sorted:**

    ==32551== Branches:        656,645,130  (  656,609,208 cond +    35,922 ind)
    ==32551== Mispredicts:         169,556  (      169,095 cond +       461 ind)
    ==32551== Mispred rate:            0.0% (          0.0%     +       1.2%   )

**Unsorted:**

    ==32555== Branches:        655,996,082  (  655,960,160 cond +  35,922 ind)
    ==32555== Mispredicts:     164,073,152  (  164,072,692 cond +     460 ind)
    ==32555== Mispred rate:           25.0% (         25.0%     +     1.2%   )

Drilling down into the line-by-line output produced by `cg_annotate` we see for the loop in question:

**Sorted:**

              Bc    Bcm Bi Bim
          10,001      4  0   0      for (unsigned i = 0; i < 10000; ++i)
               .      .  .   .      {
               .      .  .   .          // primary loop
     327,690,000 10,016  0   0          for (unsigned c = 0; c < arraySize; ++c)
               .      .  .   .          {
     327,680,000 10,006  0   0              if (data[c] >= 128)
               0      0  0   0                  sum += data[c];
               .      .  .   .          }
               .      .  .   .      }

**Unsorted:**

              Bc         Bcm Bi Bim
          10,001           4  0   0      for (unsigned i = 0; i < 10000; ++i)
               .           .  .   .      {
               .           .  .   .          // primary loop
     327,690,000      10,038  0   0          for (unsigned c = 0; c < arraySize; ++c)
               .           .  .   .          {
     327,680,000 164,050,007  0   0              if (data[c] >= 128)
               0           0  0   0                  sum += data[c];
               .           .  .   .          }
               .           .  .   .      }

This lets you easily identify the problematic line - in the unsorted version the `if (data[c] >= 128)` line is causing 164,050,007 mispredicted conditional branches (`Bcm`) under cachegrind's branch-predictor model, whereas it's only causing 10,006 in the sorted version.

----------

Alternatively, on Linux you can use the performance counters subsystem to accomplish the same task, but with native performance using CPU counters.

    perf stat ./sumtest_sorted

**Sorted:**

     Performance counter stats for './sumtest_sorted':

      11808.095776 task-clock                #    0.998 CPUs utilized          
             1,062 context-switches          #    0.090 K/sec                  
                14 CPU-migrations            #    0.001 K/sec                  
               337 page-faults               #    0.029 K/sec                  
    26,487,882,764 cycles                    #    2.243 GHz                    
    41,025,654,322 instructions              #    1.55  insns per cycle        
     6,558,871,379 branches                  #  555.455 M/sec                  
           567,204 branch-misses             #    0.01% of all branches        

      11.827228330 seconds time elapsed

**Unsorted:**

     Performance counter stats for './sumtest_unsorted':

      28877.954344 task-clock                #    0.998 CPUs utilized          
             2,584 context-switches          #    0.089 K/sec                  
                18 CPU-migrations            #    0.001 K/sec                  
               335 page-faults               #    0.012 K/sec                  
    65,076,127,595 cycles                    #    2.253 GHz                    
    41,032,528,741 instructions              #    0.63  insns per cycle        
     6,560,579,013 branches                  #  227.183 M/sec                  
     1,646,394,749 branch-misses             #   25.10% of all branches        

      28.935500947 seconds time elapsed

It can also do source code annotation with dissassembly.

    perf record -e branch-misses ./sumtest_unsorted
    perf annotate -d sumtest_unsorted
<div></div>
     Percent |      Source code & Disassembly of sumtest_unsorted
    ------------------------------------------------
    ...
             :                      sum += data[c];
        0.00 :        400a1a:       mov    -0x14(%rbp),%eax
       39.97 :        400a1d:       mov    %eax,%eax
        5.31 :        400a1f:       mov    -0x20040(%rbp,%rax,4),%eax
        4.60 :        400a26:       cltq   
        0.00 :        400a28:       add    %rax,-0x30(%rbp)
    ...

See [the performance tutorial][1] for more details.

  [1]: https://perf.wiki.kernel.org/index.php/Tutorial