Mod Removes Wiki by Shog9
Post Made Community Wiki by George Stocker
Dressed the naked link.
Source Link
Peter Mortensen
  • 30.8k
  • 22
  • 106
  • 131

No doubt some of us would be interested in ways of identifying code that is problematic for the CPU's branch-predictor. TheThe Valgrind tool cachegrind has a branch-predictor simulator, enabled by using the --branch-sim=yes flag. RunningRunning it over the examples in this question, with the number of outer loops reduced to 10000 and compiled with g++, gives these results:

See the perf tutorial for more details: https://perf.wiki.kernel.org/index.php/Tutorialthe performance tutorial for more details.

No doubt some of us would be interested in ways of identifying code that is problematic for the CPU's branch-predictor. The Valgrind tool cachegrind has a branch-predictor simulator, enabled by using the --branch-sim=yes flag. Running it over the examples in this question, with the number of outer loops reduced to 10000 and compiled with g++, gives these results:

See the perf tutorial for more details: https://perf.wiki.kernel.org/index.php/Tutorial.

No doubt some of us would be interested in ways of identifying code that is problematic for the CPU's branch-predictor. The Valgrind tool cachegrind has a branch-predictor simulator, enabled by using the --branch-sim=yes flag. Running it over the examples in this question, with the number of outer loops reduced to 10000 and compiled with g++, gives these results:

See the performance tutorial for more details.

Added a section about perf, which may a better choice for linux users than cachegrind.
Source Link

Alternatively, on Linux you can use the performance counters subsystem to accomplish the same task, but with native performance using CPU counters.

perf stat ./sumtest_sorted

Sorted:

 Performance counter stats for './sumtest_sorted':

  11808.095776 task-clock                #    0.998 CPUs utilized          
         1,062 context-switches          #    0.090 K/sec                  
            14 CPU-migrations            #    0.001 K/sec                  
           337 page-faults               #    0.029 K/sec                  
26,487,882,764 cycles                    #    2.243 GHz                    
41,025,654,322 instructions              #    1.55  insns per cycle        
 6,558,871,379 branches                  #  555.455 M/sec                  
       567,204 branch-misses             #    0.01% of all branches        

  11.827228330 seconds time elapsed

Unsorted:

 Performance counter stats for './sumtest_unsorted':

  28877.954344 task-clock                #    0.998 CPUs utilized          
         2,584 context-switches          #    0.089 K/sec                  
            18 CPU-migrations            #    0.001 K/sec                  
           335 page-faults               #    0.012 K/sec                  
65,076,127,595 cycles                    #    2.253 GHz                    
41,032,528,741 instructions              #    0.63  insns per cycle        
 6,560,579,013 branches                  #  227.183 M/sec                  
 1,646,394,749 branch-misses             #   25.10% of all branches        

  28.935500947 seconds time elapsed

It can also do source code annotation with dissassembly.

perf record -e branch-misses ./sumtest_unsorted
perf annotate -d sumtest_unsorted
Percent | Source code & Disassembly of sumtest_unsorted------------------------------------------------...: sum += data[c];0.00: 400a1a: mov-0x14(%rbp),%eax39.97: 400a1d: mov %eax,%eax5.31: 400a1f: mov-0x20040(%rbp,%rax,4),%eax4.60: 400a26: cltq0.00: 400a28: add %rax,-0x30(%rbp)...

See the perf tutorial for more details: https://perf.wiki.kernel.org/index.php/Tutorial.


Alternatively, on Linux you can use the performance counters subsystem to accomplish the same task, but with native performance using CPU counters.

perf stat ./sumtest_sorted

Sorted:

 Performance counter stats for './sumtest_sorted':

  11808.095776 task-clock                #    0.998 CPUs utilized          
         1,062 context-switches          #    0.090 K/sec                  
            14 CPU-migrations            #    0.001 K/sec                  
           337 page-faults               #    0.029 K/sec                  
26,487,882,764 cycles                    #    2.243 GHz                    
41,025,654,322 instructions              #    1.55  insns per cycle        
 6,558,871,379 branches                  #  555.455 M/sec                  
       567,204 branch-misses             #    0.01% of all branches        

  11.827228330 seconds time elapsed

Unsorted:

 Performance counter stats for './sumtest_unsorted':

  28877.954344 task-clock                #    0.998 CPUs utilized          
         2,584 context-switches          #    0.089 K/sec                  
            18 CPU-migrations            #    0.001 K/sec                  
           335 page-faults               #    0.012 K/sec                  
65,076,127,595 cycles                    #    2.253 GHz                    
41,032,528,741 instructions              #    0.63  insns per cycle        
 6,560,579,013 branches                  #  227.183 M/sec                  
 1,646,394,749 branch-misses             #   25.10% of all branches        

  28.935500947 seconds time elapsed

It can also do source code annotation with dissassembly.

perf record -e branch-misses ./sumtest_unsorted
perf annotate -d sumtest_unsorted
Percent | Source code & Disassembly of sumtest_unsorted------------------------------------------------...: sum += data[c];0.00: 400a1a: mov-0x14(%rbp),%eax39.97: 400a1d: mov %eax,%eax5.31: 400a1f: mov-0x20040(%rbp,%rax,4),%eax4.60: 400a26: cltq0.00: 400a28: add %rax,-0x30(%rbp)...

See the perf tutorial for more details: https://perf.wiki.kernel.org/index.php/Tutorial.

Source Link
caf
  • 234.7k
  • 40
  • 324
  • 464

No doubt some of us would be interested in ways of identifying code that is problematic for the CPU's branch-predictor. The Valgrind tool cachegrind has a branch-predictor simulator, enabled by using the --branch-sim=yes flag. Running it over the examples in this question, with the number of outer loops reduced to 10000 and compiled with g++, gives these results:

Sorted:

==32551== Branches:        656,645,130  (  656,609,208 cond +    35,922 ind)
==32551== Mispredicts:         169,556  (      169,095 cond +       461 ind)
==32551== Mispred rate:            0.0% (          0.0%     +       1.2%   )

Unsorted:

==32555== Branches:        655,996,082  (  655,960,160 cond +  35,922 ind)
==32555== Mispredicts:     164,073,152  (  164,072,692 cond +     460 ind)
==32555== Mispred rate:           25.0% (         25.0%     +     1.2%   )

Drilling down into the line-by-line output produced by cg_annotate we see for the loop in question:

Sorted:

          Bc    Bcm Bi Bim
      10,001      4  0   0      for (unsigned i = 0; i < 10000; ++i)
           .      .  .   .      {
           .      .  .   .          // primary loop
 327,690,000 10,016  0   0          for (unsigned c = 0; c < arraySize; ++c)
           .      .  .   .          {
 327,680,000 10,006  0   0              if (data[c] >= 128)
           0      0  0   0                  sum += data[c];
           .      .  .   .          }
           .      .  .   .      }

Unsorted:

          Bc         Bcm Bi Bim
      10,001           4  0   0      for (unsigned i = 0; i < 10000; ++i)
           .           .  .   .      {
           .           .  .   .          // primary loop
 327,690,000      10,038  0   0          for (unsigned c = 0; c < arraySize; ++c)
           .           .  .   .          {
 327,680,000 164,050,007  0   0              if (data[c] >= 128)
           0           0  0   0                  sum += data[c];
           .           .  .   .          }
           .           .  .   .      }

This lets you easily identify the problematic line - in the unsorted version the if (data[c] >= 128) line is causing 164,050,007 mispredicted conditional branches (Bcm) under cachegrind's branch-predictor model, whereas it's only causing 10,006 in the sorted version.