copy-edited
Source Link
Deduplicator
  • 43.1k
  • 6
  • 62
  • 109

On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):

Scenario Time (seondsseconds)
Branching - Random data 8.885
Branching - Sorted data 1.528
Branchless - Random data 3.716
Branchless - Sorted data 3.71
Scenario Time (seondsseconds)
Branching - Random data 11.302
Branching - Sorted data 1.830
Branchless - Random data 2.736
Branchless - Sorted data 2.737

On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):

Scenario Time (seonds)
Branching - Random data 8.885
Branching - Sorted data 1.528
Branchless - Random data 3.716
Branchless - Sorted data 3.71
Scenario Time (seonds)
Branching - Random data 11.302
Branching - Sorted data 1.830
Branchless - Random data 2.736
Branchless - Sorted data 2.737

On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is:

Scenario Time (seconds)
Branching - Random data 8.885
Branching - Sorted data 1.528
Branchless - Random data 3.716
Branchless - Sorted data 3.71
Scenario Time (seconds)
Branching - Random data 11.302
Branching - Sorted data 1.830
Branchless - Random data 2.736
Branchless - Sorted data 2.737
copy-edited
Source Link
Deduplicator
  • 43.1k
  • 6
  • 62
  • 109

The reason why performance improves drastically when the data is sorted is that the branch prediction penalty is removed, as explained beautifully in Mysticial's answerMysticial's answer.

On an Intel Core i7Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):

//  Branch - Random
seconds = 8.885

//  Branch - Sorted
seconds = 1.528

//  Branchless - Random
seconds = 3.716

//  Branchless - Sorted
seconds = 3.71
ScenarioTime (seonds)
Branching - Random data8.885
Branching - Sorted data1.528
Branchless - Random data3.716
Branchless - Sorted data3.71
//  Branch - Random
seconds = 11.302

//  Branch - Sorted
 seconds = 1.830

//  Branchless - Random
seconds = 2.736

//  Branchless - Sorted
seconds = 2.737
ScenarioTime (seonds)
Branching - Random data11.302
Branching - Sorted data1.830
Branchless - Random data2.736
Branchless - Sorted data2.737

In a typical x86 processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages. So we do not have to wait for one instruction to finish to start a new one. This is called pipeliningpipelining.

The reason why performance improves drastically when the data is sorted is that the branch prediction penalty is removed, as explained beautifully in Mysticial's answer.

On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):

//  Branch - Random
seconds = 8.885

//  Branch - Sorted
seconds = 1.528

//  Branchless - Random
seconds = 3.716

//  Branchless - Sorted
seconds = 3.71
//  Branch - Random
seconds = 11.302

//  Branch - Sorted
 seconds = 1.830

//  Branchless - Random
seconds = 2.736

//  Branchless - Sorted
seconds = 2.737

In a typical x86 processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages. So we do not have to wait for one instruction to finish to start a new one. This is called pipelining.

The reason why performance improves drastically when the data is sorted is that the branch prediction penalty is removed, as explained beautifully in Mysticial's answer.

On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):

ScenarioTime (seonds)
Branching - Random data8.885
Branching - Sorted data1.528
Branchless - Random data3.716
Branchless - Sorted data3.71
ScenarioTime (seonds)
Branching - Random data11.302
Branching - Sorted data1.830
Branchless - Random data2.736
Branchless - Sorted data2.737

In a typical x86 processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages. So we do not have to wait for one instruction to finish to start a new one. This is called pipelining.

In a conditional move case, the execution conditional move instruction is divided into several stages, but the earlier stages like Fetch and Decode doesdo not depend on the result of the previous instruction; only latter stages need the result. Thus, we wait a fraction of one instruction's execution time. This is why the conditional move version is slower than the branch when the prediction is easy.

The book Computer Systems: A Programmer's Perspective, second edition explains this in detail. You can check Section 3.6.6 for Conditional Move Instructions, entire Chapter 4 for Processor Architecture, and Section 5.11.2 for a special treatment for Branch Prediction and Misprediction Penalties.

Sometimes, some modern compilers can optimize our code to assembly with better performance, sometimes some compilers can't (the code in question is using Visual Studio's native compiler). Knowing the performance difference between a branch and a conditional move when unpredictable can help us write code with better performance when the scenario gets so complex that the compiler can not optimize them automatically.

In a conditional move case, the execution conditional move instruction is divided into several stages, but the earlier stages like Fetch and Decode does not depend on the result of the previous instruction; only latter stages need the result. Thus, we wait a fraction of one instruction's execution time. This is why the conditional move version is slower than the branch when prediction is easy.

The book Computer Systems: A Programmer's Perspective, second edition explains this in detail. You can check Section 3.6.6 for Conditional Move Instructions, entire Chapter 4 for Processor Architecture, and Section 5.11.2 for a special treatment for Branch Prediction and Misprediction Penalties.

Sometimes, some modern compilers can optimize our code to assembly with better performance, sometimes some compilers can't (the code in question is using Visual Studio's native compiler). Knowing the performance difference between branch and conditional move when unpredictable can help us write code with better performance when the scenario gets so complex that the compiler can not optimize them automatically.

In a conditional move case, the execution conditional move instruction is divided into several stages, but the earlier stages like Fetch and Decode do not depend on the result of the previous instruction; only latter stages need the result. Thus, we wait a fraction of one instruction's execution time. This is why the conditional move version is slower than the branch when the prediction is easy.

The book Computer Systems: A Programmer's Perspective, second edition explains this in detail. You can check Section 3.6.6 for Conditional Move Instructions, entire Chapter 4 for Processor Architecture, and Section 5.11.2 for special treatment for Branch Prediction and Misprediction Penalties.

Sometimes, some modern compilers can optimize our code to assembly with better performance, sometimes some compilers can't (the code in question is using Visual Studio's native compiler). Knowing the performance difference between a branch and a conditional move when unpredictable can help us write code with better performance when the scenario gets so complex that the compiler can not optimize them automatically.

Linked to the answer instead of the user.
Source Link
Peter Mortensen
  • 29.8k
  • 21
  • 98
  • 124
Loading
replaced http://stackoverflow.com/ with https://stackoverflow.com/
Source Link
URL Rewriter Bot
URL Rewriter Bot
Loading
Rollback to Revision 12
Source Link
user1306322
  • 8.2k
  • 15
  • 57
  • 117
Loading
added 4 characters in body
Source Link
Durgpal Singh
  • 9.5k
  • 4
  • 34
  • 47
Loading
Some copy editing (ref. <http://english.stackexchange.com/questions/15953>).
Source Link
Peter Mortensen
  • 29.8k
  • 21
  • 98
  • 124
Loading
Mod Removes Wiki by Shog9
Post Made Community Wiki by George Stocker
added 43 characters in body
Source Link
Marco Bonelli
  • 54.7k
  • 20
  • 104
  • 110
Loading
Bounty Ended with 150 reputation awarded by DatEpicCoderGuyWhoPrograms
Rollback to Revision 8
Source Link
Yaakov Ellis
  • 39.3k
  • 26
  • 124
  • 172
Loading
Clarify details, meaning not changed
Source Link
Loading
Copy edited. Named a link.
Source Link
Peter Mortensen
  • 29.8k
  • 21
  • 98
  • 124
Loading
Added conditional move instruction explanation and some others.
Source Link
WiSaGaN
  • 44.1k
  • 9
  • 52
  • 83
Loading
Minor correction.
Source Link
WiSaGaN
  • 44.1k
  • 9
  • 52
  • 83
Loading
Rephrase.
Source Link
WiSaGaN
  • 44.1k
  • 9
  • 52
  • 83
Loading
Corrected optimization-level.
Source Link
WiSaGaN
  • 44.1k
  • 9
  • 52
  • 83
Loading
The original VS2010 results were compiled in x86 mode instead of x86_64, corrected it and added x86_64 results.
Source Link
WiSaGaN
  • 44.1k
  • 9
  • 52
  • 83
Loading
Added explanation.
Source Link
WiSaGaN
  • 44.1k
  • 9
  • 52
  • 83
Loading
Source Link
WiSaGaN
  • 44.1k
  • 9
  • 52
  • 83
Loading