Revisions to Why is processing a sorted array faster than processing an unsorted array?

copy-edited

Source Link

edited Jan 10, 2021 at 17:22

43.1k
6
62
109

On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):

Scenario	Time (seondsseconds)
Branching - Random data	8.885
Branching - Sorted data	1.528
Branchless - Random data	3.716
Branchless - Sorted data	3.71

Scenario	Time (seondsseconds)
Branching - Random data	11.302
Branching - Sorted data	1.830
Branchless - Random data	2.736
Branchless - Sorted data	2.737

On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):

Scenario	Time (seonds)
Branching - Random data	8.885
Branching - Sorted data	1.528
Branchless - Random data	3.716
Branchless - Sorted data	3.71

Scenario	Time (seonds)
Branching - Random data	11.302
Branching - Sorted data	1.830
Branchless - Random data	2.736
Branchless - Sorted data	2.737

On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is:

Scenario	Time (seconds)
Branching - Random data	8.885
Branching - Sorted data	1.528
Branchless - Random data	3.716
Branchless - Sorted data	3.71

Scenario	Time (seconds)
Branching - Random data	11.302
Branching - Sorted data	1.830
Branchless - Random data	2.736
Branchless - Sorted data	2.737

copy-edited

Source Link

edited Jan 10, 2021 at 15:45

Deduplicator

43.1k
6
62
109

The reason why performance improves drastically when the data is sorted is that the branch prediction penalty is removed, as explained beautifully in Mysticial's answer Mysticial's answer.

On an Intel Core i7 Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):

//  Branch - Random
seconds = 8.885

//  Branch - Sorted
seconds = 1.528

//  Branchless - Random
seconds = 3.716

//  Branchless - Sorted
seconds = 3.71

Scenario	Time (seonds)
Branching - Random data	8.885
Branching - Sorted data	1.528
Branchless - Random data	3.716
Branchless - Sorted data	3.71

//  Branch - Random
seconds = 11.302

//  Branch - Sorted
 seconds = 1.830

//  Branchless - Random
seconds = 2.736

//  Branchless - Sorted
seconds = 2.737

Scenario	Time (seonds)
Branching - Random data	11.302
Branching - Sorted data	1.830
Branchless - Random data	2.736
Branchless - Sorted data	2.737

In a typical x86 processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages. So we do not have to wait for one instruction to finish to start a new one. This is called pipelining pipelining.

The reason why performance improves drastically when the data is sorted is that the branch prediction penalty is removed, as explained beautifully in Mysticial's answer.

On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):

//  Branch - Random
seconds = 8.885

//  Branch - Sorted
seconds = 1.528

//  Branchless - Random
seconds = 3.716

//  Branchless - Sorted
seconds = 3.71

//  Branch - Random
seconds = 11.302

//  Branch - Sorted
 seconds = 1.830

//  Branchless - Random
seconds = 2.736

//  Branchless - Sorted
seconds = 2.737

In a typical x86 processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages. So we do not have to wait for one instruction to finish to start a new one. This is called pipelining.

The reason why performance improves drastically when the data is sorted is that the branch prediction penalty is removed, as explained beautifully in Mysticial's answer.

On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):

Scenario	Time (seonds)
Branching - Random data	8.885
Branching - Sorted data	1.528
Branchless - Random data	3.716
Branchless - Sorted data	3.71

Scenario	Time (seonds)
Branching - Random data	11.302
Branching - Sorted data	1.830
Branchless - Random data	2.736
Branchless - Sorted data	2.737

In a typical x86 processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages. So we do not have to wait for one instruction to finish to start a new one. This is called pipelining.

minor edit; fixed grammar.

Source Link

edit approved Oct 13, 2020 at 21:47

Nil

302
1
13

In a conditional move case, the execution conditional move instruction is divided into several stages, but the earlier stages like Fetch and Decode doesdo not depend on the result of the previous instruction; only latter stages need the result. Thus, we wait a fraction of one instruction's execution time. This is why the conditional move version is slower than the branch when the prediction is easy.

The book Computer Systems: A Programmer's Perspective, second edition explains this in detail. You can check Section 3.6.6 for Conditional Move Instructions, entire Chapter 4 for Processor Architecture, and Section 5.11.2 for a special treatment for Branch Prediction and Misprediction Penalties.

Sometimes, some modern compilers can optimize our code to assembly with better performance, sometimes some compilers can't (the code in question is using Visual Studio's native compiler). Knowing the performance difference between a branch and a conditional move when unpredictable can help us write code with better performance when the scenario gets so complex that the compiler can not optimize them automatically.

In a conditional move case, the execution conditional move instruction is divided into several stages, but the earlier stages like Fetch and Decode does not depend on the result of the previous instruction; only latter stages need the result. Thus, we wait a fraction of one instruction's execution time. This is why the conditional move version is slower than the branch when prediction is easy.

The book Computer Systems: A Programmer's Perspective, second edition explains this in detail. You can check Section 3.6.6 for Conditional Move Instructions, entire Chapter 4 for Processor Architecture, and Section 5.11.2 for a special treatment for Branch Prediction and Misprediction Penalties.

Sometimes, some modern compilers can optimize our code to assembly with better performance, sometimes some compilers can't (the code in question is using Visual Studio's native compiler). Knowing the performance difference between branch and conditional move when unpredictable can help us write code with better performance when the scenario gets so complex that the compiler can not optimize them automatically.

In a conditional move case, the execution conditional move instruction is divided into several stages, but the earlier stages like Fetch and Decode do not depend on the result of the previous instruction; only latter stages need the result. Thus, we wait a fraction of one instruction's execution time. This is why the conditional move version is slower than the branch when the prediction is easy.

The book Computer Systems: A Programmer's Perspective, second edition explains this in detail. You can check Section 3.6.6 for Conditional Move Instructions, entire Chapter 4 for Processor Architecture, and Section 5.11.2 for special treatment for Branch Prediction and Misprediction Penalties.

Sometimes, some modern compilers can optimize our code to assembly with better performance, sometimes some compilers can't (the code in question is using Visual Studio's native compiler). Knowing the performance difference between a branch and a conditional move when unpredictable can help us write code with better performance when the scenario gets so complex that the compiler can not optimize them automatically.