Revisions to Why is processing a sorted array faster than processing an unsorted array?

I know this edit might seem weird, but if you take a bit precise look, that fits best to the example, as admitted by author itself a bit down below.

Source Link

edited Aug 20, 2022 at 16:47

T.Todua

52.4k
19
232
236

You are thea blind operator of a junction and you hear a train coming. You have no idea which way it is supposed to go. You stop the train to ask the driver which direction they want. And then you set the switch appropriately.

Bounty Ended with 500 reputation awarded by CommunityBot

occurred Jan 21, 2022 at 2:37

added 7 characters in body

Source Link

edited Jul 25, 2021 at 22:06

Zoe stands with Ukraine ♦

26.8k
21
115
149

Trains are heavy and have a lot of inertia. So, so they take forever to start up and slow down.

Modern processors are complicated and have long pipelines. SoThis means they take forever to "warm up" and "slow down".

So howHow would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...

Most applications have well-behaved branches. SoTherefore, modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

So whatWhat can be done?

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So, so there is no difference between the sorted and unsorted data - both are fast.

(Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)
VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.
Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So notNot only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Trains are heavy and have a lot of inertia. So they take forever to start up and slow down.

Modern processors are complicated and have long pipelines. So they take forever to "warm up" and "slow down".

So how would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...

Most applications have well-behaved branches. So modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

So what can be done?

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

(Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)
VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.
Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Trains are heavy and have a lot of inertia, so they take forever to start up and slow down.

Modern processors are complicated and have long pipelines. This means they take forever to "warm up" and "slow down".

How would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...

Most applications have well-behaved branches. Therefore, modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

What can be done?

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move, so there is no difference between the sorted and unsorted data - both are fast.

(Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)
VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.
Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. Not only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

"the branch predictors". No, we're not talking about any specific branch predictors in that sentence, just roll that back and remove the extra word instead of changing it. And although warm-up is normally good, it clashes with "slow down".

Source Link

edited Jun 12, 2021 at 0:43

Peter Cordes

320.9k
45
592
826

Modern processors are complicated and have long pipelines. So they take forever to "warm-up" up" and "slow down".

In other words, you try to identify a pattern and follow it. This is more or less how the branch predictors work.

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

(Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)
VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.
Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, but it isit's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Modern processors are complicated and have long pipelines. So they take forever to "warm-up" and "slow down".

In other words, you try to identify a pattern and follow it. This is more or less how the branch predictors work.

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

(Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)
VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.
Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, but it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Modern processors are complicated and have long pipelines. So they take forever to "warm up" and "slow down".

In other words, you try to identify a pattern and follow it. This is more or less how branch predictors work.

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

(Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)
VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.
Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Some grammatical, spelling and punctuation mistakes corrected.

Source Link

edited Jun 11, 2021 at 16:00

Pawel Veselov

3.9k
7
44
62

Loading

Some Grammatical Mistakes corrected.

Source Link

edit approved Jun 11, 2021 at 16:00

Kashif Iftikhar

284
1
9
28

Loading

Don't use protocol-relative URLs. See: https://nickcraver.com/blog/2017/05/22/https-on-stack-overflow/#mistakes-protocol-relative-urls

Source Link

edited May 11, 2021 at 21:58

Erwin Brandstetter

594.2k
142
1055
1212

Loading

copy-edited

Source Link

edited Jan 10, 2021 at 15:35

Deduplicator

44.4k
7
66
116

Loading

Rollback to Revision 41

Source Link

edited Jul 22, 2020 at 13:59

Ryan Lundy

203.4k
37
179
211

Loading

fixed grammar

Source Link

edited Jul 14, 2020 at 19:01

Abhishek Bhagate

5.5k
3
15
32

Loading

Commonmark migration

Source Link

edited Jun 20, 2020 at 9:12

Community Bot

1
1

Loading

Rollback to Revision 38 - None of the previous edit was an improvement. Removing "on" was not necessary, but not harmful. Some of the other changes were harmful, like introducing a "but" between two things that are both positive, and a "the" before branch predictors. "Is able to" also suits that context better than "can".

Source Link

edited Mar 26, 2020 at 10:57

Peter Cordes

320.9k
45
592
826

Loading

deleted 12 characters in body

Source Link

edited Mar 26, 2020 at 10:31

Arsen Khachaturyan

7.8k
4
40
42

Loading

Bounty Ended with 100 reputation awarded by CommunityBot

occurred Jan 29, 2020 at 21:34

cmov can be slower than branchy, especially when GCC does it wrong. Link a followup Q&A

Source Link

edited Dec 28, 2019 at 2:46

Peter Cordes

320.9k
45
592
826

Loading

added 3 characters in body

Source Link

edited Dec 28, 2019 at 1:25

Cœur

36.9k
25
192
262

Loading

Bounty Ended with 50 reputation awarded by Meraj al Maksud

occurred Nov 15, 2019 at 18:07

Second iteration.

Source Link

edited May 27, 2019 at 12:42

Peter Mortensen

31k
21
104
130

Loading

Active reading [<http://en.wikipedia.org/wiki/NetBeans> <https://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler>]. Introduced abbr. "ICC" - "Intel C++ Compiler" is a proper noun.

Source Link

edited May 27, 2019 at 12:36

Peter Mortensen

31k
21
104
130

Loading

Rollback to Revision 30

Source Link

edited May 4, 2019 at 19:55

Cody Gray ♦

237.9k
50
486
572

Loading

added 2 characters in body

Source Link

edited May 2, 2019 at 18:43

Alec

8.4k
8
35
62

Loading

Rollback to Revision 30

Source Link

edited Apr 10, 2019 at 4:49

royhowie

11k
14
49
67

Loading

In each case the image is described by the text above it, thus the alt text is redundant. Testing an edit re: https://meta.stackoverflow.com/questions/382581/i-couldnt-submit-my-edit-on-an-answer-due-to-unformatted-code-yet-i-didnt-edi

Source Link