Revisions to Why is processing a sorted array faster than processing an unsorted array?

Bounty Ended with 500 reputation awarded by richardec

occurred Jan 21 at 2:37

added 7 characters in body

edited Jul 25, 2021 at 22:06

Trains are heavy and have a lot of inertia. So they take forever to start up and slow down.

Modern processors are complicated and have long pipelines. So they take forever to "warm up" and "slow down".

So how would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...

Most applications have well-behaved branches. So modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

So what can be done?

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

(Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)
VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.
Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Trains are heavy and have a lot of inertia, so they take forever to start up and slow down.

Modern processors are complicated and have long pipelines. This means they take forever to "warm up" and "slow down".

How would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...

Most applications have well-behaved branches. Therefore, modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.
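To make the "look at the past history" idea concrete, here is a minimal sketch of a toy predictor: a single 2-bit saturating counter. This is a textbook scheme, not what any particular CPU implements, and the names `predictor_hit_rate` and `branch_outcomes` as well as the `threshold` parameter are hypothetical, chosen only for illustration. A counter like this learns a bias toward one direction; it cannot follow an alternating pattern, which real predictors handle by also tracking branch history.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (an assumption for illustration): one 2-bit saturating counter
// tracking a single branch. States 0 and 1 predict "not taken";
// states 2 and 3 predict "taken".
// Returns the fraction of outcomes that were predicted correctly.
double predictor_hit_rate(const std::vector<bool>& outcomes) {
    assert(!outcomes.empty());
    int state = 2;  // start at "weakly taken" (arbitrary choice)
    std::size_t hits = 0;
    for (bool taken : outcomes) {
        const bool predicted = state >= 2;
        if (predicted == taken) ++hits;
        // Move the counter one step toward the actual outcome.
        if (taken && state < 3) ++state;
        if (!taken && state > 0) --state;
    }
    return static_cast<double>(hits) / outcomes.size();
}

// Records, per element, whether a test of the form `value >= threshold`
// would be taken. The shape of the test is assumed, not taken from the text.
std::vector<bool> branch_outcomes(const std::vector<int>& data, int threshold) {
    std::vector<bool> out;
    out.reserve(data.size());
    for (int value : data) out.push_back(value >= threshold);
    return out;
}
```

Fed the outcomes for sorted data, this counter mispredicts only near the point where the outcome flips from "not taken" to "taken". Fed the outcomes for shuffled data, it has no stable bias to learn, so it should do far worse.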

What can be done?

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move, so there is no difference between the sorted and unsorted data - both are fast.

(Or somewhat fast: for the already-sorted case, cmov can be slower, especially if GCC puts it on the critical path instead of just add, and especially on Intel before Broadwell, where cmov has 2-cycle latency. See: gcc optimization flag -O3 makes code slower than -O2)
VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.
Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. Not only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).
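The compiler behaviors above can be sketched by hand. The loop shape below (an outer repeat loop around an inner pass over the array, with a `value >= threshold` test) is an assumption, since the benchmark itself is not shown in this text, and the function names are hypothetical. The ternary form is only one way to write the branch-free version; whether a compiler turns it into a conditional move or vectorizes it depends on the compiler, the flags, and the target.

```cpp
#include <cassert>
#include <vector>

// Branchy version: the data-dependent test sits in the inner loop,
// so it is evaluated (and predicted) repeats * data.size() times.
long long sum_branchy(const std::vector<int>& data, int threshold, int repeats) {
    assert(repeats >= 0);
    long long sum = 0;
    for (int r = 0; r < repeats; ++r)
        for (int value : data)
            if (value >= threshold) sum += value;
    return sum;
}

// Branch-free formulation: every element contributes either itself or 0.
// A compiler may emit a conditional move or SIMD code here, but that is
// not guaranteed.
long long sum_branchless(const std::vector<int>& data, int threshold, int repeats) {
    assert(repeats >= 0);
    long long sum = 0;
    for (int r = 0; r < repeats; ++r)
        for (int value : data)
            sum += (value >= threshold) ? value : 0;
    return sum;
}

// Hand-written loop interchange: each element is tested once in the outer
// loop, then added `repeats` times. Integer addition is order-independent,
// so the result matches the two versions above.
long long sum_interchanged(const std::vector<int>& data, int threshold, int repeats) {
    assert(repeats >= 0);
    long long sum = 0;
    for (int value : data)
        if (value >= threshold)
            for (int r = 0; r < repeats; ++r) sum += value;
    return sum;
}
```

All three return the same total for the same input. The interchanged version evaluates the unpredictable test only once per element rather than once per element per repeat, which is the effect the text attributes to ICC.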

"the branch predictors". No, we're not talking about any specific branch predictors in that sentence, just roll that back and remove the extra word instead of changing it. And although warm-up is normally good, it clashes with "slow down".

edited Jun 12, 2021 at 0:43

Peter Cordes

Modern processors are complicated and have long pipelines. So they take forever to "warm-up" and "slow down".

In other words, you try to identify a pattern and follow it. This is more or less how the branch predictors work.

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

(Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)
VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.
Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, but it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Modern processors are complicated and have long pipelines. So they take forever to "warm up" and "slow down".

In other words, you try to identify a pattern and follow it. This is more or less how branch predictors work.

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

(Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)
VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.
Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Some grammatical, spelling and punctuation mistakes corrected.

edited Jun 11, 2021 at 16:00

Pawel Veselov

In other words, you try to identify a pattern and follow it. This is more or less how the branch predictors work.

Some Grammatical Mistakes corrected.

edit approved Jun 11, 2021 at 16:00

Kashif Iftikhar

Don't use protocol-relative URLs. See: https://nickcraver.com/blog/2017/05/22/https-on-stack-overflow/#mistakes-protocol-relative-urls

edited May 11, 2021 at 21:58

Erwin Brandstetter

copy-edited

edited Jan 10, 2021 at 15:35

Deduplicator

Rollback to Revision 41

edited Jul 22, 2020 at 13:59

Ryan Lundy

fixed grammar

edited Jul 14, 2020 at 19:01

Abhishek Bhagate

Commonmark migration

edited Jun 20, 2020 at 9:12

Community Bot

Rollback to Revision 38 - None of the previous edit was an improvement. Removing "on" was not necessary, but not harmful. Some of the other changes were harmful, like introducing a "but" between two things that are both positive, and a "the" before branch predictors. "Is able to" also suits that context better than "can".

edited Mar 26, 2020 at 10:57

Peter Cordes

deleted 12 characters in body

edited Mar 26, 2020 at 10:31

Arsen Khachaturyan

Bounty Ended with 100 reputation awarded by CommunityBot

occurred Jan 29, 2020 at 21:34

cmov can be slower than branchy, especially when GCC does it wrong. Link a followup Q&A

edited Dec 28, 2019 at 2:46

Peter Cordes

added 3 characters in body

edited Dec 28, 2019 at 1:25

Cœur

Bounty Ended with 50 reputation awarded by Meraj al Maksud

occurred Nov 15, 2019 at 18:07

Second iteration.

edited May 27, 2019 at 12:42

Peter Mortensen

Active reading [<http://en.wikipedia.org/wiki/NetBeans> <https://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler>]. Introduced abbr. "ICC" - "Intel C++ Compiler" is a proper noun.

edited May 27, 2019 at 12:36

Peter Mortensen

Rollback to Revision 30

edited May 4, 2019 at 19:55

Cody Gray ♦

added 2 characters in body

edited May 2, 2019 at 18:43

Alec

Rollback to Revision 30

edited Apr 10, 2019 at 4:49

royhowie

In each case the image is described by the text above it, thus the alt text is redundant. Testing an edit re: https://meta.stackoverflow.com/questions/382581/i-couldnt-submit-my-edit-on-an-answer-due-to-unformatted-code-yet-i-didnt-edi
