I know this edit might seem weird, but if you take a bit precise look, that fits best to the example, as admitted by author itself a bit down below.
Source Link
T.Todua
  • 52.4k
  • 19
  • 232
  • 236

You are thea blind operator of a junction and you hear a train coming. You have no idea which way it is supposed to go. You stop the train to ask the driver which direction they want. And then you set the switch appropriately.

You are the operator of a junction and you hear a train coming. You have no idea which way it is supposed to go. You stop the train to ask the driver which direction they want. And then you set the switch appropriately.

You are a blind operator of a junction and you hear a train coming. You have no idea which way it is supposed to go. You stop the train to ask the driver which direction they want. And then you set the switch appropriately.

Bounty Ended with 500 reputation awarded by CommunityBot
added 7 characters in body
Source Link

Trains are heavy and have a lot of inertia. So, so they take forever to start up and slow down.

Modern processors are complicated and have long pipelines. SoThis means they take forever to "warm up" and "slow down".

So howHow would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...

Most applications have well-behaved branches. SoTherefore, modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

So whatWhat can be done?

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So, so there is no difference between the sorted and unsorted data - both are fast.

    (Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)

  • VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

  • Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So notNot only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...

  • If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Trains are heavy and have a lot of inertia. So they take forever to start up and slow down.

Modern processors are complicated and have long pipelines. So they take forever to "warm up" and "slow down".

So how would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...

Most applications have well-behaved branches. So modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

So what can be done?

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

    (Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)

  • VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

  • Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...

  • If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Trains are heavy and have a lot of inertia, so they take forever to start up and slow down.

Modern processors are complicated and have long pipelines. This means they take forever to "warm up" and "slow down".

How would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...

Most applications have well-behaved branches. Therefore, modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

What can be done?

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move, so there is no difference between the sorted and unsorted data - both are fast.

    (Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)

  • VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

  • Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. Not only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...

  • If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

"the branch predictors". No, we're not talking about any specific branch predictors in that sentence, just roll that back and remove the extra word instead of changing it. And although warm-up is normally good, it clashes with "slow down".
Source Link
Peter Cordes
  • 320.9k
  • 45
  • 592
  • 826

Modern processors are complicated and have long pipelines. So they take forever to "warm-up" up" and "slow down".

In other words, you try to identify a pattern and follow it. This is more or less how the branch predictors work.

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

    (Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)

  • VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

  • Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, but it isit's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...

  • If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Modern processors are complicated and have long pipelines. So they take forever to "warm-up" and "slow down".

In other words, you try to identify a pattern and follow it. This is more or less how the branch predictors work.

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

    (Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)

  • VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

  • Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, but it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...

  • If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Modern processors are complicated and have long pipelines. So they take forever to "warm up" and "slow down".

In other words, you try to identify a pattern and follow it. This is more or less how branch predictors work.

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

    (Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)

  • VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

  • Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...

  • If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

Some grammatical, spelling and punctuation mistakes corrected.
Source Link
Pawel Veselov
  • 3.9k
  • 7
  • 44
  • 62
Loading
Don't use protocol-relative URLs. See: https://nickcraver.com/blog/2017/05/22/https-on-stack-overflow/#mistakes-protocol-relative-urls
Source Link
Erwin Brandstetter
  • 594.2k
  • 142
  • 1055
  • 1212
Loading
copy-edited
Source Link
Deduplicator
  • 44.4k
  • 7
  • 66
  • 116
Loading
Rollback to Revision 41
Source Link
Ryan Lundy
  • 203.4k
  • 37
  • 179
  • 211
Loading
fixed grammar
Source Link
Abhishek Bhagate
  • 5.5k
  • 3
  • 15
  • 32
Loading
Commonmark migration
Source Link
Loading
Rollback to Revision 38 - None of the previous edit was an improvement. Removing "on" was not necessary, but not harmful. Some of the other changes were harmful, like introducing a "but" between two things that are both positive, and a "the" before branch predictors. "Is able to" also suits that context better than "can".
Source Link
Peter Cordes
  • 320.9k
  • 45
  • 592
  • 826
Loading
deleted 12 characters in body
Source Link
Loading
Bounty Ended with 100 reputation awarded by CommunityBot
cmov can be slower than branchy, especially when GCC does it wrong. Link a followup Q&A
Source Link
Peter Cordes
  • 320.9k
  • 45
  • 592
  • 826
Loading
added 3 characters in body
Source Link
Cœur
  • 36.9k
  • 25
  • 192
  • 262
Loading
Bounty Ended with 50 reputation awarded by Meraj al Maksud
Second iteration.
Source Link
Peter Mortensen
  • 31k
  • 21
  • 104
  • 130
Loading
Active reading [<http://en.wikipedia.org/wiki/NetBeans> <https://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler>]. Introduced abbr. "ICC" - "Intel C++ Compiler" is a proper noun.
Source Link
Peter Mortensen
  • 31k
  • 21
  • 104
  • 130
Loading
Rollback to Revision 30
Source Link
Cody Gray
  • 237.9k
  • 50
  • 486
  • 572
Loading
added 2 characters in body
Source Link
Alec
  • 8.4k
  • 8
  • 35
  • 62
Loading
Rollback to Revision 30
Source Link
royhowie
  • 11k
  • 14
  • 49
  • 67
Loading
In each case the image is described by the text above it, thus the alt text is redundant. Testing an edit re: https://meta.stackoverflow.com/questions/382581/i-couldnt-submit-my-edit-on-an-answer-due-to-unformatted-code-yet-i-didnt-edi
Source Link
BSMP
  • 4.6k
  • 8
  • 33
  • 44
Loading
added meaningful texts as image alts
Source Link
Neuron
  • 5k
  • 5
  • 38
  • 58
Loading
Rollback to Revision 27
Source Link
cs95
  • 373k
  • 95
  • 688
  • 739
Loading