
Edit 2: It seems like occasionally my CPU runs the unsorted version as fast as the sorted one. On other machines they're consistently the same speed. I guess I'm not benchmarking correctly, or there are subtle things going on behind the scenes.

Edit: ASM code below. The generated code is the same in the main loop. Can someone confirm whether it's faster or the same? I'm using Clang 10 and I'm on a Ryzen 3600. According to Compiler Explorer there are no branches; the add uses a SIMD instruction that adds either A or B based on a mask. The mask comes from the >= 128 comparison, and B is a vector of zeros. So no, there is no hidden branch from what I can see.

I thought the very popular question Why is processing a sorted array faster than processing an unsorted array? was silly, so I tried it in Clang.

With g++ -O2 it takes 2 seconds; with Clang it took 0.32. I tried making it branchless like the code below and found that even though it's branchless, sorting still makes it faster. What's going on?! (It also seems like -O2 vectorizes the code, so I didn't try to do more.)

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <stdio.h>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            bool v = data[c] >= 128;
            sum += data[c] * v;
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    printf("%f\nsum = %lld\n", elapsedTime, sum);
}

.LBB0_4:                                #   Parent Loop BB0_3 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
    vmovdqa (%rsp,%rcx,4), %xmm5
    vmovdqa 16(%rsp,%rcx,4), %xmm6
    vmovdqa 32(%rsp,%rcx,4), %xmm7
    vmovdqa 48(%rsp,%rcx,4), %xmm0
    addq    $16, %rcx
    vpcmpgtd    %xmm8, %xmm5, %xmm9
    vpcmpgtd    %xmm8, %xmm6, %xmm10
    vpcmpgtd    %xmm8, %xmm7, %xmm11
    vpcmpgtd    %xmm8, %xmm0, %xmm12
    vpand   %xmm5, %xmm9, %xmm5
    vpand   %xmm0, %xmm12, %xmm0
    vpand   %xmm6, %xmm10, %xmm6
    vpand   %xmm7, %xmm11, %xmm7
    vpmovsxdq   %xmm6, %ymm9
    vpmovsxdq   %xmm5, %ymm5
    vpmovsxdq   %xmm7, %ymm6
    vpmovsxdq   %xmm0, %ymm0
    vpaddq  %ymm5, %ymm1, %ymm1
    vpaddq  %ymm2, %ymm9, %ymm2
    vpaddq  %ymm6, %ymm3, %ymm3
    vpaddq  %ymm0, %ymm4, %ymm4
    cmpq    $32768, %rcx            # imm = 0x8000
    jne .LBB0_4
  • bool v = data[c] >= 128; sum += data[c] * v; is still a branch ... you should compare the assembly to see what it's doing ...
    – ChrisMM
    Apr 25, 2020 at 1:09
  • @ChrisMM I deleted my comments to clean up a bit. According to Intel, that and the other instructions all have a fixed latency and throughput. software.intel.com/sites/landingpage/IntrinsicsGuide/…
    Apr 25, 2020 at 1:44
  • I'm not an Intel engineer, but I'd imagine these instructions still use prediction. The microcode would implement an if/else.
    – ChrisMM
    Apr 25, 2020 at 1:52
  • The VPS server I use (shared, so not very consistent) is an Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz. It runs BOTH at the same speed (tried with the same binary and by recompiling the source). My desktop (AMD Ryzen 5 3600 6-Core Processor) seems to occasionally run the unsorted one as fast as the sorted one, but it's usually slower. Strange. I'm probably just bad at benchmarking.
    Apr 25, 2020 at 2:08
  • Switch to using the high-resolution clock in the chrono library. It should be more accurate than the clock function.
    – ChrisMM
    Apr 25, 2020 at 2:19

1 Answer


It's almost certainly branch prediction. The hardware tries to figure out how your conditions will evaluate and gets that branch of code ready, and one signal it leverages is the outcome of previous iterations of the loop. Since your array is sorted, the condition produces a long run of false results followed by a long run of true results, so the only mispredictions happen when the loop starts and when data[c] first becomes >= 128.

In addition, it's possible the compiler is smart enough to optimize for a sorted array. I haven't tested, but it could find the crossover index once and then skip the earlier iterations of your inner loop entirely. That is an optimization you could write yourself as well.
