
[Screenshot: side-by-side asm diff of the code as compiled by GCC and ICC]

I don't really understand what optimization this is; all I know is that it's really fast, and I've tried many options in the manual to no avail. Can anyone explain in detail what optimization this is, and what option makes GCC generate the same asm as ICC, or better?

  • Please no screenshots of text, thanks. Jan 5, 2023 at 19:10
  • For the optimization, please read: stackoverflow.com/questions/11227809/… (proposed by S.O. itself). For the difference between compilers, this is not possible in general: these are two different compilers, so you cannot expect one to give the same performance as the other. This is especially true for ICC, which is a closed-source compiler focusing primarily on high performance and x86-64 processors (especially Intel ones). ICC is especially good for HPC codes, and this is why people pay for it. Jan 5, 2023 at 19:21
  • ICC is "smart" enough to transpose the loops. This is typically done by a high-level transformation optimization pass at the AST level (which is quite unique to ICC, since AFAIK neither GCC, Clang nor MSVC do that). This transposition helps ICC pay the overhead of the inner loop once instead of 100,000 times. That being said, this is not a realistic use-case. Both compilers are apparently not able to remove the outer loop, while this is trivial: one just needs to multiply the value by 100000. Jan 5, 2023 at 19:34
  • The way this language reads, "to make ... compile programs as fast as each other", makes it sound like compilation speed is at issue.
    – Erik Eidt
    Jan 6, 2023 at 0:15

2 Answers


I'm not sure there is an optimization option to make GCC do this optimization. Don't write loops that redo the same work 100k times if you don't want your program to spend time doing that.

Defeating benchmark loops can make compilers look good on benchmarks, but AFAIK it's often not useful in real-world code, where something else happens between runs of the loop you want optimized.


ICC is defeating your benchmark repeat-loop by turning it into this:

    for (unsigned c = 0; c < arraySize; ++c)
    {
        if (data[c] >= 128)
            for (unsigned i = 0; i < 100000; ++i)
                sum += data[c];
    }

The first step, swapping the inner and outer loops, is called loop interchange. Making one pass over the array is good for cache locality and enables further optimizations.

Turning for() if() into if() for(){} else for(){} is called loop unswitching. In this case, there is no "else" work to do; the only thing in the loop was if()sum+=..., so it becomes just an if controlling a repeated-addition loop.


ICC unrolls + vectorizes that sum +=, strangely not just doing it with a multiply. Instead it does 100000 64-bit add operations. ymm0 holds _mm256_set1_epi64x(data[c]) from vpbroadcastq.

This inner loop only runs conditionally; it's worth branching if it's going to save 6250 iterations of this loop. (Only one pass over the array, one branch per element total, not 100k.)

..B1.9:                         # Preds ..B1.9 ..B1.8
        add       edx, 16                                       #20.2
        vpaddq    ymm4, ymm4, ymm0                              #26.5
        vpaddq    ymm3, ymm3, ymm0                              #26.5
        vpaddq    ymm2, ymm2, ymm0                              #26.5
        vpaddq    ymm1, ymm1, ymm0                              #26.5
        cmp       edx, 100000                                   #20.2
        jb        ..B1.9        # Prob 99%                      #20.2

Every iteration does 16 additions, 4 per instruction, unrolled by 4 into separate accumulators that are reduced to 1 and then hsummed after the loop. Unrolling lets Skylake and later run 3 vpaddq per clock cycle.


By contrast, GCC does multiple passes over the array, vectorizing the inner loop to branchlessly do 8 compares at a time:

.L85:
        vmovdqa ymm4, YMMWORD PTR [rax]      # load 8 ints
        add     rax, 32
        vpcmpgtd        ymm0, ymm4, ymm3     # signed-compare them against 128
        vpand   ymm0, ymm0, ymm4             # mask data[c] to 0 or data[c]
        vpmovsxdq       ymm2, xmm0           # sign-extend to 64-bit
        vextracti128    xmm0, ymm0, 0x1
        vpaddq  ymm1, ymm2, ymm1             # and add
        vpmovsxdq       ymm0, xmm0           # sign-extend the high half and add it
        vpaddq  ymm1, ymm0, ymm1             # to the same accumulator
        cmp     rbx, rax
        jne     .L85

This is inside a repeat loop that makes multiple passes over the array, and might bottleneck on 1 shuffle uop per clock, like 8 elements per 3 clock cycles on Skylake.

So it just vectorized the inner if() sum += data[c] like you'd expect, without defeating the repeat loop at all. Clang does something similar, as in Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?


Compilers generate code that is functionally equivalent; there is no reason to assume there is one perfect output for a target from one input. If two compilers produced the same output for a reasonably sized function or project, then one is derived from the other, legally or not.

In general there is no reason to assume any two compilers generate the same output, or that any two versions of the same compiler generate the same output. Amplify that by command-line options that change the output of the compiler.

In general the expectation is that, for your code, one compiler may produce "better" code than another, depending on the definition of better: smaller, or faster on one particular computer, operating system, etc.

GCC is a generic compiler: it does "okay" for each target, but is not great for any target. Some tools that are designed from scratch to aim at one target/system CAN (but do not necessarily) do better.

And then there can be some cheating. Take some code like the possibly intentionally horribly written Dhrystone and Whetstone benchmarks. When compiler X, which perhaps you already paid five figures for or are evaluating to pay for, does not produce code for Dhrystone that is as fast as the free tool: "Oh, sure, try this command-line option for Dhrystone." Hmm, okay, that does work much better (been there, seen this).

GCC has been getting worse since versions 3.x.x/4.x.x, for various reasons, I assume. I assume part of it is that the folks who truly worked at this level are dying off and being replaced with folks without the low-level experience and skills; processors are getting more complicated; and GCC and others have more targets. But the volume of optimizations that older versions caught and newer ones miss is increasing, and the size of the binaries from the same source and same settings is increasing, by a significant amount, not just a tiny bit, for a decent-sized project.

This is not a case of: I need to get the wheel on the car, I have a choice of tools I can use to tighten the nuts, and the result is the same independent of the tool.

There is no reason to expect that you can get any two compilers to generate the same output. Even if both tools generate assembly language, and even if they generated the same sequence of instructions and data, there is no reason to assume the assembly itself would match: different label names, spacing, function ordering, and other syntax differences would make a diff difficult to deal with.
