
I was inspired by this question and am wondering whether it's possible to use multiple SIMD instructions at the same time, since a CPU core may have multiple vector processing units (page 5 of these slides).

The code is:

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    std::cout << "sum = " << sum << "\n"; // use sum so the optimizer can't discard the loops
    return 0;
}

The resulting assembly: compiled for AVX512 and compiled for AVX2.

After inspecting the assembly code, I discovered that the inner loop (the array traversal) was vectorized. In the AVX512 case (-march=knl, Knights Landing), each step of the compiled loop processes 64 elements with 8 SIMD instructions, each of which adds 8 elements to the previous result.

The intermediate results are stored in 4 zmm registers, each holding 8 elements. At the end, the 4 zmm registers are reduced to a single result sum. These SIMD instructions seem to be issued serially, because they all reuse the same zmm5 register for the intermediate value. (A source-level sketch of this unrolling follows the assembly below.)

A piece of the assembly:

# 4 SIMD
vpmovzxdq       zmm5, ymm5    # zero-extends 8 elements from int (32-bit) to long long (64-bit)
vpaddq  zmm1, zmm1, zmm5      # add to the previous result

vpmovzxdq       zmm5, ymm6    # They are using the same zmm5 register           
vpaddq  zmm2, zmm2, zmm5      # so I think they are not parallelized

vpmovzxdq       zmm5, ymm7              
vpaddq  zmm3, zmm3, zmm5

vpmovzxdq       zmm5, ymm8              
vpaddq  zmm4, zmm4, zmm5

# intermediate results stored in zmm1~zmm4
# read another 32 elements and repeat the above routine once more
# in total 8 SIMD instructions and 64 elements in each step of the compiled loop
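Roughly, the unrolling above corresponds to something like the following source-level transformation (my own sketch, not the compiler's actual code; it reuses data and arraySize from the code above). Four independent accumulators break the single loop-carried dependency chain on sum into four shorter chains that can overlap in the pipeline:

// My own sketch, not compiler output: the vectorized loop behaves like this
// scalar unrolling with four independent accumulators, except that each
// "accumulator" is a whole zmm register holding 8 partial sums.
long long sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
for (unsigned c = 0; c < arraySize; c += 4)   // arraySize is a multiple of 4
{
    if (data[c + 0] >= 128) sum0 += data[c + 0];
    if (data[c + 1] >= 128) sum1 += data[c + 1];
    if (data[c + 2] >= 128) sum2 += data[c + 2];
    if (data[c + 3] >= 128) sum3 += data[c + 3];
}
long long sum = sum0 + sum1 + sum2 + sum3;    // final reduction, like zmm1~zmm4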

My question is: according to Intel, Knights Landing CPUs have 2 vector processing units per core (page 5 of these slides). Would it therefore be possible to execute 2 AVX512 SIMD instructions at the same time?

  • Even KNL does some small amount of register-renaming for vector regs, allowing these independent uses of the same architectural register to overlap. The loop-carried dependencies use separate accumulator regs. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators), the top of my answer there, re: register renaming. Dec 8, 2021 at 9:41
  • (And BTW, are you actually tuning for discontinued Xeon Phi compute cards? Normally we'd use -march=skylake-avx512, maybe with -mprefer-vector-width=512, for modern servers (godbolt.org/z/nEbYGYxrr), or -march=icelake-client.) Dec 8, 2021 at 9:44
  • Clang does have a silly missed optimization, though: it uses vmovdqa32 zmm5 {k1}{z}, zmm5 to zero some elements according to the compare result, instead of zero-masking as part of the vpmovzxdq, or better merge-masking as part of the vpaddq (to give more ILP (instruction-level parallelism) by not using the compare result until after the zero-extension, so vpcmpeqd and vpmovzxdq can run in parallel even for the same load result, maybe needing less loop unrolling). Of course if the compiler knew the data was all <= 256, it wouldn't need to widen to avoid overflow. [A sketch of the merge-masking version follows below.] Dec 8, 2021 at 9:57
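To make the merge-masking idea from the last comment concrete, here is a hypothetical intrinsics sketch (my own function name and structure, not clang's output; it assumes only AVX512F, which KNL supports). The compare mask k is consumed only by the masked vpaddq, so the vpmovzxdq widening never has to wait for the compare:

#include <immintrin.h>

// Hypothetical sketch: one step of the sum using merge-masking on the add
// (vpaddq zmm {k}, ...) instead of a separate masked vmovdqa32 to zero
// the rejected elements, as suggested in the comment above.
static inline void sum_step(const int* p, __m512i& acc0, __m512i& acc1)
{
    __m512i v   = _mm512_loadu_si512(p);                        // load 16 ints
    __mmask16 k = _mm512_cmp_epi32_mask(v, _mm512_set1_epi32(128),
                                        _MM_CMPINT_NLT);        // data[c] >= 128
    // vpmovzxdq: zero-extend each ymm half from int (32-bit) to long long (64-bit)
    __m512i lo = _mm512_cvtepu32_epi64(_mm512_castsi512_si256(v));
    __m512i hi = _mm512_cvtepu32_epi64(_mm512_extracti64x4_epi64(v, 1));
    // Merge-masking vpaddq: lanes with a 0 mask bit keep the old accumulator
    // value, so the compare result is only needed here, after the widening.
    acc0 = _mm512_mask_add_epi64(acc0, (__mmask8)k,        acc0, lo);
    acc1 = _mm512_mask_add_epi64(acc1, (__mmask8)(k >> 8), acc1, hi);
}

At the end, the two accumulators can be combined with _mm512_add_epi64 and reduced to a scalar with _mm512_reduce_add_epi64.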
