Why does tree vectorization make this sorting algorithm 2x slower?

Question

The sorting algorithm of this question becomes twice faster(!) if -fprofile-arcs is enabled in gcc (4.7.2). The heavily simplified C code of that question (it turned out that I can initialize the array with all zeros, the weird performance behavior remains but it makes the reasoning much much simpler):

#include <time.h>
#include <stdio.h>

#define ELEMENTS 100000

int main() {
  int a[ELEMENTS] = { 0 };
  clock_t start = clock();
  for (int i = 0; i < ELEMENTS; ++i) {
    int lowerElementIndex = i;
    for (int j = i+1; j < ELEMENTS; ++j) {
      if (a[j] < a[lowerElementIndex]) {
        lowerElementIndex = j;
      }
    }
    int tmp = a[i];
    a[i] = a[lowerElementIndex];
    a[lowerElementIndex] = tmp;
  } 
  clock_t end = clock();
  float timeExec = (float)(end - start) / CLOCKS_PER_SEC;
  printf("Time: %2.3f\n", timeExec);
  printf("ignore this line %d\n", a[ELEMENTS-1]);
}

After playing with the optimization flags for a long while, it turned out that -ftree-vectorize also yields this weird behavior so we can take -fprofile-arcs out of the question. After profiling with perf I have found that the only relevant difference is:

Fast case gcc -std=c99 -O2 simp.c (runs in 3.1s)

    cmpl    %esi, %ecx
    jge .L3
    movl    %ecx, %esi
    movslq  %edx, %rdi
.L3:

Slow case gcc -std=c99 -O2 -ftree-vectorize simp.c (runs in 6.1s)

    cmpl    %ecx, %esi
    cmovl   %edx, %edi
    cmovl   %esi, %ecx

As for the first snippet: Given that the array only contains zeros, we always jump to .L3. It can greatly benefit from branch prediction.

I guess the cmovl instructions cannot benefit from branch prediction.

Questions:

Are all my above guesses correct? Does this make the algorithm slow?
If yes, how can I prevent gcc from emitting this instruction (other than the trivial -fno-tree-vectorization workaround of course) but still doing as much optimizations as possible?
What is this -ftree-vectorization? The documentation is quite vague, I would need a little more explanation to understand what's happening.

Update: Since it came up in comments: The weird performance behavior w.r.t. the -ftree-vectorize flag remains with random data. As Yakk points out, for selection sort, it is actually hard to create a dataset that would result in a lot of branch mispredictions.

Since it also came up: I have a Core i5 CPU.

Based on Yakk's comment, I created a test. The code below (online without boost) is of course no longer a sorting algorithm; I only took out the inner loop. Its only goal is to examine the effect of branch prediction: We skip the if branch in the for loop with probability p.

#include <algorithm>
#include <cstdio>
#include <random>
#include <boost/chrono.hpp>
using namespace std;
using namespace boost::chrono;
constexpr int ELEMENTS=1e+8; 
constexpr double p = 0.50;

int main() {
  printf("p = %.2f\n", p);
  int* a = new int[ELEMENTS];
  mt19937 mt(1759);
  bernoulli_distribution rnd(p);
  for (int i = 0 ; i < ELEMENTS; ++i){
    a[i] = rnd(mt)? i : -i;
  }
  auto start = high_resolution_clock::now();
  int lowerElementIndex = 0;
  for (int i=0; i<ELEMENTS; ++i) {
    if (a[i] < a[lowerElementIndex]) {
      lowerElementIndex = i;
    }
  } 
  auto finish = high_resolution_clock::now();
  printf("%ld  ms\n", duration_cast<milliseconds>(finish-start).count());
  printf("Ignore this line   %d\n", a[lowerElementIndex]);
  delete[] a;
}

The loops of interest:

This will be referred to as cmov

g++ -std=c++11 -O2 -lboost_chrono -lboost_system -lrt branch3.cpp

    xorl    %eax, %eax
.L30:
    movl    (%rbx,%rbp,4), %edx
    cmpl    %edx, (%rbx,%rax,4)
    movslq  %eax, %rdx
    cmovl   %rdx, %rbp
    addq    $1, %rax
    cmpq    $100000000, %rax
    jne .L30

This will be referred to as no cmov, the -fno-if-conversion flag was pointed out by Turix in his answer.

g++ -std=c++11 -O2 -fno-if-conversion -lboost_chrono -lboost_system -lrt branch3.cpp

    xorl    %eax, %eax
.L29:
    movl    (%rbx,%rbp,4), %edx
    cmpl    %edx, (%rbx,%rax,4)
    jge .L28
    movslq  %eax, %rbp
.L28:
    addq    $1, %rax
    cmpq    $100000000, %rax
    jne .L29

The difference side by side

cmpl    %edx, (%rbx,%rax,4) |     cmpl  %edx, (%rbx,%rax,4)
movslq  %eax, %rdx          |     jge   .L28
cmovl   %rdx, %rbp          |     movslq    %eax, %rbp
                            | .L28:

The execution time as a function of the Bernoulli parameter p

effect of branch prediction

The code with the cmov instruction is absolutely insensitive to p. The code without the cmov instruction is the winner if p<0.26 or 0.81<p and is at most 4.38x faster (p=1). Of course, the worse situation for the branch predictor is at around p=0.5 where the code is 1.58x slower than the code with the cmov instruction.

Do you get the same results with real (random) data instead of all zeroes ? — Paul R, Jan 10 '14 at 22:58
Just a guess, but I think the tree part of tree-vectorize means that it's working on the "tree" form of the code (gcc converts your code to a bunch of different forms after parsing it, and tree is one of those forms iirc). Vectorize of course just refers to using vector instructions (SSE, MMX, 3D-Now, etc). — CrazyCasta, Jan 10 '14 at 22:59
Completely random and uniform elements in a selection sort do not cause many branch prediction failures. The first element is k_0 uniform from 0 to N. Each element after that has a k_0/N chance of being smaller, causing a branch. That element is then uniformly selected from 0 to k_0-1. Imagine if it always picked the middle one: then after a logarithmic number of steps, you'd find the smallest one left. On average, that is the performance you get. On a 100000 array, lg is 16, meaning the branch starts out going one way 6000 times more than the other. — Yakk, Jan 11 '14 at 1:27
In effect, your inefficient sort means that each comparison has little entropy, so is highly predictable, so branch prediction is nearly perfect. Now, we can arrange for pathological data. We want each comparison to be 50-50 chance of being bigger or smaller than the current lowest. Set array[i] = +/- i, where the sign is picked randomly, and reprofile (alternating might work, but having a branch predictor that handles alternating would be easy in hardware, so random will be evil enough). — Yakk, Jan 11 '14 at 1:31

Turix · Accepted Answer · 2014-01-19 03:51:28Z

Note: Answered before graph update was added to the question; some assembly code references here may be obsolete.

(Adapted and extended from our above chat, which was stimulating enough to cause me to do a bit more research.)

First (as per our above chat), it appears that the answer to your first question is "yes". In the vector "optimized" code, the optimization (negatively) affecting performance is branch predication, whereas in the original code the performance is (positively) affected by branch prediction. (Note the extra 'a' in the former.)

Re your 3rd question: Even though in your case, there is actually no vectorization being done, from step 11 ("Conditional Execution") here it appears that one of the steps associated with vectorization optimizations is to "flatten" conditionals within targeted loops, like this bit in your loop:

if (a[j] < a[lowerElementIndex]
    lowerElementIndex = j;

Apparently, this happens even if there is no vectorization.

This explains why the compiler is using the conditional move instructions (cmovl). The goal there is to avoid a branch entirely (as opposed to trying to predict it correctly). Instead, the two cmovl instructions will be sent down the pipeline before the result of the previous cmpl is known and the comparison result will then be "forwarded" to enable/prevent the moves prior to their writeback (i.e., prior to them actually taking effect).

Note that if the loop had been vectorized, this might have been worth it to get to the point where multiple iterations through the loop could effectively be accomplished in parallel.

However, in your case, the attempt at optimization actually backfires because in the flattened loop, the two conditional moves are sent through the pipeline every single time through the loop. This in itself might not be so bad either, except that there is a RAW data hazard that causes the second move (cmovl %esi, %ecx) to have to wait until the array/memory access (movl (%rsp,%rsi,4), %esi) is completed, even if the result is going to be ultimately ignored. Hence the huge time spent on that particular cmovl. (I would expect this is an issue with your processor not having complex enough logic built into its predication/forwarding implementation to deal with the hazard.)

On the other hand, in the non-optimized case, as you rightly figured out, branch prediction can help to avoid having to wait on the result of the corresponding array/memory access there (the movl (%rsp,%rcx,4), %ecx instruction). In that case, when the processor correctly predicts a taken branch (which for an all-0 array will be every single time, but [even] in a random array should [still] be ~~roughly~~more than [edited per @Yakk's comment] half the time), it does not have to wait for the memory access to finish to go ahead and queue up the next few instructions in the loop. So in correct predictions, you get a boost, whereas in incorrect predictions, the result is no worse than in the "optimized" case and, furthermore, better because of the ability to sometimes avoid having the 2 "wasted" cmovl instructions in the pipeline.

[The following was removed due to my mistaken assumption about your processor per your comment.]
Back to your questions, I would suggest looking at that link above for more on the flags relevant to vectorization, but in the end, I'm pretty sure that it's fine to ignore that optimization given that your Celeron isn't capable of using it (in this context) anyway.

[Added after above was removed]
Re your second question ("...how can I prevent gcc from emitting this instruction..."), you could try the -fno-if-conversion and -fno-if-conversion2 flags (not sure if these always work -- they no longer work on my mac), although I do not think your problem is with the cmovl instruction in general (i.e., I wouldn't always use those flags), just with its use in this particular context (where branch prediction is going to be very helpful given @Yakk's point about your sort algorithm).

Thanks! Now one more question: What makes you think I have a Celeron? I have a Core i5. How does that change your answer? — Ali, Jan 11 '14 at 1:07
@Ali sorry for the bad assumption. I was going from this bit in the other question: "I forgot to mention that I'm on an old PC with a Intel Celeron at2.8GHz processor and linux (fedora 20 with xfce) installed." which I mistakenly thought was yours, but I now see you just edited. Sorry! And while I'm not sure, I think this should change my answer, as I believe your i5 might support vectorization. (I am going to research that now and will update my answer.) — Turix, Jan 11 '14 at 1:34
No problem at all. Thanks for your efforts, I greatly appreciate it. I have tried and -fno-if-conversion helps but only a little (7.4s reduced to 6.4s) but -fno-tree-vectorize still wins (4.7s). Tested on an evil data set that does its best to break the branch prediction. Now, please check my answer to Yakk's comments. — Ali, Jan 11 '14 at 17:12
@Ali Interesting. I'd be interested in seeing the rest of the assembly code for the s_sort function (i.e., the "outer loop"), and the full profile of the function. I'm wondering if the attempt to vectorize the "inner loop" screws up the obvious optimization of the outer loop. (I can imagine a couple ways this might happen, including for example requiring another unnecessary lookup of a[i].) — Turix, Jan 11 '14 at 19:05
@Ali also, I'm assuming you looked at the assembly code in the -fno-if-conversion case and it seemed sane. I was assuming it would be nearly identical to the -fno-tree-vectorize case, so I'm also interested in what the main differences were that remained from the attempt to vectorize. — Turix, Jan 11 '14 at 19:07

asked	3 years, 6 months ago
viewed	811 times
active	3 years, 6 months ago

Why does tree vectorization make this sorting algorithm 2x slower?

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged c++ c performance gcc assembly or ask your own question.

Linked

Hot Network Questions

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged c++ c performance gcc assembly or ask your own question.

Linked

Related