This question already has an answer here:

Note: the alleged duplicate question is, I think, mostly related to "<" and ">" comparison, but not "==" comparison and hence does not answer my question about performance of the "==" operator.

For a long time I have believed that "processing" a sorted array should be faster than an unsorted array. At first I thought using "==" in sorted array should be faster than in unsorted array because - I guess - of how branch prediction works:

UNSORTEDARRAY:

5 == 100 F
43 == 100 F
100 == 100 T
250 == 100 F
6 == 100 F
(other elements to check)

SORTEDARRAY:

5 == 100 F
6 == 100 F
43 == 100 F
100 == 100 T
(no need to check other elements, so all are F)

so I guess SORTEDARRAY should be faster than UNSORTEDARRAY, but today I used the code to generate 2 arrays in a header to test, and branch prediction seemed not to work as I thought.

I generated an unsorted array and a sorted array to test:

srand(time(NULL));
int UNSORTEDARRAY[524288];
int SORTEDARRAY[sizeof(UNSORTEDARRAY)/sizeof(int)];
for(int i=0;i<sizeof(SORTEDARRAY)/sizeof(int);i++){
    SORTEDARRAY[i]=UNSORTEDARRAY[i]=rand();
}
sort(SORTEDARRAY,SORTEDARRAY+sizeof(SORTEDARRAY)/sizeof(int));
string u="const int UNSORTEDARRAY[]={";
string s="const int SORTEDARRAY[]={";
for(int i=0;i<sizeof(UNSORTEDARRAY)/sizeof(int);i++){
    u+=to_string(UNSORTEDARRAY[i])+",";
    s+=to_string(SORTEDARRAY[i])+",";
}
u.erase(u.end()-1);
s.erase(s.end()-1);
u+="};\n";
s+="};\n";
ofstream out("number.h");
string code=u+s;
out << code;
out.close();

so to test, just count if the value is == RAND_MAX/2 like this:

#include "number.h"
int main(){
int count;
    clock_t start = clock();
    for(int i=0;i<sizeof(SORTEDARRAY)/sizeof(int);i++){
        if(SORTEDARRAY[i]==RAND_MAX/2){
            count++;
        }
    }
    printf("%f\n",(float)(clock()-start)/CLOCKS_PER_SEC);
}

run 3 times:

UNSORTEDARRAY

0.005376
0.005239
0.005220

SORTEDARRAY

0.005334
0.005120
0.005223

it seems a small performance difference, so I didn't believe it and then tried to change "SORTEDARRAY[i]==RAND_MAX/2" to "SORTEDARRAY[i]>RAND_MAX/2" to see if it made a difference:

UNSORTEDARRAY

0.008407
0.008363
0.008606

SORTEDARRAY

0.005306
0.005227
0.005146

this time there is a great difference.

Is "==" in sorted array not faster than unsorted array? If yes, why is ">" in sorted array faster than an unsorted array but "==" is not?

share|improve this question

marked as duplicate by imallett, Seb, Nik Bougalis, bytecode77, Rob Aug 18 '15 at 6:11

This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.

5  
Related to one of the most upvoted questions of all times: stackoverflow.com/questions/11227809/… – Andrzej Pronobis Aug 18 '15 at 3:59
    
"I believe "processing" a sorted array should be faster than unsorted array": try to answer yourself why you think that's true for this algorithm. That is - what kind of work and how much of it you do for each case. You may realise what the answer is. – viraptor Aug 18 '15 at 4:02
    
string is not a standard type in C, and using the += operator with one operand of type string and the other char * makes no sense. Are you sure this isn't C++ code? – Seb Aug 18 '15 at 5:29
    
Additionally, what are you using to time this code? Something very inaccurate, and probably biased. This kind of question is usually written by misinformed people Do you even have full optimisation enabled? Do you have an actual realistic problem to solve, and a program to solve that problem? Are you using a profiler on that program to determine what the significant bottlenecks are? The reason I ask is, in any realistic scenario the bottlenecks will vary considerably from what you have described. This question has no practical use. – Seb Aug 18 '15 at 5:35
    
We don't even know what compiler you're using! Perhaps the bias you've introduced in your full mcve (which we also don't have) varies depending on the compiler you use, and the system you're programming for? This question is too broad, and so I'm voting to close it. – Seb Aug 18 '15 at 5:37

One thing that immediately comes to mind is CPU's branch prediction algorithm.

In case of > comparison, in sorted array the branching behavior is very consistent: first, the if condition is consistently false, then it is consistently true. This aligns very well with even the simplest branch prediction.

In unsorted array the result of > condition is essentially random, thus thwarting any branch prediction.

This is what makes the sorted version faster.

In case of == comparison, most of the time the condition is false and only very rarely it is true. This works well with branch prediction regardless of whether the array is sorted or not. The timings are essentially the same.

share|improve this answer

N.B. I'm making this an answer since it's too long for a comment.

The effect here is exactly what is already explained in great detail in the plentiful answers to this question. Processing a sorted array was faster in that case because of branch prediction.

Here, the culprit is again branch prediction. The == test is very seldom true, so branch prediction works about the same for both. When you changed it to >, then you got the behavior explained in that question, and for the same reason.


Moral:

I believe "processing" a sorted array should be faster than [an ]unsorted array.

You need to know why. This isn't some magical rule, and it isn't always true.

share|improve this answer

The comparison == has less of a relation to ordering than > does. Correctly or incorrectly predicting == would only depend on the number of duplicate values and their distribution.

You can use perf stat to view the performance counters...

jason@io /tmp $ lz4 -d ints | perf stat ./proc-eq >/dev/null
Successfully decoded 104824717 bytes

 Performance counter stats for './proc-eq':

       5226.932577      task-clock (msec)         #    0.953 CPUs utilized
                31      context-switches          #    0.006 K/sec
                24      cpu-migrations            #    0.005 K/sec
             3,479      page-faults               #    0.666 K/sec
    15,763,486,767      cycles                    #    3.016 GHz
     4,238,973,549      stalled-cycles-frontend   #   26.89% frontend cycles idle
   <not supported>      stalled-cycles-backend
    31,522,072,416      instructions              #    2.00  insns per cycle
                                                  #    0.13  stalled cycles per insn
     8,515,545,178      branches                  # 1629.167 M/sec
        10,261,743      branch-misses             #    0.12% of all branches

       5.483071045 seconds time elapsed

jason@io /tmp $ lz4 -d ints | sort -n | perf stat ./proc-eq >/dev/null
Successfully decoded 104824717 bytes

 Performance counter stats for './proc-eq':

       5536.031410      task-clock (msec)         #    0.348 CPUs utilized
               198      context-switches          #    0.036 K/sec
                21      cpu-migrations            #    0.004 K/sec
             3,604      page-faults               #    0.651 K/sec
    16,870,541,124      cycles                    #    3.047 GHz
     5,300,218,855      stalled-cycles-frontend   #   31.42% frontend cycles idle
   <not supported>      stalled-cycles-backend
    31,526,006,118      instructions              #    1.87  insns per cycle
                                                  #    0.17  stalled cycles per insn
     8,516,336,829      branches                  # 1538.347 M/sec
        10,980,571      branch-misses             #    0.13% of all branches

jason@io /tmp $ lz4 -d ints | perf stat ./proc-gt >/dev/null
Successfully decoded 104824717 bytes

 Performance counter stats for './proc-gt':

       5293.065703      task-clock (msec)         #    0.957 CPUs utilized
                38      context-switches          #    0.007 K/sec
                50      cpu-migrations            #    0.009 K/sec
             3,466      page-faults               #    0.655 K/sec
    15,972,451,322      cycles                    #    3.018 GHz
     4,350,726,606      stalled-cycles-frontend   #   27.24% frontend cycles idle
   <not supported>      stalled-cycles-backend
    31,537,365,299      instructions              #    1.97  insns per cycle
                                                  #    0.14  stalled cycles per insn
     8,515,606,640      branches                  # 1608.823 M/sec
        15,241,198      branch-misses             #    0.18% of all branches

       5.532285374 seconds time elapsed

jason@io /tmp $ lz4 -d ints | sort -n | perf stat ./proc-gt >/dev/null

      15.930144154 seconds time elapsed

 Performance counter stats for './proc-gt':

       5203.873321      task-clock (msec)         #    0.339 CPUs utilized
                 7      context-switches          #    0.001 K/sec
                22      cpu-migrations            #    0.004 K/sec
             3,459      page-faults               #    0.665 K/sec
    15,830,273,846      cycles                    #    3.042 GHz
     4,456,369,958      stalled-cycles-frontend   #   28.15% frontend cycles idle
   <not supported>      stalled-cycles-backend
    31,540,409,224      instructions              #    1.99  insns per cycle
                                                  #    0.14  stalled cycles per insn
     8,516,186,042      branches                  # 1636.509 M/sec
        10,205,058      branch-misses             #    0.12% of all branches

      15.365528326 seconds time elapsed
share|improve this answer

Not the answer you're looking for? Browse other questions tagged or ask your own question.