Timeline for Why is processing a sorted array faster than processing an unsorted array?
Current License: CC BY-SA 4.0
13 events
| when | what | by | license | comment |
|---|---|---|---|---|
| Apr 22, 2022 at 13:20 | comment added | user2297550 | | For any users reading this, the above three (and counting) comments add nothing material beyond the first four comments on this answer, and constitute persistent trolling by a user whose numerous earlier comments on this same answer were thankfully deleted by mods. This is the only answer with an O(n) solution that works in general on any machine architecture and for any computation, not just summation. Apparently, this simple fact cannot be digested or comprehended despite Herculean earlier attempts. |
| Apr 22, 2022 at 12:29 | comment added | Peter Cordes | | Commenting out the sorting entirely keeps the whole thing O(n). That's the appropriate comparison baseline for an O(n) approach that spends time preprocessing the array to make one pass of a branchy loop faster, of course, not an O(n log n) sort. |
| Apr 22, 2022 at 8:42 | comment added | user2297550 | | For any users reading this, the above two comments add nothing material beyond the first four, and constitute persistent trolling by a user whose numerous earlier comments on this same answer were thankfully deleted by mods. This is the only answer with an O(n) solution that works in general on any machine architecture and for any computation, not just summation. Apparently, this simple fact cannot be digested or comprehended despite Herculean earlier attempts. |
| Apr 22, 2022 at 1:47 | comment added | Peter Cordes | | For one pass of the inner loop over the array, it would be even faster to just do it, not to sort first. Even partial sorting is only worth it if you can amortize the benefit over multiple passes over the array. That is O(n), but still costs significant time to do comparison-based sorting even in small chunks, branching on each element multiple times. As this says, "The assumption by other answers that one needs to sort the data is not correct" - you don't need to do any sorting at all if you care about overall time taken, without being able to do sorting outside the timed region. |
| Apr 22, 2022 at 1:45 | comment added | Peter Cordes | | Just for the record, since our previous comments have been nuked, I don't think there's anything to gain in overall performance by spending time (partially) sorting, unless you're artificially repeating the loop over the array like in this microbenchmark. Then yes, this piecewise sort gets close to the benefit of a full sort (e.g. 2.4s for this vs. 1.7s for a full sort on Skylake, vs. 10.9s for no sort, before doing 100k passes, if you use `g++ -Os -Wa,-mbranches-within-32B-boundaries` to make branchy asm in the first place instead of a normal build). |
| Jun 19, 2021 at 17:50 | comment added | user2297550 | | Also note that specialized hardware instructions do not help if the computation within the `if` is more complicated than a simple addition, which is quite likely in the general case. Therefore, this answer is unique in offering a general solution that is still O(n). |
| Apr 3, 2020 at 6:44 | comment added | user2297550 | | Both comments above seem to ignore the general algorithmic issues and complexity, in favor of advocating specialized hardware with special machine instructions. I find the first one particularly petty in that it blithely dismisses the important general insights in this answer in blind favor of specialized machine instructions. |
| Dec 13, 2019 at 14:44 | comment added | Peter Cordes | | If you just want to implement the algorithm efficiently over unsorted data, you would do that operation branchlessly (and with SIMD, e.g. with x86 `pcmpgtb` to find elements with their high bit set, then AND to zero smaller elements). Spending any time actually sorting chunks would be slower. A branchless version would have data-independent performance, also proving that the cost came from branch misprediction. Or just use performance counters to observe that directly, like Skylake `int_misc.clear_resteer_cycles` or `int_misc.recovery_cycles` to count front-end idle cycles from mispredicts. |
| Jul 2, 2019 at 4:08 | history edited | user2297550 | CC BY-SA 4.0 | added 57 characters in body |
| Feb 28, 2019 at 15:24 | history edited | user2297550 | CC BY-SA 4.0 | edited body |
| Feb 28, 2019 at 12:18 | comment added | Luke Hutchison | | I don't really see how this proves anything? The only thing you have shown is that "not doing all the work of sorting the whole array takes less time than sorting the whole array". Your claim that this "also runs fastest" is very architecture-dependent. See my answer about how this works on ARM. P.S. You could make your code faster on non-ARM architectures by putting the summation inside the 200-element block loop, sorting in reverse, and then using Yochai Timmer's suggestion of breaking once you get an out-of-range value. That way each 200-element block summation can be terminated early. |
| Feb 18, 2019 at 16:03 | history edited | user2297550 | CC BY-SA 4.0 | deleted 150 characters in body |
| Dec 9, 2018 at 6:18 | history answered | user2297550 | CC BY-SA 4.0 | |