13 events
when | toggle format | what | by | license | comment
Apr 22, 2022 at 13:20 comment added user2297550 For any users reading this, the above three and counting comments add nothing material beyond the first four comments on this answer, and constitute persistent trolling by a user whose numerous earlier comments on this same answer were thankfully deleted by mods. This is the only answer with an O(n) solution that works in general on any machine architecture and for any computation not just summation. Apparently, this simple fact cannot be digested or comprehended despite Herculean earlier attempts.
Apr 22, 2022 at 12:29 comment added Peter Cordes Commenting out the sorting entirely keeps the whole thing O(n). That's the appropriate comparison baseline for an O(n) approach that spends time preprocessing the array to make one pass of a branchy loop faster, of course not an O(n log n) sort.
Apr 22, 2022 at 8:42 comment added user2297550 For any users reading this, the above two comments add nothing material beyond the first four, and constitute persistent trolling by a user whose numerous earlier comments on this same answer were thankfully deleted by mods. This is the only answer with an O(n) solution that works in general on any machine architecture and for any computation not just summation. Apparently, this simple fact cannot be digested or comprehended despite Herculean earlier attempts.
Apr 22, 2022 at 1:47 comment added Peter Cordes For one pass of the inner loop over the array, it would be even faster to just do it, not to sort first. Even partial sorting is only worth it if you can amortize the benefit over multiple passes over the array. That is O(n), but still costs significant time to do comparison-based sorting even in small chunks, branching on each element multiple times. As this answer says, "The assumption by other answers that one needs to sort the data is not correct" -- you don't need to do any sorting at all if you care about overall time taken and can't do the sorting outside the timed region.
Apr 22, 2022 at 1:45 comment added Peter Cordes Just for the record, since our previous comments have been nuked, I don't think there's anything to gain in overall performance by spending time (partially) sorting, unless you're artificially repeating the loop over the array like in this microbenchmark. Then yes, this piecewise sort gets close to the benefit of a full sort (e.g. 2.4s for this vs. 1.7s for a full sort on Skylake, vs. 10.9s for no sort before doing 100k passes, if you use g++ -Os -Wa,-mbranches-within-32B-boundaries to make branchy asm in the first place instead of a normal build).
Jun 19, 2021 at 17:50 comment added user2297550 Also note that specialized hardware instructions do not help if the computation within the if is more complicated than a simple addition, which is quite likely in the general case. Therefore, this answer is unique in offering a general solution that is still O(n).
Apr 3, 2020 at 6:44 comment added user2297550 Both comments above seem to ignore the general algorithmic issues and complexity, in favor of advocating specialized hardware with special machine instructions. I find the first one particularly petty in that it blithely dismisses the important general insights in this answer in blind favor of specialized machine instructions.
Dec 13, 2019 at 14:44 comment added Peter Cordes If you just want to implement the algorithm efficiently over unsorted data, you would do that operation branchlessly (and with SIMD, e.g. with x86 pcmpgtb to find elements with their high bit set, then AND to zero smaller elements). Spending any time actually sorting chunks would be slower. A branchless version would have data-independent performance, also proving that the cost came from branch misprediction. Or just use performance counters to observe that directly, like Skylake's int_misc.clear_resteer_cycles or int_misc.recovery_cycles to count front-end idle cycles from mispredicts.
Jul 2, 2019 at 4:08 history edited user2297550 CC BY-SA 4.0 (added 57 characters in body)
Feb 28, 2019 at 15:24 history edited user2297550 CC BY-SA 4.0 (edited body)
Feb 28, 2019 at 12:18 comment added Luke Hutchison I don't really see how this proves anything? The only thing you have shown is that "not doing all the work of sorting the whole array takes less time than sorting the whole array". Your claim that this "also runs fastest" is very architecture-dependent. See my answer about how this works on ARM. P.S. You could make your code faster on non-ARM architectures by putting the summation inside the 200-element block loop, sorting in reverse, and then using Yochai Timmer's suggestion of breaking once you get an out-of-range value. That way each 200-element block summation can be terminated early.
Feb 18, 2019 at 16:03 history edited user2297550 CC BY-SA 4.0 (deleted 150 characters in body)
Dec 9, 2018 at 6:18 history answered user2297550 CC BY-SA 4.0