Revisions to Why is processing a sorted array faster than processing an unsorted array?

added 20 characters in body

Source Link

edited Oct 20, 2020 at 22:33

6.9k
2
36
31

MOV R0, #0     // R0 = sum = 0
MOV R1, #0     // R1 = c = 0
ADR R2, data   // R2 = addr of data array (put this instruction outside outer loop)
.inner_loop    // Inner loop branch label
    LDRB R3, [R2, R1]     // R3 = data[c]
    CMP R3, #128          // compare R3 to 128
    ADDGE R0, R0, R3      // if R3 >= 128, then sum += data[c] -- no branch needed!
    ADD R1, R1, #1        // c++
    CMP R1, #arraySize    // compare c to arraySize
    BLT inner_loop        // Branch to inner_loop if c < arraySize

When you don't have to branch, you can avoid the time cost of flushing the pipeline for what would otherwise be short branches, and you can avoid the design complexity of many forms of speculative evalution. The performance impact of the initial naive imlementations of the mitigations for many recently discovered processor vulnerabilities (Spectre etc.) shows you just how much the performance of modern processors depends upon complex speculative evaluation logic. With a short pipeline and the dramatically reduced need for branching, ARM just doesn't need to rely on speculative evaluation as much as CISC processors. (Of course high-end ARM implementations do include speculative evaluation, but it's a smaller part of the performance story.)

If you have ever wondered why ARM has been so phenomenally successful, the brilliant effectiveness and interplay of these two mechanisms (combined with another mechanism that lets you "barrel shift" left or right one of the two arguments of any arithmetic operator or offset memory access operator at zero additional cost) are a big part of the story, because they are onesome of the greatest sources of the ARM architecture's efficiency. The brilliance of the original designers of the ARM ISA back in 1983, Steve Furber and Roger (now Sophie) Wilson, cannot be overstated.

MOV R0, #0     // R0 = sum = 0
MOV R1, #0     // R1 = c = 0
ADR R2, data   // R2 = addr of data array (put this instruction outside outer loop)
.inner_loop    // Inner loop branch label
    LDRB R3, [R2, R1]     // R3 = data[c]
    CMP R3, #128          // compare R3 to 128
    ADDGE R0, R0, R3      // if R3 >= 128, then sum += data[c] -- no branch needed!
    ADD R1, R1, #1        // c++
    CMP R1, #arraySize    // compare c to arraySize
    BLT inner_loop        // Branch to inner_loop if c < arraySize

If you have ever wondered why ARM has been so phenomenally successful, the brilliant effectiveness and interplay of these two mechanisms are a big part of the story, because they are one of the greatest sources of the ARM architecture's efficiency. The brilliance of the original designers of the ARM ISA back in 1983, Steve Furber and Roger (now Sophie) Wilson, cannot be overstated.

MOV R0, #0   // R0 = sum = 0
MOV R1, #0   // R1 = c = 0
ADR R2, data // R2 = addr of data array (put this instruction outside outer loop)
.inner_loop  // Inner loop branch label
    LDRB R3, [R2, R1]   // R3 = data[c]
    CMP R3, #128        // compare R3 to 128
    ADDGE R0, R0, R3    // if R3 >= 128, then sum += data[c] -- no branch needed!
    ADD R1, R1, #1      // c++
    CMP R1, #arraySize  // compare c to arraySize
    BLT inner_loop      // Branch to inner_loop if c < arraySize

When you don't have to branch, you can avoid the time cost of flushing the pipeline for what would otherwise be short branches, and you can avoid the design complexity of many forms of speculative evalution. The performance impact of the initial naive imlementations of the mitigations for many recently discovered processor vulnerabilities (Spectre etc.) shows you just how much the performance of modern processors depends upon complex speculative evaluation logic. With a short pipeline and the dramatically reduced need for branching, ARM just doesn't need to rely on speculative evaluation as much as CISC processors. (Of course high-end ARM implementations do include speculative evaluation, but it's a smaller part of the performance story.)

If you have ever wondered why ARM has been so phenomenally successful, the brilliant effectiveness and interplay of these two mechanisms (combined with another mechanism that lets you "barrel shift" left or right one of the two arguments of any arithmetic operator or offset memory access operator at zero additional cost) are a big part of the story, because they are some of the greatest sources of the ARM architecture's efficiency. The brilliance of the original designers of the ARM ISA back in 1983, Steve Furber and Roger (now Sophie) Wilson, cannot be overstated.

added 20 characters in body

Source Link

edited Oct 20, 2020 at 22:26

Luke Hutchison

6.9k
2
36
31

CMP opcodes always update the status bits in the Processor Status Register (PSR), because that is their purpose, but most other instructions do not touch the PSR unless you add an optional S suffix to the instruction, specifying that the PSR should be updated based on the result of the instruction. Just like the 4-bit condition suffix, being able to execute instructions without affecting the PSR is a mechanism that reduces the need for branches on ARM, and also facilitates out of order dispatch at the hardware level, because after performing some operation X that updates the status bits, subsequently (or in parallel) you can do a bunch of other work that explicitly should not affect (or be affected by) the status bits, then you can test the state of the status bits set earlier by X.

CMP opcodes always update the status bits in the Processor Status Register (PSR), because that is their purpose, but most other instructions do not touch the PSR unless you add an optional S suffix to the instruction, specifying that the PSR should be updated based on the result of the instruction. Just like the 4-bit condition suffix, being able to execute instructions without affecting the PSR is a mechanism that reduces the need for branches on ARM, and also facilitates out of order dispatch at the hardware level, because after performing some operation X that updates the status bits, subsequently (or in parallel) you can do a bunch of other work that explicitly should not affect the status bits, then you can test the state of the status bits set earlier by X.

CMP opcodes always update the status bits in the Processor Status Register (PSR), because that is their purpose, but most other instructions do not touch the PSR unless you add an optional S suffix to the instruction, specifying that the PSR should be updated based on the result of the instruction. Just like the 4-bit condition suffix, being able to execute instructions without affecting the PSR is a mechanism that reduces the need for branches on ARM, and also facilitates out of order dispatch at the hardware level, because after performing some operation X that updates the status bits, subsequently (or in parallel) you can do a bunch of other work that explicitly should not affect (or be affected by) the status bits, then you can test the state of the status bits set earlier by X.

added 206 characters in body

Source Link

edited Feb 21, 2020 at 12:04

Luke Hutchison

6.9k
2
36
31

This ability to mix and match branch-free condition testing on any instruction with the option to either update or not update the status bits after any instruction is extremely powerful, for both assembly language programmers and compilers. Most processorsprocessor architectures do not have this ability to specify whether or not the status bits should be updated for a given operation, which can necessitate writing additional code to save and later restore status bits, or may require additional branches, or may limit the processor's out of order execution efficiency: one of the side effects of most CPU instruction set architectures forcibly updating status bits after most instructions is that it is much harder to tease apart which instructions can be run in parallel without interfering with each other. Updating status bits has side effects, therefore has a linearizing effect on code.

If you have ever wondered why ARM has been so phenomenally successful, the brilliant effectivenessARM's ability to mix and interplay ofmatch branch-free condition testing on any instruction with the two mechanisms described above are a big part ofoption to either update or not update the storystatus bits after any instruction is extremely powerful, because they are one of the greatest sources of the ARM architecture's efficiencyfor both assembly language programmers and compilers, and produces very efficient code.

If you have ever wondered why ARM has been so phenomenally successful, the brilliant effectiveness and interplay of these two mechanisms are a big part of the story, because they are one of the greatest sources of the ARM architecture's efficiency. The brilliance of the original designers of the ARM ISA back in 1983, Steve Furber and Roger (now Sophie) Wilson, cannot be overstated.

This ability to mix and match branch-free condition testing on any instruction with the option to either update or not update the status bits after any instruction is extremely powerful, for both assembly language programmers and compilers. Most processors do not have this ability to specify whether or not the status bits should be updated for a given operation, which can necessitate writing additional code to save and later restore status bits, or may require additional branches, or may limit the processor's out of order execution efficiency: one of the side effects of most CPU instruction set architectures forcibly updating status bits after most instructions is that it is much harder to tease apart which instructions can be run in parallel without interfering with each other. Updating status bits has side effects, therefore has a linearizing effect on code.

If you have ever wondered why ARM has been so phenomenally successful, the brilliant effectiveness and interplay of the two mechanisms described above are a big part of the story, because they are one of the greatest sources of the ARM architecture's efficiency.

Most processor architectures do not have this ability to specify whether or not the status bits should be updated for a given operation, which can necessitate writing additional code to save and later restore status bits, or may require additional branches, or may limit the processor's out of order execution efficiency: one of the side effects of most CPU instruction set architectures forcibly updating status bits after most instructions is that it is much harder to tease apart which instructions can be run in parallel without interfering with each other. Updating status bits has side effects, therefore has a linearizing effect on code. ARM's ability to mix and match branch-free condition testing on any instruction with the option to either update or not update the status bits after any instruction is extremely powerful, for both assembly language programmers and compilers, and produces very efficient code.

If you have ever wondered why ARM has been so phenomenally successful, the brilliant effectiveness and interplay of these two mechanisms are a big part of the story, because they are one of the greatest sources of the ARM architecture's efficiency. The brilliance of the original designers of the ARM ISA back in 1983, Steve Furber and Roger (now Sophie) Wilson, cannot be overstated.

added 206 characters in body

Source Link

edited Feb 21, 2020 at 11:58

Luke Hutchison

6.9k
2
36
31

Loading

added 206 characters in body

Source Link

edited Feb 21, 2020 at 11:47

Luke Hutchison

6.9k
2
36
31

Loading

added 206 characters in body

Source Link

edited Feb 21, 2020 at 11:09

Luke Hutchison

6.9k
2
36
31

Loading

Rearrange for clarity, and prevent fake comments

Source Link

edited May 14, 2018 at 14:01

jpaugh

6.1k
4
35
86

Loading

Source Link

created Dec 22, 2017 at 13:13

Luke Hutchison

6.9k
2
36
31

Loading

current community

your communities

more stack exchange communities

Return to Answer