Questions tagged [cpu-architecture]
The hardware microarchitecture (x86, x86_64, ARM, ...) of a CPU or microcontroller.
860
questions
27243
votes
25
answers
1.9m
views
Why is processing a sorted array faster than processing an unsorted array?
In this C++ code, sorting the data (before the timed region) makes the primary loop ~6x faster:
#include <algorithm>
#include <ctime>
#include <iostream>
int main()
{
// ...
98
votes
3
answers
15k
views
Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?
LOOP (Intel ref manual entry)
decrements ecx / rcx, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? dec/jnz already macro-fuses into a single uop on Sandybridge-...
63
votes
4
answers
8k
views
Micro fusion and addressing modes
I have found something unexpected (to me) using the Intel® Architecture Code Analyzer (IACA).
The following instruction using [base+index] addressing
addps xmm1, xmmword ptr [rsi+rax*1]
does not ...
20
votes
1
answer
2k
views
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
I want to be able to predict, by hand, exactly how long arbitrary arithmetical (i.e. no branching or memory, though that would be nice too) x86-64 assembly code will take given a particular ...
36
votes
3
answers
9k
views
Why doesn't GCC use partial registers?
Disassembling write(1,"hi",3) on linux, built with gcc -s -nostdlib -nostartfiles -O3 results in:
ba03000000 mov edx, 3 ; thanks for the correction jester!
bf01000000 mov edi, 1
31c0 ...
47
votes
2
answers
8k
views
Can x86's MOV really be "free"? Why can't I reproduce this at all?
I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming.
For the life of me, I can't verify this in a single test case. Every test case I try debunks ...
268
votes
3
answers
54k
views
How much of ‘What Every Programmer Should Know About Memory’ is still valid?
I am wondering how much of Ulrich Drepper's What Every Programmer Should Know About Memory from 2007 is still valid. Also I could not find a newer version than 1.0 or an errata.
(Also in PDF form on ...
105
votes
6
answers
33k
views
Enhanced REP MOVSB for memcpy
I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy.
ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and ...
53
votes
2
answers
4k
views
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
This loop runs at one iteration per 3 cycles on Intel Conroe/Merom, bottlenecked on imul throughput as expected. But on Haswell/Skylake, it runs at one iteration per 11 cycles, apparently because ...
77
votes
5
answers
81k
views
How many CPU cycles are needed for each assembly instruction?
I heard there is an Intel book online that describes the CPU cycles needed for a specific assembly instruction, but I cannot find it (after trying hard). Could anyone show me how to find CPU cycle ...
11
votes
1
answer
3k
views
Can a speculatively executed CPU branch contain opcodes that access RAM?
As I understand, when a CPU speculatively executes a piece of code, it "backs up" the register state before switching to the speculative branch, so that if the prediction turns out wrong (...
17
votes
2
answers
3k
views
Are loads and stores the only instructions that gets reordered?
I have read many articles on memory ordering, and all of them only say that a CPU reorders loads and stores.
Does a CPU (I'm specifically interested in an x86 CPU) only reorder loads and stores, and ...
41
votes
4
answers
9k
views
Observing stale instruction fetching on x86 with self-modifying code
I've been told and have read from Intel's manuals that it is possible to write instructions to memory, but the instruction prefetch queue has already fetched the stale instructions and will execute ...
19
votes
2
answers
5k
views
Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?
We've got a simple memory throughput benchmark. All it does is memcpy repeatedly for a large block of memory.
Looking at the results (compiled for 64-bit) on a few different machines, Skylake ...
9
votes
1
answer
2k
views
Adding a redundant assignment speeds up code when compiled without optimization
I found an interesting phenomenon:
#include<stdio.h>
#include<time.h>
int main() {
int p, q;
clock_t s,e;
s=clock();
for(int i = 1; i < 1000; i++){
for(int j = ...
19
votes
1
answer
9k
views
Which cache mapping technique is used in intel core i7 processor?
I have learned about different cache mapping techniques like direct mapping and fully associative or set associative mapping, and the trade-offs between those. (Wikipedia)
But I am curious which one ...
35
votes
3
answers
5k
views
Is performance reduced when executing loops whose uop count is not a multiple of processor width?
I'm wondering how loops of various sizes perform on recent x86 processors, as a function of number of uops.
Here's a quote from Peter Cordes who raised the issue of non-multiple-of-4 counts in ...
52
votes
1
answer
22k
views
gcc optimization flag -O3 makes code slower than -O2
I found the topic "Why is it faster to process a sorted array than an unsorted array?" and tried to run its code, and I found strange behavior. If I compile this code with the -O3 optimization flag it takes ...
240
votes
4
answers
107k
views
What is the purpose of the "Prefer 32-bit" setting in Visual Studio and how does it actually work?
It is unclear to me how the compiler will automatically know to compile for 64-bit when it needs to. How does it know when it can confidently target 32-bit?
I am mainly curious about how the compiler ...
42
votes
7
answers
33k
views
Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
700
votes
4
answers
91k
views
How do I achieve the theoretical maximum of 4 FLOPs per cycle?
How can the theoretical peak performance of 4 floating point operations (double precision) per cycle be achieved on a modern x86-64 Intel CPU?
As far as I understand it takes three cycles for an SSE ...
98
votes
9
answers
32k
views
System where 1 byte != 8 bit? [duplicate]
All the time I read sentences like
don't rely on 1 byte being 8 bit in size
use CHAR_BIT instead of 8 as a constant to convert between bits and bytes
et cetera. What real life systems are there ...
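The advice quoted in this excerpt (use CHAR_BIT rather than a literal 8) can be sketched in C roughly as follows. The helper names `bytes_to_bits` and `bits_to_bytes` are hypothetical and not taken from the question; only `CHAR_BIT` and `SIZE_MAX` come from the standard headers.

```c
#include <assert.h>
#include <limits.h>   /* CHAR_BIT: number of bits in a byte on this platform */
#include <stddef.h>   /* size_t */
#include <stdint.h>   /* SIZE_MAX */

/* Hypothetical helper: number of bits occupied by nbytes bytes. */
static size_t bytes_to_bits(size_t nbytes)
{
    /* Guard against overflow of the multiplication below. */
    assert(nbytes <= SIZE_MAX / CHAR_BIT);
    return nbytes * CHAR_BIT;
}

/* Hypothetical helper: bytes needed to hold nbits bits, rounded up. */
static size_t bits_to_bytes(size_t nbits)
{
    return nbits / CHAR_BIT + (nbits % CHAR_BIT != 0);
}
```

On a platform where a byte is 8 bits these behave exactly like multiplying or dividing by 8; where CHAR_BIT differs, the same source should still give the right counts.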
17
votes
2
answers
2k
views
Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths
I was playing with the code in this answer, slightly modifying it:
BITS 64
GLOBAL _start
SECTION .text
_start:
mov ecx, 1000000
.loop:
;T is a symbol defined with the CLI (-DT=...)
TIMES T ...
12
votes
2
answers
1k
views
Are there any modern CPUs where a cached byte store is actually slower than a word store?
It's a common claim that a byte store into cache may result in an internal read-modify-write cycle, or otherwise hurt throughput or latency vs. storing a full register.
But I've never seen any ...
8
votes
3
answers
2k
views
Globally Invisible load instructions
Can some load instructions never become globally visible due to store-to-load forwarding? To put it another way, if a load instruction gets its value from the store buffer, it never has to read from ...
6
votes
2
answers
5k
views
Why isn't movl from memory to memory allowed?
I was wondering if this is allowed in assembly,
movl (%edx) (%eax)
I would have guessed that it accesses the memory in the first operand and puts it in
the memory of the second operand, something like ...
55
votes
3
answers
7k
views
How are x86 uops scheduled, exactly?
Modern x86 CPUs break down the incoming instruction stream into micro-operations (uops) and then schedule these uops out-of-order as their inputs become ready. While the basic idea is clear, I'd like ...
42
votes
2
answers
4k
views
Why does breaking the "output dependency" of LZCNT matter?
While benchmarking something I measured a much lower throughput than I had calculated, which I narrowed down to the LZCNT instruction (it also happens with TZCNT), as demonstrated in the following ...
23
votes
1
answer
5k
views
Does lock xchg have the same behavior as mfence?
What I'm wondering is if lock xchg will have similar behavior to mfence from the perspective of one thread accessing a memory location that is being mutated (let's just say at random) by other threads. ...
12
votes
1
answer
3k
views
What exactly happens when a skylake CPU mispredicts a branch?
I'm trying to understand in detail what happens to instructions in the various stages of the skylake CPU pipeline when a branch is mis-predicted, and how quickly instructions from the correct branch ...
20
votes
2
answers
3k
views
What is the stack engine in the Sandybridge microarchitecture?
I am reading http://www.realworldtech.com/sandy-bridge/ and I'm facing some problems in understanding some issues:
The dedicated stack pointer tracker is also present in Sandy Bridge
and renames ...
17
votes
2
answers
2k
views
Problems with ADC/SBB and INC/DEC in tight loops on some CPUs
I am writing a simple BigInteger type in Delphi. It mainly consists of a dynamic array of TLimb, where a TLimb is a 32 bit unsigned integer, and a 32 bit size field, which also holds the sign bit for ...
33
votes
2
answers
18k
views
When an interrupt occurs, what happens to instructions in the pipeline?
Assume a 5-stage pipeline architecture (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). There are 4 instructions that have to be executed.
...
32
votes
1
answer
8k
views
What happens after a L2 TLB miss?
I'm struggling to understand what happens when the first two levels of the Translation Lookaside Buffer result in misses.
I am unsure whether "page walking" occurs in special hardware circuitry, or ...
8
votes
2
answers
2k
views
How does MIPS I handle branching on the previous ALU instruction without stalling?
addiu $6,$6,5
bltz $6,$L5
nop
...
$L5:
How is this safe without stalling, which classic MIPS couldn't even do, except on cache miss?
(MIPS originally stood for ...
60
votes
2
answers
70k
views
FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2
I'm confused on how many flops per cycle per core can be done with Sandy-Bridge and Haswell.
As I understand it with SSE it should be 4 flops per cycle per core for SSE and 8 flops per cycle per core ...
138
votes
6
answers
127k
views
What is the "FS"/"GS" register intended for?
So I know what the following registers and their uses are supposed to be:
CS = Code Segment (used for IP)
DS = Data Segment (used for MOV)
ES = Destination Segment (used for MOVS, etc.)
SS = Stack ...
18
votes
4
answers
5k
views
What setup does REP do?
Quoting Intel® 64 and IA-32 architectures optimization reference manual, §2.4.6 "REP String Enhancement":
The performance characteristics of using REP string can be attributed to two components:
...
14
votes
2
answers
2k
views
What is a Partial Flag Stall?
I was just going over this answer by Peter Cordes and he says,
Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be ...
13
votes
4
answers
4k
views
How does memory reordering help processors and compilers?
I studied the Java memory model and saw re-ordering problems. A simple example:
boolean first = false;
boolean second = false;
void setValues() {
first = true;
second = true;
}
void ...
12
votes
1
answer
4k
views
Slow jmp-instruction
As a follow-up to my question The advantages of using 32bit registers/instructions in x86-64, I started to measure the costs of instructions. I'm aware that this has been done multiple times (e.g. ...
348
votes
4
answers
50k
views
Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs
I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions:
Your ...
129
votes
11
answers
140k
views
Floating point vs integer calculations on modern hardware
I am doing some performance-critical work in C++, and we are currently using integer calculations for problems that are inherently floating point because "it's faster". This causes a whole ...
37
votes
2
answers
22k
views
Atomicity of loads and stores on x86
8.1.2 Bus Locking
Intel 64 and IA-32 processors provide a LOCK# signal that is asserted
automatically during certain critical memory operations to lock the
system bus or equivalent link. While this ...
9
votes
2
answers
10k
views
why we can't move a 64-bit immediate value to memory?
First, I am a little bit confused about the differences between movq and movabsq. My textbook says:
The regular movq instruction can only have immediate source operands that can be represented as 32-...
4
votes
2
answers
2k
views
Latency bounds and throughput bounds for processors for operations that must occur in sequence
My textbook (Computer Systems: A programmer's perspective) states that a latency bound is encountered when a series of operations must be performed in strict sequence, while a throughput bound ...
13
votes
3
answers
3k
views
Avoid stalling pipeline by calculating conditional early
When talking about the performance of ifs, we usually talk about how mispredictions can stall the pipeline. The recommended solutions I see are:
Trust the branch predictor for conditions that usually ...
229
votes
5
answers
131k
views
How do cache lines work?
I understand that the processor brings data into the cache via cache lines, which - for instance, on my Atom processor - brings in about 64 bytes at a time, whatever the size of the actual data being ...
63
votes
10
answers
117k
views
Maximum memory which malloc can allocate
I was trying to figure out the maximum amount of memory I can malloc on my machine
(1 GB RAM, 160 GB HD, Windows platform).
I read that the maximum memory malloc can allocate is limited to ...
10
votes
2
answers
3k
views
Is LFENCE serializing on AMD processors?
In recent Intel ISA documents the lfence instruction has been defined as serializing the instruction stream (preventing out-of-order execution across it). In particular, the description of the ...