Questions tagged [cpu-architecture]

The hardware microarchitecture (x86, x86_64, ARM, ...) of a CPU or microcontroller.

27243 votes
25 answers
1.9m views

Why is processing a sorted array faster than processing an unsorted array?

In this C++ code, sorting the data (before the timed region) makes the primary loop ~6x faster: #include <algorithm> #include <ctime> #include <iostream> int main() { // ...
GManNickG's user avatar
  • 499k
98 votes
3 answers
15k views

Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?

LOOP (Intel ref manual entry) decrements ecx / rcx, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? dec/jnz already macro-fuses into a single uop on Sandybridge-...
Peter Cordes's user avatar
63 votes
4 answers
8k views

Micro fusion and addressing modes

I have found something unexpected (to me) using the Intel® Architecture Code Analyzer (IACA). The following instruction using [base+index] addressing addps xmm1, xmmword ptr [rsi+rax*1] does not ...
Z boson's user avatar
  • 33.1k
20 votes
1 answer
2k views

What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

I want to be able to predict, by hand, exactly how long arbitrary arithmetical (i.e. no branching or memory, though that would be nice too) x86-64 assembly code will take given a particular ...
geometrian's user avatar
  • 15.1k
36 votes
3 answers
9k views

Why doesn't GCC use partial registers?

Disassembling write(1,"hi",3) on linux, built with gcc -s -nostdlib -nostartfiles -O3 results in: ba03000000 mov edx, 3 ; thanks for the correction jester! bf01000000 mov edi, 1 31c0 ...
Ábrahám Endre's user avatar
47 votes
2 answers
8k views

Can x86's MOV really be "free"? Why can't I reproduce this at all?

I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming. For the life of me, I can't verify this in a single test case. Every test case I try debunks ...
user541686's user avatar
  • 208k
268 votes
3 answers
54k views

How much of ‘What Every Programmer Should Know About Memory’ is still valid?

I am wondering how much of Ulrich Drepper's What Every Programmer Should Know About Memory from 2007 is still valid. Also I could not find a newer version than 1.0 or an errata. (Also in PDF form on ...
Framester's user avatar
  • 34.4k
105 votes
6 answers
33k views

Enhanced REP MOVSB for memcpy

I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy. ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and ...
Z boson's user avatar
  • 33.1k
53 votes
2 answers
4k views

How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent

This loop runs at one iteration per 3 cycles on Intel Conroe/Merom, bottlenecked on imul throughput as expected. But on Haswell/Skylake, it runs at one iteration per 11 cycles, apparently because ...
Peter Cordes's user avatar
77 votes
5 answers
81k views

How many CPU cycles are needed for each assembly instruction?

I heard there is an Intel book online which describes the CPU cycles needed for a specific assembly instruction, but I cannot find it (after trying hard). Could anyone show me how to find CPU cycle ...
George2's user avatar
  • 45.2k
11 votes
1 answer
3k views

Can a speculatively executed CPU branch contain opcodes that access RAM?

As I understand, when a CPU speculatively executes a piece of code, it "backs up" the register state before switching to the speculative branch, so that if the prediction turns out wrong (...
golosovsky's user avatar
17 votes
2 answers
3k views

Are loads and stores the only instructions that gets reordered?

I have read many articles on memory ordering, and all of them only say that a CPU reorders loads and stores. Does a CPU (I'm specifically interested in an x86 CPU) only reorder loads and stores, and ...
James's user avatar
  • 733
41 votes
4 answers
9k views

Observing stale instruction fetching on x86 with self-modifying code

I've been told and have read from Intel's manuals that it is possible to write instructions to memory, but the instruction prefetch queue has already fetched the stale instructions and will execute ...
Chris's user avatar
  • 2,846
19 votes
2 answers
5k views

Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?

We've got a simple memory throughput benchmark. All it does is memcpy repeatedly for a large block of memory. Looking at the results (compiled for 64-bit) on a few different machines, Skylake ...
aggieNick02's user avatar
  • 2,697
9 votes
1 answer
2k views

Adding a redundant assignment speeds up code when compiled without optimization

I find an interesting phenomenon: #include<stdio.h> #include<time.h> int main() { int p, q; clock_t s,e; s=clock(); for(int i = 1; i < 1000; i++){ for(int j = ...
helloqiu's user avatar
  • 133
19 votes
1 answer
9k views

Which cache mapping technique is used in intel core i7 processor?

I have learned about different cache mapping techniques like direct mapping and fully associative or set associative mapping, and the trade-offs between those. (Wikipedia) But I am curious which one ...
Subhadip's user avatar
  • 451
35 votes
3 answers
5k views

Is performance reduced when executing loops whose uop count is not a multiple of processor width?

I'm wondering how loops of various sizes perform on recent x86 processors, as a function of number of uops. Here's a quote from Peter Cordes who raised the issue of non-multiple-of-4 counts in ...
BeeOnRope's user avatar
  • 62.6k
52 votes
1 answer
22k views

gcc optimization flag -O3 makes code slower than -O2

I found the topic "Why is it faster to process a sorted array than an unsorted array?" and tried to run its code, and I see strange behavior. If I compile this code with the -O3 optimization flag it takes ...
Mike Minaev's user avatar
  • 2,042
240 votes
4 answers
107k views

What is the purpose of the "Prefer 32-bit" setting in Visual Studio and how does it actually work?

It is unclear to me how the compiler will automatically know to compile for 64-bit when it needs to. How does it know when it can confidently target 32-bit? I am mainly curious about how the compiler ...
Aaron's user avatar
  • 10.6k
42 votes
7 answers
33k views

Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?

Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
Karthik Balaguru's user avatar
700 votes
4 answers
91k views

How do I achieve the theoretical maximum of 4 FLOPs per cycle?

How can the theoretical peak performance of 4 floating point operations (double precision) per cycle be achieved on a modern x86-64 Intel CPU? As far as I understand it takes three cycles for an SSE ...
user1059432's user avatar
  • 7,568
98 votes
9 answers
32k views

System where 1 byte != 8 bit? [duplicate]

All the time I read sentences like don't rely on 1 byte being 8 bit in size use CHAR_BIT instead of 8 as a constant to convert between bits and bytes et cetera. What real life systems are there ...
Xeo's user avatar
  • 131k
17 votes
2 answers
2k views

Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths

I was playing with the code in this answer, slightly modifying it: BITS 64 GLOBAL _start SECTION .text _start: mov ecx, 1000000 .loop: ;T is a symbol defined with the CLI (-DT=...) TIMES T ...
Margaret Bloom's user avatar
12 votes
2 answers
1k views

Are there any modern CPUs where a cached byte store is actually slower than a word store?

It's a common claim that a byte store into cache may result in an internal read-modify-write cycle, or otherwise hurt throughput or latency vs. storing a full register. But I've never seen any ...
Peter Cordes's user avatar
8 votes
3 answers
2k views

Globally Invisible load instructions

Can some load instructions never be globally visible due to store-to-load forwarding? To put it another way, if a load instruction gets its value from the store buffer, it never has to read from ...
joz's user avatar
  • 319
6 votes
2 answers
5k views

Why isn't movl from memory to memory allowed?

I was wondering if this is allowed in assembly, movl (%edx) (%eax) I would have guessed that it access the memory in the first operand and puts in the memory of the second operand, something like ...
nochillfam's user avatar
55 votes
3 answers
7k views

How are x86 uops scheduled, exactly?

Modern x86 CPUs break down the incoming instruction stream into micro-operations (uops1) and then schedule these uops out-of-order as their inputs become ready. While the basic idea is clear, I'd like ...
BeeOnRope's user avatar
  • 62.6k
42 votes
2 answers
4k views

Why does breaking the "output dependency" of LZCNT matter?

While benchmarking something I measured a much lower throughput than I had calculated, which I narrowed down to the LZCNT instruction (it also happens with TZCNT), as demonstrated in the following ...
harold's user avatar
  • 63.4k
23 votes
1 answer
5k views

Does lock xchg have the same behavior as mfence?

What I'm wondering is if lock xchg will have similar behavior to mfence from the perspective of one thread accessing a memory location that is being mutated (let's just say at random) by other threads. ...
Valarauca's user avatar
  • 1,061
12 votes
1 answer
3k views

What exactly happens when a skylake CPU mispredicts a branch?

I'm trying to understand in detail what happens to instructions in the various stages of the skylake CPU pipeline when a branch is mis-predicted, and how quickly instructions from the correct branch ...
Steve Linton's user avatar
20 votes
2 answers
3k views

What is the stack engine in the Sandybridge microarchitecture?

I am reading http://www.realworldtech.com/sandy-bridge/ and I'm facing some problems in understanding some issues: The dedicated stack pointer tracker is also present in Sandy Bridge and renames ...
Gilgamesz's user avatar
  • 4,883
17 votes
2 answers
2k views

Problems with ADC/SBB and INC/DEC in tight loops on some CPUs

I am writing a simple BigInteger type in Delphi. It mainly consists of a dynamic array of TLimb, where a TLimb is a 32 bit unsigned integer, and a 32 bit size field, which also holds the sign bit for ...
Rudy Velthuis's user avatar
33 votes
2 answers
18k views

When an interrupt occurs, what happens to instructions in the pipeline?

Assume a 5-stage pipeline architecture (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). There are 4 instructions that have to be executed. ...
Raghupathy's user avatar
32 votes
1 answer
8k views

What happens after a L2 TLB miss?

I'm struggling to understand what happens when the first two levels of the Translation Lookaside Buffer result in misses? I am unsure whether "page walking" occurs in special hardware circuitry, or ...
user997112's user avatar
  • 29.9k
8 votes
2 answers
2k views

How does MIPS I handle branching on the previous ALU instruction without stalling?

addiu $6,$6,5 bltz $6,$L5 nop ... $L5: How is this safe without stalling, which classic MIPS couldn't even do, except on cache miss? (MIPS originally stood for ...
Peter Cordes's user avatar
60 votes
2 answers
70k views

FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

I'm confused on how many flops per cycle per core can be done with Sandy-Bridge and Haswell. As I understand it with SSE it should be 4 flops per cycle per core for SSE and 8 flops per cycle per core ...
138 votes
6 answers
127k views

What is the "FS"/"GS" register intended for?

So I know what the following registers and their uses are supposed to be: CS = Code Segment (used for IP) DS = Data Segment (used for MOV) ES = Destination Segment (used for MOVS, etc.) SS = Stack ...
user541686's user avatar
  • 208k
18 votes
4 answers
5k views

What setup does REP do?

Quoting Intel® 64 and IA-32 architectures optimization reference manual, §2.4.6 "REP String Enhancement": The performance characteristics of using REP string can be attributed to two components: ...
edmz's user avatar
  • 8,374
14 votes
2 answers
2k views

What is a Partial Flag Stall?

I was just going over this answer by Peter Cordes and he says, Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be ...
Evan Carroll's user avatar
13 votes
4 answers
4k views

How does memory reordering help processors and compilers?

I studied the Java memory model and saw re-ordering problems. A simple example: boolean first = false; boolean second = false; void setValues() { first = true; second = true; } void ...
aleksandr-chermashentsev's user avatar
12 votes
1 answer
4k views

Slow jmp-instruction

As a follow-up to my question The advantages of using 32bit registers/instructions in x86-64, I started to measure the costs of instructions. I'm aware that this has been done multiple times (e.g. ...
ead's user avatar
  • 33.7k
348 votes
4 answers
50k views

Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs

I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions: Your ...
Cowmoogun's user avatar
  • 2,557
129 votes
11 answers
140k views

Floating point vs integer calculations on modern hardware

I am doing some performance critical work in C++, and we are currently using integer calculations for problems that are inherently floating point because "it's faster". This causes a whole ...
maxpenguin's user avatar
  • 5,099
37 votes
2 answers
22k views

Atomicity of loads and stores on x86

8.1.2 Bus Locking Intel 64 and IA-32 processors provide a LOCK# signal that is asserted automatically during certain critical memory operations to lock the system bus or equivalent link. While this ...
Gilgamesz's user avatar
  • 4,883
9 votes
2 answers
10k views

why we can't move a 64-bit immediate value to memory?

First I am a little bit confused with the differences between movq and movabsq, my text book says: The regular movq instruction can only have immediate source operands that can be represented as 32-...
4 votes
2 answers
2k views

Latency bounds and throughput bounds for processors for operations that must occur in sequence

My textbook (Computer Systems: A programmer's perspective) states that a latency bound is encountered when a series of operations must be performed in strict sequence, while a throughput bound ...
mooglin's user avatar
  • 514
13 votes
3 answers
3k views

Avoid stalling pipeline by calculating conditional early

When talking about the performance of ifs, we usually talk about how mispredictions can stall the pipeline. The recommended solutions I see are: Trust the branch predictor for conditions that usually ...
Jibb Smart's user avatar
229 votes
5 answers
131k views

How do cache lines work?

I understand that the processor brings data into the cache via cache lines, which - for instance, on my Atom processor - brings in about 64 bytes at a time, whatever the size of the actual data being ...
Norswap's user avatar
  • 12k
63 votes
10 answers
117k views

Maximum memory which malloc can allocate

I was trying to figure out how much memory I can malloc to maximum extent on my machine (1 Gb RAM 160 Gb HD Windows platform). I read that the maximum memory malloc can allocate is limited to ...
Vikas's user avatar
  • 1,442
10 votes
2 answers
3k views

Is LFENCE serializing on AMD processors?

In recent Intel ISA documents the lfence instruction has been defined as serializing the instruction stream (preventing out-of-order execution across it). In particular, the description of the ...
BeeOnRope's user avatar
  • 62.6k
