Frequent 'cpu-architecture' Questions

27243 votes

25 answers

1.9m views

Why is processing a sorted array faster than processing an unsorted array?

In this C++ code, sorting the data (before the timed region) makes the primary loop ~6x faster: #include <algorithm> #include <ctime> #include <iostream> int main() { // ...

GManNickG

499k

asked Jun 27, 2012 at 13:51

98 votes

3 answers

15k views

Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?

LOOP (Intel ref manual entry) decrements ecx / rcx, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? dec/jnz already macro-fuses into a single uop on Sandybridge-...

Peter Cordes

348k

asked Mar 2, 2016 at 9:01

63 votes

4 answers

8k views

Micro fusion and addressing modes

I have found something unexpected (to me) using the Intel® Architecture Code Analyzer (IACA). The following instruction using [base+index] addressing addps xmm1, xmmword ptr [rsi+rax*1] does not ...

Z boson

33.1k

asked Sep 25, 2014 at 19:33

20 votes

1 answer

2k views

What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

I want to be able to predict, by hand, exactly how long arbitrary arithmetical (i.e. no branching or memory, though that would be nice too) x86-64 assembly code will take given a particular ...

geometrian

15.1k

asked Jul 31, 2018 at 7:08

36 votes

3 answers

9k views

Why doesn't GCC use partial registers?

Disassembling write(1,"hi",3) on linux, built with gcc -s -nostdlib -nostartfiles -O3 results in: ba03000000 mov edx, 3 ; thanks for the correction jester! bf01000000 mov edi, 1 31c0 ...

Ábrahám Endre

669

asked Jan 10, 2017 at 16:23

47 votes

2 answers

8k views

Can x86's MOV really be "free"? Why can't I reproduce this at all?

I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming. For the life of me, I can't verify this in a single test case. Every test case I try debunks ...

user541686

208k

asked May 24, 2017 at 22:16

268 votes

3 answers

54k views

How much of ‘What Every Programmer Should Know About Memory’ is still valid?

I am wondering how much of Ulrich Drepper's What Every Programmer Should Know About Memory from 2007 is still valid. Also I could not find a newer version than 1.0 or an errata. (Also in PDF form on ...

Framester

34.4k

asked Nov 14, 2011 at 18:30

105 votes

6 answers

33k views

Enhanced REP MOVSB for memcpy

I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy. ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and ...

Z boson

33.1k

asked Apr 11, 2017 at 10:22

53 votes

2 answers

4k views

How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent

This loop runs at one iteration per 3 cycles on Intel Conroe/Merom, bottlenecked on imul throughput as expected. But on Haswell/Skylake, it runs at one iteration per 11 cycles, apparently because ...

Peter Cordes

348k

asked Aug 13, 2017 at 12:05

77 votes

5 answers

81k views

How many CPU cycles are needed for each assembly instruction?

I heard there is an Intel book online which describes the CPU cycles needed for a specific assembly instruction, but I cannot find it (after trying hard). Could anyone show me how to find CPU cycle ...

George2

45.2k

asked Mar 28, 2009 at 12:46

11 votes

1 answer

3k views

Can a speculatively executed CPU branch contain opcodes that access RAM?

As I understand, when a CPU speculatively executes a piece of code, it "backs up" the register state before switching to the speculative branch, so that if the prediction turns out wrong (...

golosovsky

658

asked Sep 30, 2020 at 15:57

17 votes

2 answers

3k views

Are loads and stores the only instructions that get reordered?

I have read many articles on memory ordering, and all of them only say that a CPU reorders loads and stores. Does a CPU (I'm specifically interested in an x86 CPU) only reorder loads and stores, and ...

James

733

asked May 23, 2018 at 17:57

41 votes

4 answers

9k views

Observing stale instruction fetching on x86 with self-modifying code

I've been told and have read from Intel's manuals that it is possible to write instructions to memory, but the instruction prefetch queue has already fetched the stale instructions and will execute ...

Chris

2,846

asked Jun 30, 2013 at 22:52

19 votes

2 answers

5k views

Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?

We've got a simple memory throughput benchmark. All it does is memcpy repeatedly for a large block of memory. Looking at the results (compiled for 64-bit) on a few different machines, Skylake ...

aggieNick02

2,697

asked Aug 31, 2016 at 22:32

9 votes

1 answer

2k views

Adding a redundant assignment speeds up code when compiled without optimization

I find an interesting phenomenon: #include<stdio.h> #include<time.h> int main() { int p, q; clock_t s,e; s=clock(); for(int i = 1; i < 1000; i++){ for(int j = ...

helloqiu

133

asked Mar 9, 2018 at 8:41

19 votes

1 answer

9k views

Which cache mapping technique is used in intel core i7 processor?

I have learned about different cache mapping techniques like direct mapping and fully associative or set associative mapping, and the trade-offs between those. (Wikipedia) But I am curious which one ...

Subhadip

451

asked Mar 4, 2018 at 6:11

35 votes

3 answers

5k views

Is performance reduced when executing loops whose uop count is not a multiple of processor width?

I'm wondering how loops of various sizes perform on recent x86 processors, as a function of number of uops. Here's a quote from Peter Cordes who raised the issue of non-multiple-of-4 counts in ...

BeeOnRope

62.6k

asked Sep 3, 2016 at 22:28

52 votes

1 answer

22k views

gcc optimization flag -O3 makes code slower than -O2

I found the topic "Why is it faster to process a sorted array than an unsorted array?" and tried to run its code, and I found strange behavior. If I compile this code with the -O3 optimization flag it takes ...

Mike Minaev

2,042

asked Mar 5, 2015 at 10:17

240 votes

4 answers

107k views

What is the purpose of the "Prefer 32-bit" setting in Visual Studio and how does it actually work?

It is unclear to me how the compiler will automatically know to compile for 64-bit when it needs to. How does it know when it can confidently target 32-bit? I am mainly curious about how the compiler ...

Aaron

10.6k

asked Aug 22, 2012 at 5:13

42 votes

7 answers

33k views

Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?

Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?

Karthik Balaguru

7,634

asked Jan 12, 2011 at 8:41

700 votes

4 answers

91k views

How do I achieve the theoretical maximum of 4 FLOPs per cycle?

How can the theoretical peak performance of 4 floating point operations (double precision) per cycle be achieved on a modern x86-64 Intel CPU? As far as I understand it takes three cycles for an SSE ...

user1059432

7,568

asked Dec 5, 2011 at 17:54

98 votes

9 answers

32k views

System where 1 byte != 8 bit? [duplicate]

All the time I read sentences like don't rely on 1 byte being 8 bit in size use CHAR_BIT instead of 8 as a constant to convert between bits and bytes et cetera. What real life systems are there ...

Xeo

131k

asked Apr 1, 2011 at 16:16

17 votes

2 answers

2k views

Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths

I was playing with the code in this answer, slightly modifying it: BITS 64 GLOBAL _start SECTION .text _start: mov ecx, 1000000 .loop: ;T is a symbol defined with the CLI (-DT=...) TIMES T ...

Margaret Bloom

43k

asked Aug 23, 2018 at 12:39

12 votes

2 answers

1k views

Are there any modern CPUs where a cached byte store is actually slower than a word store?

It's a common claim that a byte store into cache may result in an internal read-modify-write cycle, or otherwise hurt throughput or latency vs. storing a full register. But I've never seen any ...

Peter Cordes

348k

asked Jan 16, 2019 at 12:54

8 votes

3 answers

2k views

Globally Invisible load instructions

Can some of the load instructions never be globally visible due to store-load forwarding? To put it another way, if a load instruction gets its value from the store buffer, it never has to read from ...

joz

319

asked May 30, 2018 at 16:56

6 votes

2 answers

5k views

Why isn't movl from memory to memory allowed?

I was wondering if this is allowed in assembly, movl (%edx) (%eax) I would have guessed that it access the memory in the first operand and puts in the memory of the second operand, something like ...

nochillfam

73

asked Nov 19, 2015 at 2:23

55 votes

3 answers

7k views

How are x86 uops scheduled, exactly?

Modern x86 CPUs break down the incoming instruction stream into micro-operations (uops) and then schedule these uops out-of-order as their inputs become ready. While the basic idea is clear, I'd like ...

BeeOnRope

62.6k

asked Nov 18, 2016 at 15:58

42 votes

2 answers

4k views

Why does breaking the "output dependency" of LZCNT matter?

While benchmarking something I measured a much lower throughput than I had calculated, which I narrowed down to the LZCNT instruction (it also happens with TZCNT), as demonstrated in the following ...

harold

63.4k

asked Jan 27, 2014 at 19:45

23 votes

1 answer

5k views

Does lock xchg have the same behavior as mfence?

What I'm wondering is if lock xchg will have similar behavior to mfence from the perspective of one thread accessing a memory location that is being mutated (let's just say at random) by other threads. ...

Valarauca

1,061

asked Nov 3, 2016 at 18:59

12 votes

1 answer

3k views

What exactly happens when a skylake CPU mispredicts a branch?

I'm trying to understand in detail what happens to instructions in the various stages of the skylake CPU pipeline when a branch is mis-predicted, and how quickly instructions from the correct branch ...

Steve Linton

359

asked Jun 22, 2018 at 8:41

20 votes

2 answers

3k views

What is the stack engine in the Sandybridge microarchitecture?

I am reading http://www.realworldtech.com/sandy-bridge/ and I'm facing some problems in understanding some issues: The dedicated stack pointer tracker is also present in Sandy Bridge and renames ...

Gilgamesz

4,883

asked Apr 14, 2016 at 18:50

17 votes

2 answers

2k views

Problems with ADC/SBB and INC/DEC in tight loops on some CPUs

I am writing a simple BigInteger type in Delphi. It mainly consists of a dynamic array of TLimb, where a TLimb is a 32 bit unsigned integer, and a 32 bit size field, which also holds the sign bit for ...

Rudy Velthuis

28.6k

asked Aug 18, 2015 at 23:25

33 votes

2 answers

18k views

When an interrupt occurs, what happens to instructions in the pipeline?

Assume a 5-stage pipeline architecture (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). There are 4 instructions that have to be executed. ...

Raghupathy

863

asked Jan 17, 2012 at 21:44

32 votes

1 answer

8k views

What happens after a L2 TLB miss?

I'm struggling to understand what happens when the first two levels of the Translation Lookaside Buffer result in misses? I am unsure whether "page walking" occurs in special hardware circuitry, or ...

user997112

29.9k

asked Aug 27, 2015 at 17:51

8 votes

2 answers

2k views

How does MIPS I handle branching on the previous ALU instruction without stalling?

addiu $6,$6,5 bltz $6,$L5 nop ... $L5: How is this safe without stalling, which classic MIPS couldn't even do, except on cache miss? (MIPS originally stood for ...

Peter Cordes

348k

asked Jun 13, 2019 at 18:25

60 votes

2 answers

70k views

FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

I'm confused on how many flops per cycle per core can be done with Sandy-Bridge and Haswell. As I understand it with SSE it should be 4 flops per cycle per core for SSE and 8 flops per cycle per core ...

user2088790

asked Mar 27, 2013 at 9:48

138 votes

6 answers

127k views

What is the "FS"/"GS" register intended for?

So I know what the following registers and their uses are supposed to be: CS = Code Segment (used for IP) DS = Data Segment (used for MOV) ES = Destination Segment (used for MOVS, etc.) SS = Stack ...

user541686

208k

asked May 30, 2012 at 4:57

18 votes

4 answers

5k views

What setup does REP do?

Quoting Intel® 64 and IA-32 architectures optimization reference manual, §2.4.6 "REP String Enhancement": The performance characteristics of using REP string can be attributed to two components: ...

edmz

8,374

asked Nov 24, 2015 at 19:21

14 votes

2 answers

2k views

What is a Partial Flag Stall?

I was just going over this answer by Peter Cordes and he says, Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be ...

Evan Carroll

1

asked Apr 16, 2018 at 23:21

13 votes

4 answers

4k views

How does memory reordering help processors and compilers?

I studied the Java memory model and saw re-ordering problems. A simple example: boolean first = false; boolean second = false; void setValues() { first = true; second = true; } void ...

aleksandr-chermashentsev

227

asked Jun 9, 2016 at 12:04

12 votes

1 answer

4k views

Slow jmp-instruction

As a follow-up to my question "The advantages of using 32bit registers/instructions in x86-64", I started to measure the costs of instructions. I'm aware that this has been done multiple times (e.g. ...

ead

33.7k

asked Aug 7, 2016 at 7:23

348 votes

4 answers

50k views

Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs

I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions: Your ...

Cowmoogun

2,557

asked May 21, 2016 at 9:29

129 votes

11 answers

140k views

Floating point vs integer calculations on modern hardware

I am doing some performance-critical work in C++, and we are currently using integer calculations for problems that are inherently floating point because "it's faster". This causes a whole ...

maxpenguin

5,099

asked Mar 31, 2010 at 3:15

37 votes

2 answers

22k views

Atomicity of loads and stores on x86

8.1.2 Bus Locking Intel 64 and IA-32 processors provide a LOCK# signal that is asserted automatically during certain critical memory operations to lock the system bus or equivalent link. While this ...

Gilgamesz

4,883

asked Jul 18, 2016 at 23:02

9 votes

2 answers

10k views

Why can't we move a 64-bit immediate value to memory?

First, I am a little bit confused about the differences between movq and movabsq. My textbook says: The regular movq instruction can only have immediate source operands that can be represented as 32-...

user9623401

asked Jul 7, 2020 at 8:42

4 votes

2 answers

2k views

Latency bounds and throughput bounds for processors for operations that must occur in sequence

My textbook (Computer Systems: A programmer's perspective) states that a latency bound is encountered when a series of operations must be performed in strict sequence, while a throughput bound ...

mooglin

514

asked Jul 26, 2020 at 2:01

13 votes

3 answers

3k views

Avoid stalling pipeline by calculating conditional early

When talking about the performance of ifs, we usually talk about how mispredictions can stall the pipeline. The recommended solutions I see are: Trust the branch predictor for conditions that usually ...

Jibb Smart

295

asked Apr 20, 2018 at 0:26

229 votes

5 answers

131k views

How do cache lines work?

I understand that the processor brings data into the cache via cache lines, which - for instance, on my Atom processor - brings in about 64 bytes at a time, whatever the size of the actual data being ...

Norswap

12k

asked Oct 13, 2010 at 23:56

63 votes

10 answers

117k views

Maximum memory which malloc can allocate

I was trying to figure out how much memory I can malloc to maximum extent on my machine (1 Gb RAM 160 Gb HD Windows platform). I read that the maximum memory malloc can allocate is limited to ...

Vikas

1,442

asked May 9, 2010 at 16:31

10 votes

2 answers

3k views

Is LFENCE serializing on AMD processors?

In recent Intel ISA documents the lfence instruction has been defined as serializing the instruction stream (preventing out-of-order execution across it). In particular, the description of the ...

BeeOnRope

62.6k

asked Aug 14, 2018 at 15:26
