Is there a penalty when base+offset is in a different page than the base?




The execution times for these three snippets:


pageboundary: dq (pageboundary + 8)
...

mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx - 8]
sub ecx, 1
jnz .loop



And this:


pageboundary: dq (pageboundary - 8)
...

mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx + 8]
sub ecx, 1
jnz .loop



And this:


pageboundary: dq (pageboundary - 4096)
...

mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx + 4096]
sub ecx, 1
jnz .loop



Are, on a 4770K, roughly 5 cycles per iteration for the first snippet, roughly 9 cycles per iteration for the second snippet, and 5 cycles for the third snippet. All three access the exact same address, which is 4K-aligned. In the second snippet, only the address calculation crosses the page boundary: rdx and rdx + 8 don't belong to the same page, but the load itself is still aligned. With a large offset it's back to 5 cycles again.
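For concreteness (using a made-up address; only the page-relative positions matter), the addresses work out like this:

; suppose pageboundary ends up at 0x403000 (4K-aligned):
;   snippet 1: rdx = 0x403008, load [rdx - 8]    -> 0x403000, rdx and rdx-8 in the same page
;   snippet 2: rdx = 0x402ff8, load [rdx + 8]    -> 0x403000, rdx is in the previous page
;   snippet 3: rdx = 0x402000, load [rdx + 4096] -> 0x403000, large displacement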





How does this effect work in general?



Routing the result from the load through an ALU instruction like this:


.loop:
mov rdx, [rdx + 8]
or rdx, 0
sub ecx, 1
jnz .loop



Makes it take 6 cycles per iteration, which makes sense as 5 + 1. [reg + 8] should be a special fast load that, AFAIK, takes 4 cycles, so even in this case there seems to be some penalty, but only 1 cycle.



A test like this was used in response to some of the comments:


.loop:
lfence
; or rdx, 0
mov rdx, [rdx + 8]
; or rdx, 0
; uncomment one of the ORs
lfence
sub ecx, 1
jnz .loop



Putting the or before the mov makes the loop faster than without any or; putting the or after the mov makes it a cycle slower.








That's weird. I don't think Intel's docs mention this failure for SnB-family's [base + 0..2047] special case 4-cycle load-use latency, but it's plausible that it's based on using the base reg to start a TLB check before an add, and is slower if it turns out they're in different pages. (And BTW, that special case is only when forwarding to another addressing mode, not to an ALU instruction.)

– Peter Cordes
Sep 16 '18 at 6:53








Yes, inserting an ALU instruction into the dep chain decreases the total latency, which is pretty funny (like a negative-latency instruction)

– harold
Sep 16 '18 at 6:55






Feeding an ALU instruction always disables the 4-cycle pointer-chasing fast path. You'd get 6 cycles from that loop even without any page-crossing shenanigans, including with mov rdx, [rdx] / and rdx,rdx.

– Peter Cordes
Sep 16 '18 at 8:03








This is a really good find. I've added this effect to the Intel Performance Quirks page with links to the question and @PeterCordes' answer.

– BeeOnRope
Sep 17 '18 at 21:11







I tested this on Ryzen and didn't see any similar effect: the loop still executes at 4 cycles with the loads on different pages. Ryzen also doesn't have the restriction of the load address needing to come from a load: with a 1 cycle ALU added, the total latency goes up to 5 cycles (4 + 1), versus 6 cycles on Intel (since the load takes 5 cycles itself in that case).

– BeeOnRope
Sep 17 '18 at 23:04




2 Answers



Optimization rule: in pointer-connected data structures like linked lists / trees, put the next or left/right pointers in the first 16 bytes of the object. malloc typically returns 16-byte aligned blocks (alignof(max_align_t)), so this will ensure the linking pointers are in the same page as the start of the object.





Any other way of ensuring that important struct members are in the same page as the start of the object will also work.
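As a minimal sketch of that layout (hypothetical field names, NASM struc syntax):

struc node
    .next:    resq 1    ; offset 0: link pointer, reachable with a small displacement
    .prev:    resq 1    ; offset 8: still within the first 16 bytes of the object
    .payload: resb 48   ; larger / rarely-chased members go after the pointers
endstruc

; chasing the list then uses [reg + small disp], staying in the same page as the node:
;   mov rdi, [rdi + node.next]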



Sandybridge-family normally has 5 cycle L1d load-use latency, but there's a special case for pointer-chasing with small positive displacements in base+disp addressing modes.



Sandybridge-family has 4-cycle load-use latency for [reg + 0..2047] addressing modes when the base reg is the result of a mov load, not an ALU instruction; there's a penalty if reg + disp is in a different page than reg.





Based on these test results on Haswell and Skylake (and probably original SnB, but we don't know), it appears that all of the following conditions must be true to get the 4-cycle latency:

- The base register comes directly from another load (a pointer-chasing dep chain), not from an ALU instruction.
- The addressing mode is [reg] or [reg+disp8/disp32] with the displacement in the 0..2047 range, not an indexed addressing mode.
- reg+disp is in the same 4K page as reg (otherwise the fast path is attempted but fails, and the load is retried at a much higher latency).
- (Skylake but not Haswell/Broadwell): the last load wasn't a retried fast path. (So with base = the result of a 4 or 5 cycle load, it will attempt the fast path. But with base = the result of a 10-cycle retried load, it won't. The penalty on SKL seems to be 10, vs. 9 on HSW.)



I don't know if it's the last load attempted on that load port that matters, or if it's actually what happened to the load that produced that input. Perhaps experiments chasing two dep chains in parallel could shed some light; I've only tried one pointer chasing dep chain with a mix of page-changing and non-page-changing displacements.



If all those things are true, the load port speculates that the final effective address will be in the same page as the base register. This is a useful optimization in real cases when load-use latency forms a loop-carried dep chain, like for a linked list or binary tree.
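For example (just a sketch of the shape of such a loop, not one of the tests above), a null-terminated list walk with the link at offset 0 is latency-bound on exactly this chain:

.walk:
    mov  rdi, [rdi]     ; pointer-chasing load: 4c when the fast path works, 5c otherwise
    test rdi, rdi       ; end-of-list check; doesn't feed the next load, so it doesn't
    jnz  .walk          ; lengthen the loop-carried dep chain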



Microarchitectural explanation (my best guess at explaining the result, not from anything Intel published):



It seems that indexing the L1dTLB is on the critical path for L1d load latency. Starting that 1 cycle early (without waiting for the output of an adder to calculate the final address) shaves a cycle off the full process of indexing L1d using the low 12 bits of the address, then comparing the 8 tags in that set against the high bits of the physical address produced by the TLB. (Intel's L1d is VIPT 8-way 32 kiB, so it has no aliasing problems because the index bits all come from the low 12 bits of the address: the offset within a page, which is the same in both the virtual and physical address. In other words, the low 12 bits translate for free from virt to phys.)
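The cache-geometry arithmetic behind that claim (standard VIPT math, nothing Intel-specific):

$$\frac{32\,\text{KiB}}{8\ \text{ways} \times 64\,\text{B/line}} = 64\ \text{sets}, \qquad \log_2 64 + \log_2 64 = 6 + 6 = 12 = \log_2 4096,$$

so the set-index and line-offset bits fit entirely inside the 12-bit page offset.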



Since we don't find an effect for crossing 64-byte boundaries, we know the load port is adding the displacement before indexing the cache.



As Hadi suggests, it seems likely that if there's carry-out from bit 11, the load port lets the wrong-TLB load complete and then redoes it using the normal path. (On HSW, the total load latency = 9. On SKL the total load latency can be 7.5 or 10).
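Put another way (my formalization of this guess, not Intel's wording), with a displacement in the 0..2047 range the speculation is correct exactly when the add doesn't carry out of bit 11:

$$\left\lfloor \frac{\text{base}}{4096} \right\rfloor = \left\lfloor \frac{\text{base}+\text{disp}}{4096} \right\rfloor \iff (\text{base} \bmod 4096) + \text{disp} < 4096.$$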



Aborting right away and retrying on the next cycle (to make it 5 or 6 cycles instead of 9) would in theory be possible, but remember that the load ports are pipelined with 1 per clock throughput. The scheduler is expecting to be able to send another uop to the load port in the next cycle, and Sandybridge-family standardizes latencies for everything of 5 cycles and shorter. (There are no 2-cycle instructions).



I didn't test if 2M hugepages help, but probably not. I think the TLB hardware is simple enough that it couldn't recognize that a 1-page-higher index would still pick the same entry. So it probably does the slow retry any time the displacement crosses a 4k boundary, even if that's in the same hugepage. (Page-split loads work this way: if the data actually crosses a 4k boundary, e.g. an 8-byte load starting 4 bytes before a page boundary, you pay the page-split penalty, not just the cache-line-split penalty, regardless of hugepages.)
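For example (a standalone sketch, with a made-up splitbuf symbol assumed to be 4096-byte aligned), this is the kind of access that pays the page-split penalty no matter the page size:

lea rsi, [splitbuf + 4096 - 4]   ; 4 bytes before a 4K boundary
mov rax, [rsi]                   ; the 8-byte load covers splitbuf+4092 .. splitbuf+4099,
                                 ; crossing the 4K boundary: page-split penalty applies
                                 ; (not just a cache-line split), hugepages or not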



Intel's optimization manual documents this special case in section 2.4.5.2 L1 DCache (in the Sandybridge section), but doesn't mention any different-page limitation, or the fact that it's only for pointer-chasing, and doesn't happen when there's an ALU instruction in the dep chain.


(Sandybridge)
Table 2-21. Effect of Addressing Modes on Load Latency
-----------------------------------------------------------------------
Data Type             | Base + Offset > 2048     | Base + Offset < 2048
                      | Base + Index [+ Offset]  |
----------------------+--------------------------+----------------------
Integer               |            5             |           4
MMX, SSE, 128-bit AVX |            6             |           5
X87                   |            7             |           6
256-bit AVX           |            7             |           7
(remember, 256-bit loads on SnB take 2 cycles in the load port, unlike on HSW/SKL)



The text around this table also doesn't mention the limitations that exist on Haswell/Skylake, and may also exist on SnB (I don't know).



Maybe Sandybridge doesn't have those limitations and Intel didn't document the Haswell regression, or else Intel just didn't document the limitations in the first place. The table is pretty definite about that addressing mode always being 4c latency with offset = 0..2047.



@Harold's experiment of putting an ALU instruction as part of the load/use pointer-chasing dependency chain confirms that it's this effect that's causing the slowdown: an ALU insn decreased the total latency, effectively giving an instruction like and rdx, rdx negative incremental latency when added to the mov rdx, [rdx-8] dep chain in this specific page-crossing case.





Previous guesses in this answer included the suggestion that using the load result in an ALU vs. another load was what determined the latency. That would be super weird and require looking into the future. That was a wrong interpretation on my part of the effect of adding an ALU instruction into the loop. (I hadn't known about the 9-cycle effect on page crossing, and was thinking that the HW mechanism was a forwarding fast-path for the result inside the load port. That would make sense.)



We can prove that it's the source of the base reg input that matters, not the destination of the load result: Store the same address at 2 separate locations, before and after a page boundary. Create a dep chain of ALU => load => load, and check that it's the 2nd load that's vulnerable to this slowdown / able to benefit from the speedup with a simple addressing mode.


%define off 16
lea rdi, [buf+4096 - 16]    ; 16 bytes before a page boundary
mov [rdi], rdi              ; store a pointer to itself just before the boundary
mov [rdi+off], rdi          ; and store the same pointer off bytes later
                            ; (in the next page when off = 16)

mov ebp, 100000000
.loop:

and rdi, rdi                ; ALU instruction feeding the first load's base
mov rdi, [rdi]              ; base comes from AND
mov rdi, [rdi+off]          ; base comes from a load

dec ebp
jnz .loop

... sys_exit_group(0)

section .bss
align 4096
buf: resb 4096*2



Timed with Linux perf on SKL i7-6700k.





With off = 8, the speculation is correct and we get total latency = 10 cycles = 1 + 5 + 4 (10 cycles per iteration).





With off = 16, the [rdi+off] load is slow, and we get 16 cycles / iter = 1 + 5 + 10. (The penalty seems to be higher on SKL than HSW.)





With the load order reversed (doing the [rdi+off] load first), it's always 10c regardless of off=8 or off=16, so we've proved that mov rdi, [rdi+off] doesn't attempt the speculative fast-path if its input is from an ALU instruction.





Without the and, and off=8, we get the expected 8c per iter: both use the fast path. (@harold confirms HSW also gets 8 here).





Without the and, and off=16, we get 15c per iter: 5+10. The mov rdi, [rdi+16] attempts the fast path and fails, taking 10c. Then mov rdi, [rdi] doesn't attempt the fast-path because its input failed. (@harold's HSW takes 13 here: 4 + 9. So that confirms HSW does attempt the fast-path even if the last fast-path failed, and that the fast-path fail penalty really is only 9 on HSW vs. 10 on SKL)





It's unfortunate that SKL doesn't realize that [base] with no displacement can always safely use the fast path.





On SKL, with just mov rdi, [rdi+16] in the loop, the average latency is 7.5 cycles. Based on tests with other mixes, I think it alternates between 5c and 10c: after a 5c load that didn't attempt the fast path, the next one does attempt it and fails, taking 10c. That makes the next load use the safe 5c path.





Adding a zeroed index register actually speeds it up in this case where we know the fast-path is always going to fail. Or using no base register, like [nosplit off + rdi*1], which NASM assembles to 48 8b 3c 3d 10 00 00 00 mov rdi,QWORD PTR [rdi*1+0x10]. Notice that this requires a disp32, so it's bad for code size.
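A sketch of both workarounds (reusing off from the test above; the exact registers are arbitrary):

xor  esi, esi                    ; zeroed index register
mov  rdi, [rdi + rsi*1 + off]    ; indexed addressing mode: never attempts the 4c fast
                                 ; path, avoiding the alternating 5c / 10c retries

mov  rdi, [nosplit off + rdi*1]  ; or no base register at all: rdi is only an index,
                                 ; so again no fast-path attempt (but costs a disp32)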





Also beware that indexed addressing modes for micro-fused memory operands are un-laminated in some cases, while base+disp modes aren't. But if you're using pure loads (like mov or vbroadcastss), there's nothing inherently wrong with an indexed addressing mode. Using an extra zeroed register isn't great, though.














Sandy Bridge actually has a performance event, AGU_BYPASS_CANCEL.COUNT, whose name and description pretty much explain the effect: "This event counts executed load operations with all the following traits: 1. addressing of the format [base + offset], 2. the offset is between 1 and 2047, 3. the address specified in the base register is in one page and the address [base+offset] is in an" (yes, it ends abruptly like that). The "between 1" part seems wrong since, as you point out, it happens even for zero offsets.

– BeeOnRope
Oct 13 '18 at 3:41









I think I found the Intel patent that describes this particular optimization. It's pretty old. It says: "The invention has the advantages that it improves the "base-plus-displacement/offset" and the "scaled-index-plus-displacement" addressing modes by one clock most of the time." We have verified this for base+disp but not the latter. Also it's not clear to me how the terms disp and offset are used in the patent.

– Hadi Brais
Nov 26 '18 at 2:01







On Sandy Bridge, the fast path is 4c, the slow path is 5c, and the mispredict path is 9c, just like Haswell. On Ivy Bridge, the fast path is 4c, the slow path is around 4.6c, and the mispredict path is around 8.7c. On both, indexed addressing mode does not use the 4c path even if the index register is zeroed using a zeroing idiom. I don't know how to interpret the IvB numbers. On both, the AGU_BYPASS_CANCEL performance event can be used to count fast path mispredicts. This counter does not work on Haswell.

– Hadi Brais
Jan 14 at 20:34








BTW, I have now temporary access to IvB and SnB (and hopefully soon Westmere). So if we need to run any experiments on them, let me know.

– Hadi Brais
Jan 14 at 20:35



I've conducted a sufficient number of experiments on Haswell to determine exactly when memory loads are issued speculatively before the effective address is fully calculated. These results also confirm Peter's guess.



I've varied the following parameters:

- The offset used in the definition of pageboundary and in the load instruction (the same magnitude in both, as in the question's snippets).
- The sign of the offset in the definition of pageboundary and the (opposite) sign used in the load instruction.
- The alignment of pageboundary within the executable binary, in particular whether it sits on a 4K page boundary.



In all of the following graphs, the Y axis represents the load latency in core cycles. The X axis represents the configuration in the form NS1S2, where N is the offset, S1 is the sign of the offset used in the definition, and S2 is the sign used in the load instruction.
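To make the notation concrete (modelled on the question's snippets; the actual test harness may differ in details), a configuration such as 8-+ would correspond to:

pageboundary: dq (pageboundary - 8)   ; N = 8, S1 = '-' in the definition
...
mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx + 8]                    ; S2 = '+' in the load, so the load always targets pageboundary
sub ecx, 1
jnz .loop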



The following graph shows that loads are issued before calculating the effective address only when the offset is positive or zero. Note that for all of the offsets between 0-15, the base address and the effective address used in the load instruction are both within the same 4K page.



(Graph: load latency vs. offset for offsets 0-15, base and effective address in the same 4K page.)



The next graph shows the point where this pattern changes. The change occurs at offset 213, which is the smallest offset where the base address and the effective address used in the load instruction are in different 4K pages.



(Graph: load latency vs. offset around the point where the effective address crosses into a different 4K page.)



Another important observation that can be made from the previous two graphs is that even if the base address points to a different cache set than the effective address, no penalty is incurred. So it seems that the cache set is opened after calculating the effective address. This indicates that the L1 DTLB hit latency is 2 cycles (that is, it takes 2 cycles for the L1D to receive the tag), but it takes only 1 cycle to open the cache's data array set and the cache's tag array set (which occurs in parallel).



The next graph shows what happens when pageboundary is aligned on a 4K page boundary. In this case, any non-zero offset makes the base and effective addresses reside in different pages. For example, if pageboundary is at address 4096, then the base address used in the load instruction is 4096 - offset, which is obviously in a different 4K page for any non-zero offset.





(Graph: load latency vs. offset with pageboundary aligned on a 4K page boundary.)



The next graph shows that the pattern changes again starting from offset 2048. At this point, loads are never issued before calculating the effective address.



(Graph: load latency vs. offset for offsets of 2048 and above.)



This analysis can be confirmed by measuring the number of uops dispatched to load ports 2 and 3. The total number of retired load uops is 1 billion (equal to the number of iterations). However, when the measured load latency is 9 cycles, the number of load uops dispatched to each of the two ports is 1 billion, i.e. every load is dispatched twice. When the load latency is 5 or 4 cycles, the number of load uops dispatched to each of the two ports is 0.5 billion, i.e. every load is dispatched once. So something like this would be happening:

- If the offset is positive (or zero) and smaller than 2048, and the base register comes from a load, the load unit dispatches the load and begins the TLB lookup and cache access using the base address alone, before the adder has produced the effective address. If that guess is right, the load completes in 4 cycles.
- If the offset is negative or at least 2048 (or the base comes from an ALU instruction), no speculation is attempted; the load waits for the effective address and takes the normal 5-cycle path.
- If speculation was attempted but the effective address turns out to be in a different 4K page than the base, the speculative load is cancelled and the load is dispatched again with the correct effective address, for a total of 9 cycles. The second dispatch is what doubles the per-port uop counts.

These steps explain the observed 4, 5, and 9 cycle latencies.



It might happen that the target page is a hugepage. The only way for the load unit to know whether the base address and the effective address point to the same page when using hugepages is to have the TLB supply the load unit with the size of the page being accessed. Then the load unit has to check whether the effective address is within that page. In modern processors, on a TLB miss, dedicated page-walk hardware is used. In this case, I think that the load unit will not supply the cache set index and cache line offset to the data cache and will use the actual effective address to access the TLB. This requires enabling the page-walk hardware to distinguish between loads with speculative addresses and other loads. Only if that other access missed the TLB will the page walk take place. Now if the target page turned out to be a hugepage and it's a hit in the TLB, it might be possible to inform the load unit that the size of the page is larger than 4K or maybe even of the exact size of the page. The load unit can then make a better decision regarding whether the load should be replayed. However, this logic should take no more than the time for the (potentially wrong) data to reach the load buffer allocated for the load. I think this time is only one cycle.






The next sentence in Intel's manual after "can be" is "However, overall latency varies depending on the target register data type due to stack bypass". This very much gives the impression they only said can because it only applies to GP integer. The table does explicitly say that GP integer loads with that addressing mode are 4 cycles, not 4 or 9 cycles. I don't think Intel's weasel words were sufficient to make their manual not wrong for HSW. I'm curious whether we still have the same effect on first-gen SnB, which is what's being documented in that part of the manual.

– Peter Cordes
Sep 16 '18 at 23:04






HW page walk is not microcoded; there is dedicated page-walk hardware that does its own cache loads separate from the load ports (see: What happens after a L2 TLB miss?). Fun fact: in P5 and earlier, the page-walk hardware bypassed the cache (so trapping to a software page walk was actually faster), but P6-family's page walker does cached loads (see: Are page table walks cached?).

– Peter Cordes
Sep 17 '18 at 2:51







BTW, your graphs would be easier to follow if they weren't alternating positive/negative. We know from previous experiment and Intel's manuals that there's never anything weird about [base - constant], so these sawtooths are unexpected / hard-to-follow. You have to carefully read the legend to distinguish +- from -+, and I wouldn't have been able to easily follow which was which if I didn't already know that only positive displacements (negative relative offset in your terminology) could ever be 4 or 9. Especially since the titles just say 0..n, it's unexpected for that to be magnitude.

– Peter Cordes
Sep 17 '18 at 3:45









In your new last paragraph, I'm not sure what point you're making about TLB misses and page walks. I think you have multiple points here. 1. on TLB miss, we have to send the correct address to the page walker, not the speculative one. But mis-speculation can be detected before the first TLB check even completes, as you say in a single cycle (checking for carry-out into the page number from an add it had to do anyway). Oh, and I think you're saying that on mis-speculation it might avoid fetching the data+tags for that set of the VIPT L1d cache? Makes sense, good power optimization.

– Peter Cordes
Sep 17 '18 at 3:53







And 2. you're making the point that if the TLB check included page sizes, it could maybe avoid a replay on crossing a 4k boundary inside a hugepage, but I didn't follow the last sentence.

– Peter Cordes
Sep 17 '18 at 3:56



