vgetmantps vs andpd instructions for getting the mantissa of float
For Skylake-X (from Agner Fog's instruction tables):
+-----------------------+-------------+-------------------+---------------------+----------------+---------+-----------------------+
| Instruction           | Operands    | µops fused domain | µops unfused domain | µops each port | Latency | Reciprocal throughput |
+-----------------------+-------------+-------------------+---------------------+----------------+---------+-----------------------+
| VGETMANTPS/PD         | v,v,v       | 1                 | 1                   | p01/05         | 4       | 0.5-1                 |
| AND/ANDN/OR/XORPS/PD  | x,x / y,y,y | 1                 | 1                   | p015           | 1       | 0.33                  |
+-----------------------+-------------+-------------------+---------------------+----------------+---------+-----------------------+
Does that mean using a bitmask and a logical AND to get the mantissa of a float is faster than using the vgetmantps instruction?
How much latency is there for transferring the number from float to int and back to float?
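For concreteness, here is a minimal sketch of the two approaches being compared (my own illustration, not code from the question; it assumes AVX2 for the mask version, AVX-512F + AVX-512VL for vgetmantps, and that the 23-bit field 0x007FFFFF is the mantissa wanted):

    #include <immintrin.h>

    /* Raw mantissa bits via a bitmask: reinterpret the floats as integers
       (a cast, not a conversion instruction) and AND away sign and exponent.
       The result is the 23 stored fraction bits, without the implicit
       leading 1 of normal numbers. */
    static inline __m256i mantissa_bits_and(__m256 v)
    {
        const __m256i mask = _mm256_set1_epi32(0x007FFFFF);   /* low 23 bits */
        return _mm256_and_si256(_mm256_castps_si256(v), mask);
    }

    /* Normalized significand as a float via vgetmantps (AVX-512F + AVX-512VL):
       each element becomes a value in [1.0, 2.0), keeping the source sign. */
    static inline __m256 mantissa_getmant(__m256 v)
    {
        return _mm256_getmant_ps(v, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_src);
    }

Note that the two results are not interchangeable: one is an integer bit field, the other a ready-to-use float.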
Overtly, yes, it says ANDs are faster. However, they do different things. VGETMANTPS gets a normalized significand. That means it is suitable for more floating-point arithmetic, whereas the AND just gets you the raw bits (without the leading-bit handling needed for normal/subnormal numbers). And performance optimization at this level is greatly subject to context: one instruction may be using processor units that you already have in use in neighboring instructions. Although it looks like they use the same units? What is Agner Fog's p01/05 notation? Does that mean it uses two units?
– Eric Postpischil
Sep 10 '18 at 16:17
@EricPostpischil: p01/p05 means it runs on either p0 or p1 for the xmm/ymm version (i.e. on the FMA units), but the ZMM version runs on p0 or p5. SKX shuts down port 1 when 512-bit uops are active, combines those two 256-bit FMA units into a single 512-bit FMA unit on p0, and powers up the extra 512-bit FMA unit on port 5. (Some SKX models don't have the extra port 5 FMA, so only have half throughput for 512-bit uops that need the FMA unit, which is actually a lot of them, including integer multiply and shift). The p5 FMA unit has worse latency, I think; it's physically distant.
– Peter Cordes
Sep 10 '18 at 16:24
1 Answer
For implementing log(x), you want the mantissa and exponent as float, and vgetmantps / vgetexpps are perfect for it. See Efficient implementation of log2(__m256d) in AVX2. This is what those instructions are for, and they do speed up a fast approximation to log2(x). (Plus, vgetmantps can normalize the significand to -0.5 .. +0.5 instead of 0..1, or other neat ranges, to create input for a polynomial approximation to log(x+1) or whatever. See its docs.)
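A hedged sketch of that use (not the code from the linked answer; the function name and the [1,2) normalization choice are mine):

    #include <immintrin.h>

    /* Split x so that log2(x) = exponent + log2(mantissa), with the
       mantissa normalized into [1.0, 2.0). Requires AVX-512F + AVX-512VL.
       A polynomial in (mantissa - 1.0) would finish the approximation. */
    static inline void log2_split(__m256 x, __m256 *exponent, __m256 *mantissa)
    {
        *exponent = _mm256_getexp_ps(x);   /* floor(log2(|x|)) as a float */
        *mantissa = _mm256_getmant_ps(x, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_zero);
    }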
If you only want the mantissa as an integer, then sure AND away the other bits and you're done in one instruction.
(But remember that for a NaN, the mantissa is the NaN payload, so if you need to do something different for NaN then you need to check the exponent.)
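A minimal sketch of that exponent check (my own illustration of the caveat, assuming the IEEE-754 binary32 layout):

    #include <immintrin.h>

    /* Lanes whose exponent field is all-ones hold NaN or infinity, so their
       "mantissa" bits are a NaN payload (or zero for Inf) rather than a
       significand. */
    static inline __m256i nan_or_inf_mask(__m256 v)
    {
        const __m256i exp_mask = _mm256_set1_epi32(0x7F800000);
        __m256i expf = _mm256_and_si256(_mm256_castps_si256(v), exp_mask);
        return _mm256_cmpeq_epi32(expf, exp_mask);   /* all-ones lane = NaN/Inf */
    }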
How much latency is there for transferring the number from float to int and back to float?
You already have Agner Fog's instruction tables (https://agner.org/optimize/). On Skylake (SKL and SKX), VCVT(T)PS2DQ
has 4c latency on the FMA ports, and so does the other direction (VCVTDQ2PS).
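If the question is about actual value conversion rather than a bitwise reinterpret, the round trip looks like this (a sketch; the roughly 4c-per-direction figure is the one from the tables quoted above):

    #include <immintrin.h>

    /* Value conversion both ways: vcvttps2dq and vcvtdq2ps each cost about
       4 cycles of latency on Skylake (they run on the FMA ports), so the
       round trip adds roughly 8 cycles to the dependency chain. */
    static inline __m256 round_trip_convert(__m256 v)
    {
        __m256i i = _mm256_cvttps_epi32(v);   /* float -> int32, truncating */
        return _mm256_cvtepi32_ps(i);         /* int32 -> float */
    }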
Or are you asking about bypass latency for using the output of an FP instruction like andps
as the input to an integer instruction?
Agner Fog's microarch PDF has some info about bypass latency for sending data between vec-int and fp domains, but not many specifics.
Skylake's bypass latency is weird: unlike on previous uarches, it depends on which port the instruction actually picked. andps
has no bypass latency between FP instructions if it runs on port 5, but if it runs on p0 or p1, it has an extra 1c of latency.
See Intel's optimization manual for a table of domain-crossing latencies broken down by domain+execution-port.
(And just to be extra weird, this bypass-delay latency affects that register forever, even after it has definitely been written back to a physical register and isn't being forwarded over the bypass network. vpaddd xmm0, xmm1, xmm2 has 2c latency for both inputs if either input came from vmulps. But some shuffles and other instructions work in either domain. It has been a while since I experimented with this and I didn't check my notes, so this example might not be exactly right, but it's something like this.)
(Intel's optimization manual doesn't mention this permanent effect which lasts until you overwrite the architectural register with a new value. So be careful about creating FP constants ahead of a loop with integer instructions.)
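To make that caveat concrete, here is a rough sketch (my illustration, not from the answer) of the two ways of materializing the same constant:

    #include <immintrin.h>
    #include <string.h>

    /* Constant built with integer instructions: the all-ones idiom plus a
       shift. The register now carries vec-int domain history, so FP
       consumers inside a loop may keep paying the extra bypass-delay
       latency described above. */
    static inline __m256 mantissa_mask_from_int_ops(void)
    {
        __m256i ones = _mm256_cmpeq_epi32(_mm256_setzero_si256(),
                                          _mm256_setzero_si256());
        return _mm256_castsi256_ps(_mm256_srli_epi32(ones, 9)); /* 0x007FFFFF per lane */
    }

    /* Same constant loaded from memory with a broadcast instead, so no
       integer ALU uop ever touches it. */
    static inline __m256 mantissa_mask_from_memory(void)
    {
        static const unsigned bits = 0x007FFFFFu;
        float f;
        memcpy(&f, &bits, sizeof f);          /* type-pun without UB */
        return _mm256_broadcast_ss(&f);
    }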
On the int/float latency, I think they may have been asking about the cost of using a floating-point instruction on data from some prior integer instruction. Even though the architectural register, such as %xmm1, is the same, some processors keep different instances for integer and float instructions. When you change instructions, the data has to be moved internally. But I think ANDPS counts as a float instruction, so there would be no cost.
– Eric Postpischil
Sep 10 '18 at 16:51
@EricPostpischil: Ah, that makes the question make more sense. Updated.
– Peter Cordes
Sep 10 '18 at 17:01
If I also count the conversion from float to int and then int back to float, vgetmantps is faster for ripping the mantissa out of an array of floats and storing the result in place.
– R zu
Sep 10 '18 at 15:55