Fast single precision reciprocal square root in C++ on very low precision




I have this line of C++:


c[i] = sqrtf(a[i]);



and the generated assembly looks like this:


002D11D0 vsqrtps ymm0,ymmword ptr a (202D3380h)[eax]



With the line


c[i] = 1.0f / sqrtf(a[i]);



I get this assembly:


00E71210 vrsqrtps ymm1,ymm0
00E71214 vmulps ymm0,ymm1,ymm0
00E71218 vmulps ymm0,ymm0,ymm1
00E7121C vsubps ymm0,ymm0,ymm6
00E71220 vmulps ymm0,ymm0,ymm1
00E71224 vmulps ymm0,ymm0,ymm7



This is reasonable, because vrsqrtps is much faster than vsqrtps. For the reciprocal square root it is simply faster to call the approximate instruction vrsqrtps and then refine the estimate with a Newton-Raphson iteration to get a more precise value.



My question is:
Is it possible to tell the compiler that the additional refinement is not necessary, so that the assembly contains no extra multiplications? An error of ~1.5 * 2^-12 is fully sufficient for me, since I want to sum thousands of these results, and many bits of accuracy will be dropped in the process anyway. I would prefer a solution that does not involve inlining assembly code into the C++ code.



(after edit) Compiler command line:


/GS /Qpar /GL /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /Ox /Ob2 /sdl /Fd"Releasevc141.pdb" /Zc:inline /fp:fast /D "_MBCS" /errorReport:prompt /WX- /Zc:forScope /arch:AVX2 /Gd /Oy- /Oi /MD /Fa"Release" /EHsc /nologo /Fo"Release" /Ot /Fp"Releaseperformancetest.pch" /diagnostics:classic






Maybe you can use a genetic algorithm to derive a short polynomial that estimates 1/sqrt. I think the magic numbers can be generated like this: a1 + a2*x + a3*x*x + ..., with as many terms as you need. I don't know how vrsqrtps compares to these in speed.

– huseyin tugrul buyukisik
Sep 16 '18 at 15:11







which compiler do you use?

– user32434999
Sep 16 '18 at 15:39






The magic number trick is already implemented in the rsqrt instruction. It would not be wise to write it on my own, since I guess this instruction has exactly the same latency and throughput as one ordinary multiplication. (I use the standard Microsoft C++ compiler for Visual Studio 2017.)

– Marek Basovník
Sep 16 '18 at 15:56







"Is it possible to tell the compiler that additional iterations are not necessary?": I don't think there is. You need to use intrinsics to solve the problem. It is not that hard to do, and you can achieve the same vectorization as the compiler does.

– geza
Sep 16 '18 at 17:34






I was finally able to reproduce it on godbolt, given that you initially didn't provide even the compiler, and then neither the flags nor the full source code. (You can ignore the aliasing checks and alternative assembly; I was too lazy to figure out how to tell MSVC that the data is both aligned and __restrict.) I recommend including this information in the future, especially for questions that are highly compiler- and flag-specific. Not only will it help you avoid possible downvotes, it will also help people help you.

– Arne Vogel
Sep 16 '18 at 17:44

