Fast single-precision reciprocal square root in C++ with very low precision
I have this line in C++:
c[i] = sqrtf(a[i]);
and the generated assembly looks like this:
002D11D0 vsqrtps ymm0,ymmword ptr a (202D3380h)[eax]
With the line
c[i] = 1.0f / sqrtf(a[i]);
I get this assembly:
00E71210 vrsqrtps ymm1,ymm0
00E71214 vmulps ymm0,ymm1,ymm0
00E71218 vmulps ymm0,ymm0,ymm1
00E7121C vsubps ymm0,ymm0,ymm6
00E71220 vmulps ymm0,ymm0,ymm1
00E71224 vmulps ymm0,ymm0,ymm7
This is obviously reasonable because vrsqrtps is much faster than vsqrtps. So for the reciprocal of a square root, it is simply faster to call the inaccurate vrsqrtps instruction and then iterate to get a more precise value.
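For reference, the instructions after vrsqrtps in the listing above look like a single Newton-Raphson step for 1/sqrt(x). Here is a minimal scalar sketch of that arithmetic. It assumes ymm6 holds 3.0 and ymm7 holds -0.5 (the listing does not show the constants), and refine_rsqrt is a hypothetical name used only for illustration:

```cpp
// Scalar sketch of the refinement sequence that follows vrsqrtps.
// x  : the input value
// y0 : a rough estimate of 1/sqrt(x), e.g. what vrsqrtps returns
// The constants 3.0f and -0.5f are assumptions about ymm6 and ymm7.
float refine_rsqrt(float x, float y0) {
    float t = x * y0;   // vmulps: x * y0
    t = t * y0;         // vmulps: x * y0 * y0
    t = t - 3.0f;       // vsubps: x * y0^2 - 3
    t = t * y0;         // vmulps: y0 * (x * y0^2 - 3)
    return t * -0.5f;   // vmulps: 0.5 * y0 * (3 - x * y0^2)
}
```

Dropping this step leaves only the raw estimate, which is what the question below asks for.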
And my question is:
Is it possible to tell the compiler that the additional iterations are not necessary, so that the assembly has no additional multiplications? An error of ~1.5 * 2^-12 is fully sufficient for me, since I want to add up thousands of these results, where many bits of accuracy will be dropped anyway. I would prefer a way that does not involve inlining assembly code into the C++ code.
(after edit) Compiler command line:
/GS /Qpar /GL /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /Ox /Ob2 /sdl /Fd"Releasevc141.pdb" /Zc:inline /fp:fast /D "_MBCS" /errorReport:prompt /WX- /Zc:forScope /arch:AVX2 /Gd /Oy- /Oi /MD /Fa"Release" /EHsc /nologo /Fo"Release" /Ot /Fp"Releaseperformancetest.pch" /diagnostics:classic
Which compiler do you use?
– user32434999
Sep 16 '18 at 15:39
The magic number trick is already implemented in the rsqrt instruction. It would not be wise to write it on my own, since I guess this instruction has exactly the same latency and throughput as one ordinary multiplication. (I use the standard Microsoft C++ compiler for Visual Studio 2017.)
– Marek Basovník
Sep 16 '18 at 15:56
"Is it possible to tell the compiler that additional iterations are not necessary?": I don't think there is. You need to use intrinsics to solve the problem. It is not that hard to do, and you can achieve the same vectorization as the compiler does.
– geza
Sep 16 '18 at 17:34
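A minimal sketch of the intrinsics approach suggested in the comment above, using only the hardware estimate with no refinement step. The function name rsqrt_approx and the loop structure are hypothetical and not from the thread. The 256-bit path is compiled only when the compiler defines __AVX__ (assumed to be the case with /arch:AVX2); otherwise the 128-bit SSE path does the work:

```cpp
#include <cstddef>
#include <immintrin.h>  // x86 SSE/AVX intrinsics

// Hypothetical helper: c[i] ~= 1/sqrt(a[i]) from the hardware estimate only
// (relative error up to about 1.5 * 2^-12), with no Newton-Raphson step.
void rsqrt_approx(const float* a, float* c, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    // 8 floats per iteration; this is the vrsqrtps ymm form.
    for (; i + 8 <= n; i += 8) {
        __m256 x = _mm256_loadu_ps(a + i);
        _mm256_storeu_ps(c + i, _mm256_rsqrt_ps(x));
    }
#endif
    // 4 floats per iteration with the SSE form.
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(a + i);
        _mm_storeu_ps(c + i, _mm_rsqrt_ps(x));
    }
    // Scalar tail for the remaining elements.
    for (; i < n; ++i) {
        c[i] = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(a[i])));
    }
}
```

Unaligned loads and stores are used so the sketch makes no assumption about how a and c are allocated. If the arrays are known to be 32-byte aligned, the aligned variants could be used instead.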
I was finally able to reproduce it on godbolt given that you initially didn't provide even the compiler, and then not the flags nor full source code. (You can ignore the aliasing checks and alternative assembly, I was too lazy to figure out how to tell MSVC that the data is both aligned and __restrict.) I recommend that you include this information in the future, especially for questions which are highly compiler- and flag-specific. Not only will it allow you to avoid possible downvotes, but it will also help people help you.
– Arne Vogel
Sep 16 '18 at 17:44
Maybe you can use a genetic algorithm to derive a short polynomial that gives an estimate of 1/sqrt. I think magic numbers can be generated like this: a1 + a2*x + a3*x*x + ..., with as many terms as you need. I don't know how much faster vrsqrtps is than these.
– huseyin tugrul buyukisik
Sep 16 '18 at 15:11