A month ago – April 2009 – I was talking with a friend about computational methods, the efficiency of algorithms and, as is common in these areas, implementations of the Newton-Raphson method. He mentioned the Fast InvSqrt function used in Quake III's code, which used to be faster than computing the inverse square root with the native library. That same day I did some research on Google and found several interesting links – ([1] gamedev, [2] Chris Lomont, [3] betterexplained.com) – which taught me about the function:

float InvSqrt (float x) {
	float xhalf = 0.5f*x;
	int i = *(int*)&x;        // reinterpret the float's bits as an int
	i = 0x5f3759df - (i>>1);  // magic constant and bit shift give an initial guess
	x = *(float*)&i;          // reinterpret the bits back as a float
	x = x*(1.5f - xhalf*x*x); // one Newton-Raphson iteration refines the guess
	return x;
}
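
For example, a quick usage sketch of my own (the input 4.0f and the expected output are only an illustration):

#include <stdio.h>

float InvSqrt(float x); /* the function above */

int main(void) {
	/* 1/sqrt(4) = 0.5; InvSqrt returns a close approximation (about 0.4992) */
	printf("%f\n", InvSqrt(4.0f));
	return 0;
}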

Then I compiled the function in some small programs to test the efficiency of InvSqrt against the native implementation, (float)(1.0/sqrt(x)).

The result was a little disappointing. The native implementation was faster than InvSqrt, possibly because architectures and compilers have improved a lot in floating-point arithmetic since 1999, when Quake III was released. I then modified the implementation of the function to use a C union:

/**
* UnionInvSqrt.c
* Author: Pedro Garcia
*
* http://www.sawp.com.br
* Licensed by Creative Commons: http://creativecommons.org/licenses/by-nc-nd/2.5/en/
**/
union casting {
	int i;
	float f;
};

float UnionInvSqrt (float x) {
	union casting cast;
	float xhalf = 0.5f*x;
	cast.f = x;                                 //   replaces i = *(int*)&x;
	cast.i = 0x5f3759df - ((cast.i)>>1);        //   replaces i = 0x5f3759df - (i>>1); and x = *(float*)&i;
	return cast.f*(1.5f - xhalf*cast.f*cast.f); //   replaces x = x*(1.5f - xhalf*x*x);
}

Certainly, my version did not improve on the native implementation either. However, it produces the same numerical value as InvSqrt while offering a small performance improvement over it.
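
A minimal sketch of how the two versions can be compared on a few inputs (the sample values and the output format are my own, not the original test code):

#include <stdio.h>
#include <math.h>

float InvSqrt(float x);      /* pointer-cast version */
float UnionInvSqrt(float x); /* union version */

int main(void) {
	float samples[] = {0.5f, 1.0f, 2.0f, 10.0f, 12345.0f};
	int k;

	for (k = 0; k < 5; k++) {
		float x = samples[k];
		/* both approximations should agree with each other and stay
		   within roughly 0.2% of the library result */
		printf("x=%10.3f  InvSqrt=%.9f  UnionInvSqrt=%.9f  1/sqrt=%.9f\n",
		       x, InvSqrt(x), UnionInvSqrt(x), (float)(1.0 / sqrt(x)));
	}
	return 0;
}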

The way I compared the performance of the functions was very simple: in a pre-defined execution time, which function would generate more terms? In 5 minutes, how many times would each function (1/sqrt, UnionInvSqrt or InvSqrt) run and return?

By running for a pre-determined time, the average speed of term generation can be found as:

average speed = (number of elements generated) / (pre-defined time)

Obviously, the faster implementation will generate more elements in the same pre-determined time; therefore, a higher number of elements means a faster function. Although only the operating system's own processes were running (no graphical interface nor any other application sharing processor time), it is natural for a program to show small variations in execution speed that do not depend on the implementation of the code, but on OS scheduling. So, to make sure that the oscillation caused by scheduling would not distort the comparison, I used a variable to store the sum of the generated elements:

S_N = 1/√1 + 1/√2 + ... + 1/√N

This series is known as the **Riemann Zeta Function**.

ζ(s) = 1/1^s + 1/2^s + 1/3^s + ...

If s = 1/2, we obtain the function that we used for testing:

ζ(1/2) = 1/√1 + 1/√2 + 1/√3 + ...
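
A minimal sketch of the timed loop described above (the use of clock(), the per-iteration time check and the exact output format are my assumptions; the original harness may batch the clock check to reduce overhead):

#include <stdio.h>
#include <time.h>

float InvSqrt(float x); /* swap in UnionInvSqrt or (float)(1.0/sqrt(x)) to compare */

int main(void) {
	const double budget = 300.0; /* 5 minutes, as in the test */
	clock_t start = clock();
	double sum = 0.0;            /* partial sum of the series above */
	unsigned long long n = 0;

	while ((double)(clock() - start) / CLOCKS_PER_SEC < budget) {
		n++;
		sum += InvSqrt((float)n); /* adds 1/sqrt(n) to the series */
	}

	printf("Terms [%llu], Series [%f]\n", n, sum);
	printf("Average speed: %.2f terms/s\n", (double)n / budget);
	return 0;
}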

Thus, the results of the calculations were:

Using InvSqrt(x): Terms [1543696484], Series [380334.639206]

Using 1/sqrt(x): Terms [1623751167], Series [400518.164524]

Using UnionInvSqrt(x): Terms [1569004170], Series [386502.806938]

As I said earlier, the average speed of element generation was highest with the native library. However, if we compare the speed of InvSqrt with that of the union-based implementation, we can see that UnionInvSqrt represents an improvement in speed. The plot below shows the number of elements generated per second:

fig12_fig4_elements_generated_in_time_bestintervall

In the graph above, it can be noted that the term-generation speed of each function is almost constant, since the curves are parallel to the axis of abscissas; that is, df(x)/dx = constant. This speed is 5.2 Mterms/s (megaterms per second) using the native (float) 1/sqrt(x); 5.0 Mterms/s using UnionInvSqrt; and 4.65 Mterms/s using Fast InvSqrt. Since the speeds are almost constant, we can calculate the percentage difference between the speeds of the InvSqrt and UnionInvSqrt implementations:

((5000000 - 4650000) * 100%) / (5000000 + 4650000) = 3.626943005%

I.e., UnionInvSqrt is about 3.63% faster than InvSqrt. Leaving standards compliance aside (the original InvSqrt does not follow the standard either), the implementation using a union represents a small improvement in performance. However, this could have a great impact if the function is used intensively throughout a program.

## Error between UnionInvSqrt and InvSqrt

An interesting thing to observe about the two implementations is how their errors behave. Comparing, on a graph, the relative error of InvSqrt and of UnionInvSqrt with respect to the standard library (float)1.f/sqrt(x), we observe that both behave in much the same way.

fig12_fig5_error_invsqrt

fig12_fig6_error_union

Notice that the curves overlap; we get the same graph. However, if we plot the difference between the errors, a small discrepancy between the implementations becomes visible:

fig12_fig7_diff_invsqrt_union_normal

and this difference has a highly oscillatory behavior:

fig12_fig8_diff_invsqrt_union_zoomed
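
For reference, a minimal sketch of how these curves can be computed, i.e. the relative error of each fast version against the library and the difference between the two errors (the sample range and output format are my own):

#include <stdio.h>
#include <math.h>

float InvSqrt(float x);
float UnionInvSqrt(float x);

int main(void) {
	unsigned int n;

	for (n = 1; n <= 1000; n++) {
		float x = (float)n;
		double ref = 1.0 / sqrt((double)x);
		double e1 = fabs(InvSqrt(x) - ref) / ref;      /* relative error of InvSqrt */
		double e2 = fabs(UnionInvSqrt(x) - ref) / ref; /* relative error of UnionInvSqrt */
		printf("%u %.12e %.12e %.12e\n", n, e1, e2, e1 - e2);
	}
	return 0;
}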

By closely observing the two difference graphs above, it can be noted that they have no correlation with time. That is, what is being analyzed is numerical behavior, which does not depend on the state of the machine. This difference is not evident in the graph of the relative errors because the relative error of both implementations is around 10^-6, while the difference between UnionInvSqrt and InvSqrt is in the range of 10^-10, much smaller than the error of either implementation. In theory, the difference between the outputs of the two implementations should be 0, since both functions are expected to behave identically. However, using only one variable in UnionInvSqrt generates slightly different behavior from using two variables in InvSqrt.

## InvSqrt in Java

The method is good enough to be used on other platforms and in other languages. I wondered how this test would look in Java. The implementation of InvSqrt in Java is as follows:
/**
* Java InvSqrt
* Author: Pedro Garcia
* sawp@sawp.com.br
* http://www.sawp.com.br
* Licensed by Creative Commons: http://creativecommons.org/licenses/by-nc-nd/2.5/en/
**/

static strictfp float InvSqrt(float x){
	float xhalf = 0.5f * x;
	int i = Float.floatToRawIntBits(x); // get the raw IEEE 754 bit pattern of the float as an int

	i = 0x5f3759df - (i >> 1);          // magic constant and bit shift give an initial guess
	x = Float.intBitsToFloat(i);        // reinterpret the bits back as a float
	x = x * (1.5f - xhalf * x * x);     // one Newton-Raphson iteration refines the guess

	return x;
}

I applied the same tests as in C, measuring the error and the difference in efficiency between InvSqrt and the native library, (float)(1.f/Math.sqrt(x)). In the graphs, we can observe that there is no efficiency advantage in using InvSqrt:

fig12_fig9_termosporsegundoemjava_1h1

fig12_fig10_termosporsegundoemjava_1h3

Both implementations showed the same behavior in the generation of elements per second. An interesting fact could be observed with both functions: the number of terms generated over time has a sinusoidal behavior. That is, the speed varies with time, unlike the C implementation. Looking at the curves of the two functions, we can see that the InvSqrt approximation is very close to 1.f/Math.sqrt(x):

fig12_fig11_invsqrt_math_sobrepostos

The error of InvSqrt with respect to the standard library behaves the same way as in the original C implementation:

fig12_fig12_erro2

fig12_fig13_erro4

## InvSqrt in C#

I also found an implementation of InvSqrt in C#, using the BitConverter class, at stackoverflow.com:
float InvSqrt(float x)
{
    float xhalf = 0.5f * x;
    int i = BitConverter.ToInt32(BitConverter.GetBytes(x), 0);
    i = 0x5f3759df - (i >> 1);
    x = BitConverter.ToSingle(BitConverter.GetBytes(i), 0);
    x = x * (1.5f - xhalf * x * x);
    return x;
}

I have not tested the efficiency or other characteristics of this function under .NET. To learn more, see: http://stackoverflow.com/questions/268853/is-it-possible-to-write-quakes-fast-invsqrt-function-in-c

That's it! Unfortunately I do not have an old computer on which to test the union-based implementation or InvSqrt in Java. However, if you want to see the results of my tests, the sources, or to run the tests on other architectures and older machines, you can download the files at:

http://www.sawp.com.br/sources/InvSqrtC/

http://www.sawp.com.br/sources/BitsInJava/

To learn more about InvSqrt, Wikipedia has an excellent set of information about it: http://en.wikipedia.org/wiki/Fast_inverse_square_root

## InvSqrt in Fortran 90

The same bit-level trick can also be written in Fortran 90, using the transfer intrinsic to reinterpret the real's bits as integers:
real function InvSqrt(x)
	implicit none

	type casting
		real :: x		
	end type casting

	real, intent(in) :: x

	! variable whose memory will be reinterpreted via transfer
	type(casting), target :: pointerTo

	! Encode data as an array of integers
	integer, dimension(:), allocatable :: enc
	integer :: length
	real :: xhalf

	xhalf = 0.5 * x

	! copy x into the derived type so its bit pattern can be reinterpreted
	pointerTo%x = x

	! find how many default integers are needed to hold the type, then encode it
	length = size(transfer(pointerTo, enc))
	allocate(enc(length))
	enc = transfer(pointerTo, enc)

	! magic constant and bit shift give the initial guess (1597463007 = 0x5f3759df)
	enc(1) = 1597463007 - ishft(enc(1), -1)

	! decode the integers back into the real
	pointerTo = transfer(enc, pointerTo)
	
	! dealloc
	deallocate(enc)

	InvSqrt = pointerTo%x*(1.5 - xhalf*pointerTo%x*pointerTo%x)
end function InvSqrt

## References

[1] http://www.gamedev.net/community/forums/topic.asp?topic_id=139956
[2] Chris Lomont, "Fast Inverse Square Root", http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf (see also www.math.purdue.edu/~clomont)
[3] http://betterexplained.com/articles/understanding-quakes-fast-inverse-square-root/