Pow()与const非整数指数的优化?
我在我的代码中有热点,我正在做pow()
,占我执行时间的10-20%左右。
我对pow(x,y)
input是非常具体的,所以我想知道是否有一种方法能够以更高的性能滚动两个pow()
近似值(每个指数一个)
- 我有两个常数指数:2.4和1 / 2.4。
- 当指数为2.4时, x将在范围内(0.090473935,1.0)。
- 当指数为1 / 2.4时, x将在(0.0031308,1.0)的范围内。
- 我正在使用SSE / AVX
float
向量。 如果平台具体可以被利用,就对!
尽pipe我对全精度( float
)algorithm也很感兴趣,但最大错误率在0.01%左右是理想的。
我已经使用了一个快速的pow()
近似 ,但是没有考虑到这些限制。 有没有可能做得更好?
在IEEE 754黑客脉络中,这是另一个更快,更“神奇”的解决scheme。 它在十几个时钟周期内达到了0.08%的误差(对于p = 2.4的情况,在Intel Merom CPU上)。
浮点数最初是作为对数的近似值发明的,因此您可以使用整数值作为log2
的近似值。 通过将convert-from-integer指令应用于浮点值,可以实现这个function,以获得另一个浮点值。
要完成pow
计算,可以乘以一个常数因子,并将对数转换为convert-to-integer指令。 在SSE上,相关说明是cvtdq2ps
和cvtps2dq
。
虽然这并不那么简单。 IEEE 754中的指数字段是有符号的,偏差值为127,表示指数为零。 在乘以对数之前,必须先消除这种偏差,然后在求幂之前重新添加。 此外,通过减法的偏差调整不会在零上工作。 幸运的是,两种调整都可以通过预先乘以一个常数来实现。
x^p = exp2( p * log2( x ) ) = exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 ) = cvtps2dq( p * ( log2( x ) + 127 - 127 - 127 / p ) ) = cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) ) = cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) ) = cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )
exp2( 127 / p - 127 )
是常数因子。 这个函数是相当专业的:它不会用小分数指数工作,因为常数因数随着指数的倒数呈指数增长,并且会溢出。 它不会与负指数一起工作。 大的指数会导致很高的误差,因为尾数位与乘法的指数位混杂在一起。
但是,这只是4个快速指令。 预乘,从“整数”(以对数),功率乘,转换为“整数”(从对数)。 在这个SSE的实施上,转换速度非常快。 我们也可以把一个额外的常系数挤进第一个乘法中。
template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden > __m128 fastpow( __m128 arg ) { __m128 ret = arg; // std::printf( "arg = %,vg\n", ret ); // Apply a constant pre-correction factor. ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. ) * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) ); // std::printf( "scaled = %,vg\n", ret ); // Reinterpret arg as integer to obtain logarithm. asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "log = %,vg\n", ret ); // Multiply logarithm by power. ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) ); // std::printf( "powered = %,vg\n", ret ); // Convert back to "integer" to exponentiate. asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "result = %,vg\n", ret ); return ret; }
指数= 2.4的一些试验显示,这一直高估了大约5%。 (例程总是保证高估。)你可以简单地乘以0.95,但是更多的指令会得到大约4位十进制数字的精度,这对graphics来说应该足够了。
关键是要高估低估,取平均水平。
- 计算x ^ 0.8:四条指令,错误〜+ 3%。
- 计算x ^ -0.4:一个
rsqrtps
。 (这是相当准确的,但是牺牲零能力的工作。) - 计算x ^ 0.4:一个
mulps
。 - 计算x ^ -0.2:一个
rsqrtps
。 - 计算x ^ 2:一个
mulps
。 - 计算x ^ 3:一个
mulps
。 - x ^ 2.4 = x ^ 2 * x ^ 0.4:一个
mulps
。 这是高估。 - x ^ 2.4 = x ^ 3 * x ^ -0.4 * x ^ -0.2:两个
mulps
。 这是低估。 - 平均以上:一个
addps
,一个mulps
。
指令计数:十四,其中包括两个等待时间= 5的转换和两个倒数平方根估计,吞吐量= 4。
为了适当地取平均值,我们想用他们的预期误差加权估计。 低估将错误提高到0.6比0.4的功率,所以我们预计它是错误的1.5倍。 加权不会添加任何说明; 它可以在前期因素中完成。 调用系数a:a ^ 0.5 = 1.5 a ^ -0.75,a = 1.38316186。
最终的误差大约是0.015%,或比最初的fastpow
结果好2个数量级。 对于具有volatile
源variables和目标variables的繁忙循环,运行时间大约是十几个周期……虽然它与迭代重叠,但是实际使用情况也将看到指令级并行性。 考虑到SIMD,这是每3个周期一个标量结果的吞吐量!
int main() { __m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 ); std::printf( "Input: %,vg\n", x0 ); // Approx 5% accuracy from one call. Always an overestimate. __m128 x1 = fastpow< 24, 10, 1, 1 >( x0 ); std::printf( "Direct x^2.4: %,vg\n", x1 ); // Lower exponents provide lower initial error, but too low causes overflow. __m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 ); std::printf( "1.38 x^0.8: %,vg\n", xf ); // Imprecise 4-cycle sqrt is still far better than fastpow, good enough. __m128 xfm4 = _mm_rsqrt_ps( xf ); __m128 xf4 = _mm_mul_ps( xf, xfm4 ); // Precisely calculate x^2 and x^3 __m128 x2 = _mm_mul_ps( x0, x0 ); __m128 x3 = _mm_mul_ps( x2, x0 ); // Overestimate of x^2 * x^0.4 x2 = _mm_mul_ps( x2, xf4 ); // Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4. __m128 xfm2 = _mm_rsqrt_ps( xf4 ); x3 = _mm_mul_ps( x3, xfm4 ); x3 = _mm_mul_ps( x3, xfm2 ); std::printf( "x^2 * x^0.4: %,vg\n", x2 ); std::printf( "x^3 / x^0.6: %,vg\n", x3 ); x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 ) ); // Final accuracy about 0.015%, 200x better than x^0.8 calculation. std::printf( "average = %,vg\n", x2 ); }
那么…对不起,我无法提前发表。 并将其扩展到x ^ 1 / 2.4作为练习; v)。
更新与统计
我实施了一个小testing线束和两个相应的x ( 5/12 )的情况。
#include <cstdio> #include <xmmintrin.h> #include <cmath> #include <cfloat> #include <algorithm> using namespace std; template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden > __m128 fastpow( __m128 arg ) { __m128 ret = arg; // std::printf( "arg = %,vg\n", ret ); // Apply a constant pre-correction factor. ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. ) * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) ); // std::printf( "scaled = %,vg\n", ret ); // Reinterpret arg as integer to obtain logarithm. asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "log = %,vg\n", ret ); // Multiply logarithm by power. ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) ); // std::printf( "powered = %,vg\n", ret ); // Convert back to "integer" to exponentiate. asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "result = %,vg\n", ret ); return ret; } __m128 pow125_4( __m128 arg ) { // Lower exponents provide lower initial error, but too low causes overflow. __m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg ); // Imprecise 4-cycle sqrt is still far better than fastpow, good enough. __m128 xfm4 = _mm_rsqrt_ps( xf ); __m128 xf4 = _mm_mul_ps( xf, xfm4 ); // Precisely calculate x^2 and x^3 __m128 x2 = _mm_mul_ps( arg, arg ); __m128 x3 = _mm_mul_ps( x2, arg ); // Overestimate of x^2 * x^0.4 x2 = _mm_mul_ps( x2, xf4 ); // Get x^-0.2 from x^0.4, and square it for x^-0.4. Combine into x^-0.6. __m128 xfm2 = _mm_rsqrt_ps( xf4 ); x3 = _mm_mul_ps( x3, xfm4 ); x3 = _mm_mul_ps( x3, xfm2 ); return _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 * 0.9999 ) ); } __m128 pow512_2( __m128 arg ) { // 5/12 is too small, so compute the sqrt of 10/12 instead. __m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg ); return _mm_mul_ps( _mm_rsqrt_ps( x ), x ); } __m128 pow512_4( __m128 arg ) { // 5/12 is too small, so compute the 4th root of 20/12 instead. // 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow. // weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3 __m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg ); __m128 xover = _mm_mul_ps( arg, xf ); __m128 xfm1 = _mm_rsqrt_ps( xf ); __m128 x2 = _mm_mul_ps( arg, arg ); __m128 xunder = _mm_mul_ps( x2, xfm1 ); // sqrt2 * over + 2 * sqrt2 * under __m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ), _mm_add_ps( xover, xunder ) ); xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) ); xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) ); return xavg; } __m128 mm_succ_ps( __m128 arg ) { return (__m128) _mm_add_epi32( (__m128i) arg, _mm_set1_epi32( 4 ) ); } void test_pow( double p, __m128 (*f)( __m128 ) ) { __m128 arg; for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON ); ! isfinite( _mm_cvtss_f32( f( arg ) ) ); arg = mm_succ_ps( arg ) ) ; for ( ; _mm_cvtss_f32( f( arg ) ) == 0; arg = mm_succ_ps( arg ) ) ; std::printf( "Domain from %g\n", _mm_cvtss_f32( arg ) ); int n; int const bucket_size = 1 << 25; do { float max_error = 0; double total_error = 0, cum_error = 0; for ( n = 0; n != bucket_size; ++ n ) { float result = _mm_cvtss_f32( f( arg ) ); if ( ! isfinite( result ) ) break; float actual = ::powf( _mm_cvtss_f32( arg ), p ); float error = ( result - actual ) / actual; cum_error += error; error = std::abs( error ); max_error = std::max( max_error, error ); total_error += error; arg = mm_succ_ps( arg ); } std::printf( "error max = %8g\t" "avg = %8g\t" "|avg| = %8g\t" "to %8g\n", max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) ); } while ( n == bucket_size ); } int main() { std::printf( "4 insn x^12/5:\n" ); test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > ); std::printf( "14 insn x^12/5:\n" ); test_pow( 12./5, & pow125_4 ); std::printf( "6 insn x^5/12:\n" ); test_pow( 5./12, & pow512_2 ); std::printf( "14 insn x^5/12:\n" ); test_pow( 5./12, & pow512_4 ); }
输出:
4 insn x^12/5: Domain from 1.36909e-23 error max = inf avg = inf |avg| = inf to 8.97249e-19 error max = 2267.14 avg = 139.175 |avg| = 139.193 to 5.88021e-14 error max = 0.123606 avg = -0.000102963 |avg| = 0.0371122 to 3.85365e-09 error max = 0.123607 avg = -0.000108978 |avg| = 0.0368548 to 0.000252553 error max = 0.12361 avg = 7.28909e-05 |avg| = 0.037507 to 16.5513 error max = 0.123612 avg = -0.000258619 |avg| = 0.0365618 to 1.08471e+06 error max = 0.123611 avg = 8.70966e-05 |avg| = 0.0374369 to 7.10874e+10 error max = 0.12361 avg = -0.000103047 |avg| = 0.0371122 to 4.65878e+15 error max = 0.123609 avg = nan |avg| = nan to 1.16469e+16 14 insn x^12/5: Domain from 1.42795e-19 error max = inf avg = nan |avg| = nan to 9.35823e-15 error max = 0.000936462 avg = 2.0202e-05 |avg| = 0.000133764 to 6.13301e-10 error max = 0.000792752 avg = 1.45717e-05 |avg| = 0.000129936 to 4.01933e-05 error max = 0.000791785 avg = 7.0132e-06 |avg| = 0.000129923 to 2.63411 error max = 0.000787589 avg = 1.20745e-05 |avg| = 0.000129347 to 172629 error max = 0.000786553 avg = 1.62351e-05 |avg| = 0.000132397 to 1.13134e+10 error max = 0.000785586 avg = 8.25205e-06 |avg| = 0.00013037 to 6.98147e+12 6 insn x^5/12: Domain from 9.86076e-32 error max = 0.0284339 avg = 0.000441158 |avg| = 0.00967327 to 6.46235e-27 error max = 0.0284342 avg = -5.79938e-06 |avg| = 0.00897913 to 4.23516e-22 error max = 0.0284341 avg = -0.000140706 |avg| = 0.00897084 to 2.77556e-17 error max = 0.028434 avg = 0.000440504 |avg| = 0.00967325 to 1.81899e-12 error max = 0.0284339 avg = -6.11153e-06 |avg| = 0.00897915 to 1.19209e-07 error max = 0.0284298 avg = -0.000140597 |avg| = 0.00897084 to 0.0078125 error max = 0.0284371 avg = 0.000439748 |avg| = 0.00967319 to 512 error max = 0.028437 avg = -7.74294e-06 |avg| = 0.00897924 to 3.35544e+07 error max = 0.0284369 avg = -0.000142036 |avg| = 0.00897089 to 2.19902e+12 error max = 0.0284368 avg = 0.000439183 |avg| = 0.0096732 to 1.44115e+17 error max = 0.0284367 avg = -7.41244e-06 |avg| = 0.00897923 to 9.44473e+21 error max = 0.0284366 avg = -0.000141706 |avg| = 0.00897088 to 6.1897e+26 error max = 0.485129 avg = -0.0401671 |avg| = 0.048422 to 4.05648e+31 error max = 0.994932 avg = -0.891494 |avg| = 0.891494 to 2.65846e+36 error max = 0.999329 avg = nan |avg| = nan to -0 14 insn x^5/12: Domain from 2.64698e-23 error max = 0.13556 avg = 0.00125936 |avg| = 0.00354677 to 1.73472e-18 error max = 0.000564988 avg = 2.51458e-06 |avg| = 0.000113709 to 1.13687e-13 error max = 0.000565065 avg = -1.49258e-06 |avg| = 0.000112553 to 7.45058e-09 error max = 0.000565143 avg = 1.5293e-06 |avg| = 0.000112864 to 0.000488281 error max = 0.000565298 avg = 2.76457e-06 |avg| = 0.000113713 to 32 error max = 0.000565453 avg = -1.61276e-06 |avg| = 0.000112561 to 2.09715e+06 error max = 0.000565531 avg = 1.42628e-06 |avg| = 0.000112866 to 1.37439e+11 error max = 0.000565686 avg = 2.71505e-06 |avg| = 0.000113715 to 9.0072e+15 error max = 0.000565763 avg = -1.56586e-06 |avg| = 0.000112415 to 1.84467e+19
我怀疑更准确的5/12是由rsqrt
操作限制的rsqrt
。
另一个答案,因为这是从我以前的答案非常不同,这是快速的。 相对误差是3e-8。 想要更精确? 再加几个Chebychev条款。 最好保持奇数的顺序,因为这会导致2 ^ n-epsilon和2 ^ n + epsilon之间的小的不连续性。
#include <stdlib.h> #include <math.h> // Returns x^(5/12) for x in [1,2), to within 3e-8 (relative error). // Want more precision? Add more Chebychev polynomial coefs. double pow512norm ( double x) { static const int N = 8; // Chebychev polynomial terms. // Non-zero terms calculated via // integrate (2/pi)*ChebyshevT[n,u]/sqrt(1-u^2)*((u+3)/2)^(5/12) // from -1 to 1 // Zeroth term is similar except it uses 1/pi rather than 2/pi. static const double Cn[N] = { 1.1758200232996901923, 0.16665763094889061230, -0.0083154894939042125035, 0.00075187976780420279038, // Wolfram alpha doesn't want to compute the remaining terms // to more precision (it times out). -0.0000832402, 0.0000102292, -1.3401e-6, 1.83334e-7}; double Tn[N]; double u = 2.0*x - 3.0; Tn[0] = 1.0; Tn[1] = u; for (int ii = 2; ii < N; ++ii) { Tn[ii] = 2*u*Tn[ii-1] - Tn[ii-2]; } double y = 0.0; for (int ii = N-1; ii >= 0; --ii) { y += Cn[ii]*Tn[ii]; } return y; } // Returns x^(5/12) to within 3e-8 (relative error). double pow512 ( double x) { static const double pow2_512[12] = { 1.0, pow(2.0, 5.0/12.0), pow(4.0, 5.0/12.0), pow(8.0, 5.0/12.0), pow(16.0, 5.0/12.0), pow(32.0, 5.0/12.0), pow(64.0, 5.0/12.0), pow(128.0, 5.0/12.0), pow(256.0, 5.0/12.0), pow(512.0, 5.0/12.0), pow(1024.0, 5.0/12.0), pow(2048.0, 5.0/12.0) }; double s; int iexp; s = frexp (x, &iexp); s *= 2.0; iexp -= 1; div_t qr = div (iexp, 12); if (qr.rem < 0) { qr.quot -= 1; qr.rem += 12; } return ldexp (pow512norm(s)*pow2_512[qr.rem], 5*qr.quot); }
附录:这是怎么回事?
根据请求,下面解释了上面的代码是如何工作的。
概观
上面的代码定义了两个函数, double pow512norm (double x)
和double pow512 (double x)
。 后者是套件的入口点; 这是用户代码应该调用来计算x ^(5/12)的函数。 函数pow512norm(x)
使用Chebyshev多项式来近似x ^(5/12),但只适用于范围[1,2]中的x。 (使用pow512norm(x)
作为该范围之外的x的值,结果将是垃圾。)
函数pow512(x)
将inputx
分解成一对(double s, int n)
,使得x = s * 2^n
并且使得1≤s<2。 进一步将n
分割成(int q, unsigned int r)
使得n = 12*q + r
且r
小于12,可以将x ^(5/12)的问题分解为几部分:
- 通过(u v)^ a =(u ^ a) (v ^ a)代替(5)积极的,你和真实的。
-
s^(5/12)
通过pow512norm(s)
计算。 -
(2^n)^(5/12)=(2^(12*q+r))^(5/12)
。 - 通过
u^(a+b)=(u^a)*(u^b)
来计算正数u,2^(12*q+r)=(2^(12*q))*(2^r)
A,b。 -
(2^(12*q+r))^(5/12)=(2^(5*q))*((2^r)^(5/12))
。 -
(2^r)^(5/12)
由查找表pow2_512
计算。 - 计算
pow512norm(s)*pow2_512[qr.rem]
,我们pow512norm(s)*pow2_512[qr.rem]
。 这里qr.rem
是上面第3步计算出的r
值。 所需要的就是把它乘以2^(5*q)
来得到想要的结果。 - 这正是math库函数
ldexp
所做的。
函数逼近
这里的目标是想出一个容易计算的f(x)= x ^(5/12)的近似值,对于手头的问题来说是“够好的”。 我们的近似值在某种意义上应该接近f(x)。 修辞问题:“接近”是什么意思? 两种相互竞争的解释是最小化均方误差与最大绝对误差的最小化。
我将用股票市场的比喻来描述这些差异。 假设你想为你的退休储蓄。 如果你二十多岁,最好的办法是投资股票或股票市场基金。 这是因为在足够长的时间内,股票市场平均击败其他任何投资计划。 但是,我们都看过把钱投入股票的时代,这是一件非常糟糕的事情。 如果你五六十岁(或四十岁,如果你想退休),你需要更保守地投资。 这些下滑可能会对你的退休投资组合产生影响。
回到函数逼近:作为一些近似的消费者,您通常担心最糟糕的错误,而不是“平均”的性能。 使用一些近似值来构build“平均”最佳性能(例如最小二乘法),而墨菲法则规定,如果性能远远低于平均值,那么程序将花费大量时间使用近似值。 你想要的是一个极小极大值近似值,这个极值极小值可以使某个域的最大绝对误差最小化。 一个好的math图书馆将采取最小方法,而不是最小二乘法,因为这可以让math图书馆的作者给出一些有保证的图书馆performance。
math库通常使用多项式或有理多项式来近似某个域a≤x≤b上的某个函数f(x)。 假设函数f(x)是在这个域上parsing的,并且你想通过一些N次多项式p(x)来近似函数。对于一个给定的度N存在一些神奇的唯一的多项式p(x),使得p x)-f(x)在[a,b]上有N + 2个极值,所以这N + 2个极值的绝对值都是相等的。 寻找这个神奇的多项式p(x)是函数逼近器的圣杯。
我没有为你find圣杯。 我改用Chebyshev近似。 第一类Chebyshev多项式是一个正交(但不正交)的多项式集,当涉及到函数逼近时,它具有一些非常好的特性。 契比雪夫近似常常与魔法多项式p(x)非常接近。 (实际上,确实发现圣杯多项式的Remez交换algorithm通常以Chebyshev近似开始。)
pow512norm(x)的
这个函数使用切比雪夫近似来找出近似x ^(5/12)的多项式p *(x)。 这里我使用p *(x)来区分这个Chebyshev近似值和上面描述的魔术多项式p(x)。 Chebyshev近似p *(x)很容易find; 发现p(x)是一只熊。 Chebyshev逼近p *(x)是sum_i Cn [i] * Tn(i,x),其中Cn [i]是Chebyshev系数,Tn(i,x)是在x处求解的Chebyshev多项式。
我用Wolfram alpha来findChebyshev系数Cn
。 例如, 这个计算Cn[1]
。 input框之后的第一个框具有所需的答案,在这种情况下为0.166658。 这不是我想要的那么多的数字。 点击“更多数字”,瞧,你会得到更多的数字。 Wolfram alpha是免费的; 它会做多less计算有一个限制。 它在更高阶的条件下达到了极限。 (如果您购买或获得了math,您将能够高度精确地计算这些高阶系数。)
切比雪夫多项式Tn(x)在Tn
中计算。 除了给出非常接近神奇多项式p(x)的东西之外,使用切比雪夫近似的另一个原因是这些切比雪夫多项式的值很容易计算:从Tn[0]=1
并且Tn[1]=x
,然后迭代地计算Tn[i]=2*x*Tn[i-1] - Tn[i-2]
。 (在我的代码中,我使用'ii'作为索引variables而不是'i',我从来没有用'i'作为variables名,英语中有多less个单词有'i'这个单词?连续的'我的?
pow512(x)的
pow512
是用户代码应该调用的函数。 我已经在上面描述了这个函数的基础。 更多细节:math库函数frexp(x)
返回inputx
的有效数和指数iexp
。 (次要问题:我想在1到2之间用于pow512norm
但frexp
返回一个介于0.5和1之间的值)。math库函数div
在一个浮点函数中返回整数除的商和余数。 最后,我使用math库函数ldexp
将三个部分放在一起形成最终答案。
伊恩·史蒂芬森(Ian Stephenson)写的这个代码 ,他声称超越了pow()
。 他将这个想法描述如下:
Pow基本上是用log's:
pow(a,b)=x(logx(a)*b)
。 所以我们需要一个快速的日志和快速的指数 – 不pipe什么x是如此我们使用2.技巧是浮点数已经在一个日志样式格式:a=M*2E
取双方的日志给出:
log2(a)=log2(M)+E
或者更简单:
log2(a)~=E
换句话说,如果我们取一个数的浮点数表示,并提取指数,那么我们就有了一个很好的起点。 事实certificate,当我们通过按摩位模式来做到这一点时,尾数最后给出了一个很好的近似误差,并且它工作得很好。
这对于简单的照明计算应该足够好,但是如果你需要更好的东西,那么你可以提取尾数,并用它来计算一个相当准确的二次修正因子。
首先,如今在大多数机器上使用浮动机器不会买得太多。 事实上,双打可以更快。 你的力量,1.0 / 2.4,是5/12或者1/3 *(1 + 1/4)。 即使这是调用cbrt(一次)和sqrt(两次!),它仍然是使用pow()的两倍。 (优化:-O3,编译器:i686-apple-darwin10-g ++ – 4.2.1)。
#include <math.h> // cmath does not provide cbrt; C99 does. double xpow512 (double x) { double cbrtx = cbrt(x); return cbrtx*sqrt(sqrt(cbrtx)); }
这可能不会回答你的问题。
2.4f
和1/2.4f
使我非常怀疑,因为那些是用来在sRGB和线性RGB色彩空间之间转换的力量。 所以你可能实际上正在试图优化,具体来说。 我不知道,这就是为什么这可能不会回答你的问题。
如果是这种情况,请尝试使用查找表。 就像是:
__attribute__((aligned(64)) static const unsigned short SRGB_TO_LINEAR[256] = { ... }; __attribute__((aligned(64)) static const unsigned short LINEAR_TO_SRGB[256] = { ... }; void apply_lut(const unsigned short lut[256], unsigned char *src, ...
如果您正在使用16位数据,请根据需要进行更改。 无论如何,我将把表格设置为16位,这样在处理8位数据时,如果需要,可以对结果进行仿色。 如果你的数据是以浮点数开始的话,这显然不会工作得很好 – 但是将sRGB数据存储在浮点数并没有什么意义,所以你最好先转换成16位/ 8位然后从线性转换到sRGB。
(The reason sRGB doesn't make sense as floating point is that HDR should be linear, and sRGB is only convenient for storing on disk or displaying on screen, but not convenient for manipulation.)
Binomial series does account for a constant exponent, but you will be able to use it only if you can normalize all your input to the range [1,2). (Note that it computes (1+x)^a). You'll have to do some analysis to decide how many terms you need for your desired accuracy.
For exponents of 2.4, you could either make a lookup table for all your 2.4 values and lirp or perhaps higher-order function to fill in the in-betweem values if the table wasn't accurate enough (basically a huge log table.)
Or, value squared * value to the 2/5s which could take the initial square value from the first half of the function and then 5th root it. For the 5th root, you could Newton it or do some other fast approximator, though honestly once you get to this point, your probably better off just doing the exp and log functions with the appropriate abbreviated series functions yourself.
The following is an idea you can use with any of the fast calculation methods. Whether it helps things go faster depends on how your data arrives. You can use the fact that if you know x
and pow(x, n)
, you can use the rate of change of the power to compute a reasonable approximation of pow(x + delta, n)
for small delta
, with a single multiply and add (more or less). If successive values you feed your power functions are close enough together, this would amortize the full cost of the accurate calculation over multiple function calls. Note that you don't need an extra pow calculation to get the derivative. You could extend this to use the second derivative so you can use a quadratic, which would increase the delta
you could use and still get the same accuracy.
I shall answer the question you really wanted to ask, which is how to do fast sRGB <-> linear RGB conversion. To do this precisely and efficiently we can use polynomial approximations. The following polynomial approximations have been generated with sollya, and have a worst case relative error of 0.0144%.
inline double poly7(double x, double a, double b, double c, double d, double e, double f, double g, double h) { double ab, cd, ef, gh, abcd, efgh, x2, x4; x2 = x*x; x4 = x2*x2; ab = a*x + b; cd = c*x + d; ef = e*x + f; gh = g*x + h; abcd = ab*x2 + cd; efgh = ef*x2 + gh; return abcd*x4 + efgh; } inline double srgb_to_linear(double x) { if (x <= 0.04045) return x / 12.92; // Polynomial approximation of ((x+0.055)/1.055)^2.4. return poly7(x, 0.15237971711927983387, -0.57235993072870072762, 0.92097986411523535821, -0.90208229831912012386, 0.88348956209696805075, 0.48110797889132134175, 0.03563925285274562038, 0.00084585397227064120); } inline double linear_to_srgb(double x) { if (x <= 0.0031308) return x * 12.92; // Piecewise polynomial approximation (divided by x^3) // of 1.055 * x^(1/2.4) - 0.055. if (x <= 0.0523) return poly7(x, -6681.49576364495442248881, 1224.97114922729451791383, -100.23413743425112443219, 6.60361150127077944916, 0.06114808961060447245, -0.00022244138470139442, 0.00000041231840827815, -0.00000000035133685895) / (x*x*x); return poly7(x, -0.18730034115395793881, 0.64677431008037400417, -0.99032868647877825286, 1.20939072663263713636, 0.33433459165487383613, -0.01345095746411287783, 0.00044351684288719036, -0.00000664263587520855) / (x*x*x); }
And the sollya input used to generate the polynomials:
suppressmessage(174); f = ((x+0.055)/1.055)^2.4; p0 = fpminimax(f, 7, [|D...|], [0.04045;1], relative); p = fpminimax(f/(p0(1)+1e-18), 7, [|D...|], [0.04045;1], relative); print("relative:", dirtyinfnorm((fp)/f, [s;1])); print("absolute:", dirtyinfnorm((fp), [s;1])); print(canonical(p)); s = 0.0523; z = 3; f = 1.055 * x^(1/2.4) - 0.055; p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [0.0031308;s], relative)/x^z; print("relative:", dirtyinfnorm((fp)/f, [0.0031308;s])); print("absolute:", dirtyinfnorm((fp), [0.0031308;s])); print(canonical(p)); p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [s;1], relative)/x^z; print("relative:", dirtyinfnorm((fp)/f, [s;1])); print("absolute:", dirtyinfnorm((fp), [s;1])); print(canonical(p));
So traditionally the powf(x, p) = x^p
is solved by rewriting x
as x=2^(log2(x))
making powf(x,p) = 2^(p*log2(x))
, which transforms the problem into two approximations exp2()
& log2()
. This has the advantage of working with larger powers p
, however the downside is that this is not the optimal solution for a constant power p
and over a specified input bound 0 ≤ x ≤ 1
.
When the power p > 1
, the answer is a trivial minimax polynomial over the bound 0 ≤ x ≤ 1
, which is the case for p = 12/5 = 2.4
as can be seen below:
float pow12_5(float x){ float mp; // Minimax horner polynomials for x^(5/12), Note: choose the accurarcy required then implement with fma() [Fused Multiply Accumulates] // mp = 0x4.a84a38p-12 + x * (-0xd.e5648p-8 + x * (0xa.d82fep-4 + x * 0x6.062668p-4)); // 1.13705697e-3 mp = 0x1.117542p-12 + x * (-0x5.91e6ap-8 + x * (0x8.0f50ep-4 + x * (0xa.aa231p-4 + x * (-0x2.62787p-4)))); // 2.6079002e-4 // mp = 0x5.a522ap-16 + x * (-0x2.d997fcp-8 + x * (0x6.8f6d1p-4 + x * (0xf.21285p-4 + x * (-0x7.b5b248p-4 + x * 0x2.32b668p-4)))); // 8.61377e-5 // mp = 0x2.4f5538p-16 + x * (-0x1.abcdecp-8 + x * (0x5.97464p-4 + x * (0x1.399edap0 + x * (-0x1.0d363ap0 + x * (0xa.a54a3p-4 + x * (-0x2.e8a77cp-4)))))); // 3.524655e-5 return(mp); }
However when p < 1
the minimax approximation over the bound 0 ≤ x ≤ 1
does not appropriately converge to the desired accuracy. One option [not really] is to rewrite the problem y=x^p=x^(p+m)/x^m
where m=1,2,3
is a positive integer, making the new power approximation p > 1
but this introduces division which is inherently slower.
There's however another option which is to decompose the input x
as its floating point exponent and mantissa form:
x = mx* 2^(ex) where 1 ≤ mx < 2 y = x^(5/12) = mx^(5/12) * 2^((5/12)*ex), let ey = floor(5*ex/12), k = (5*ex) % 12 = mx^(5/12) * 2^(k/12) * 2^(ey)
The minimax approximation of mx^(5/12)
over 1 ≤ mx < 2
now converges much faster than before, without division, but requires 12 point LUT for the 2^(k/12)
. 代码如下:
float powk_12LUT[] = {0x1.0p0, 0x1.0f38fap0, 0x1.1f59acp0, 0x1.306fep0, 0x1.428a3p0, 0x1.55b81p0, 0x1.6a09e6p0, 0x1.7f910ep0, 0x1.965feap0, 0x1.ae89fap0, 0x1.c823ep0, 0x1.e3437ep0}; float pow5_12(float x){ union{float f; uint32_t u;} v, e2; float poff, m, e, ei; int xe; vf = x; xe = ((vu >> 23) - 127); if(xe < -127) return(0.0f); // Calculate remainder k in 2^(k/12) to find LUT e = xe * (5.0f/12.0f); ei = floorf(e); poff = powk_12LUT[(int)(12.0f * (e - ei))]; e2.u = ((int)ei + 127) << 23; // Calculate the exponent vu = (vu & ~(0xFFuL << 23)) | (0x7FuL << 23); // Normalize exponent to zero // Approximate mx^(5/12) on [1,2), with appropriate degree minimax // m = 0x8.87592p-4 + vf * (0x8.8f056p-4 + vf * (-0x1.134044p-4)); // 7.6125e-4 // m = 0x7.582138p-4 + vf * (0xb.1666bp-4 + vf * (-0x2.d21954p-4 + vf * 0x6.3ea0cp-8)); // 8.4522726e-5 m = 0x6.9465cp-4 + vf * (0xd.43015p-4 + vf * (-0x5.17b2a8p-4 + vf * (0x1.6cb1f8p-4 + vf * (-0x2.c5b76p-8)))); // 1.04091259e-5 // m = 0x6.08242p-4 + vf * (0xf.352bdp-4 + vf * (-0x7.d0c1bp-4 + vf * (0x3.4d153p-4 + vf * (-0xc.f7a42p-8 + vf * 0x1.5d840cp-8)))); // 1.367401e-6 return(m * poff * e2.f); }