Is floating point math broken?
0.1 + 0.2 == 0.3 -> false
0.1 + 0.2 -> 0.30000000000000004
Why does this happen?
Binary floating point math is like this. In most programming languages it is based on the IEEE 754 standard. JavaScript uses 64-bit floating point representation, which is the same as Java's double. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.
For 0.1 in the standard binary64 format, the representation can be written exactly as

- 0.1000000000000000055511151231257827021181583404541015625 in decimal, or
- 0x1.999999999999ap-4 in C99 hexfloat notation.
In contrast, the rational number 0.1, which is 1/10, can be written exactly as

- 0.1 in decimal, or
- 0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.
The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2, but the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagrees with the constant in your code.
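To see these stored values directly, here is a minimal sketch in Python (assuming CPython, whose float is IEEE 754 binary64); converting a float to Decimal exposes the exact value each literal actually became:

from decimal import Decimal

print(Decimal(0.1))        # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(0.2))        # 0.200000000000000011102230246251565404236316680908203125
print(Decimal(0.3))        # 0.299999999999999988897769753748434595763683319091796875
print(Decimal(0.1 + 0.2))  # 0.3000000000000000444089209850062616169452667236328125
print((0.1).hex())         # 0x1.999999999999ap-4, the C99 hexfloat form quoted above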
A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.
A hardware designer's perspective
Since I design and build floating point hardware, I believe I should add a hardware designer's perspective. Knowing where the error originates may help in understanding what is happening in the software, and ultimately, I hope this helps explain why floating point errors happen and seem to accumulate over time.
1. Overview
From an engineering perspective, most floating point operations will have some element of error, since the hardware that does the floating point computations is only required to produce a result with an error of less than one half of one unit in the last place. Therefore, much hardware will stop at a precision that is only necessary to yield an error of less than one half of one unit in the last place for a single operation, which is especially problematic in floating point division. What constitutes a single operation depends upon how many operands the unit takes. For most units it is two, but some units take 3 or more operands. Because of this, there is no guarantee that repeated operations will result in a desirable error, since the errors add up over time.
2. Standards
Most processors follow the IEEE-754 standard, but some use denormalized or different standards. For example, there is a denormalized mode in IEEE-754 which allows representation of very small floating point numbers at the expense of precision. The following, however, covers the normalized mode of IEEE-754, which is the typical mode of operation.
In the IEEE-754 standard, hardware designers are allowed any value of error/epsilon as long as it is less than one half of one unit in the last place, and the result only has to be less than one half of one unit in the last place for one operation. This explains why, when there are repeated operations, the errors add up. For IEEE-754 double precision, this is the 54th bit, since 53 bits are used to represent the numeric part (normalized), also called the mantissa, of the floating point number (e.g. the 5.3 in 5.3e5). The next sections go into more detail on the causes of hardware error on various floating point operations.
3. Causes of Error in Division
The main cause of the error in floating point division is the division algorithm used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z = X/Y, Z = X * (1/Y). Division is computed iteratively, i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in slow division, and the size in bits of the quotient selection table is usually the width of the radix, or the number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k >= 2. So, for example, a typical quotient selection table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2 + 2 = 4 bits (plus a few optional bits).
3.1 Rounding Error: Approximation of the Reciprocal
The reciprocals in the quotient selection table depend on the division method: slow division such as SRT division, or fast division such as Goldschmidt division; each entry is modified according to the division algorithm in an attempt to yield the lowest possible error. In any case, though, all reciprocals are approximations of the actual reciprocal and introduce some element of error. Both slow division and fast division methods calculate the quotient iteratively, i.e. some number of bits of the quotient are calculated each step, then the result is subtracted from the dividend, and the divider repeats the steps until the error is less than one half of one unit in the last place. Slow division methods calculate a fixed number of digits of the quotient in each step and are usually less expensive to build, while fast division methods calculate a variable number of digits per step and are usually more expensive to build. The most important part of the division methods is that most of them rely upon repeated multiplication by an approximation of a reciprocal, so they are prone to error.
4. Rounding Errors in Other Operations: Truncation
Another cause of rounding errors in all operations is the different modes of truncation of the final answer that IEEE-754 allows: truncate, round towards zero, round to nearest (the default), round down, and round up. All methods introduce an element of error of less than one half of one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resulting error. This truncation error is especially problematic in exponentiation, which involves some form of repeated multiplication.
5. Repeated Operations
Since the hardware that does the floating point calculations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation, the error will grow over repeated operations if it is not watched. This is the reason that, in computations that require a bounded error, mathematicians use methods such as the round-to-nearest even digit in the last place of IEEE-754, because over time the errors are more likely to cancel each other out, and interval arithmetic combined with variations of the IEEE-754 rounding modes to predict the rounding errors and correct them. Because of its low relative error compared to other rounding modes, round to nearest even digit (in the last place) is the default rounding mode of IEEE-754.
Note that the default rounding mode, round to nearest even digit in the last place, guarantees an error of less than one half of one unit in the last place for one operation. Using truncation, round up, and round down alone may result in an error that is greater than one half of one unit in the last place but less than one unit in the last place, so these modes are not recommended unless they are used in interval arithmetic.
6. Summary
In short, the fundamental reason for the errors in floating point operations is a combination of the truncation in hardware and the truncation of the reciprocal in the case of division. Since the IEEE-754 standard only requires an error of less than one half of one unit in the last place for a single operation, the floating point errors over repeated operations will add up unless they are corrected.
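As a small illustration of how these per-operation rounding errors accumulate, here is a sketch in Python: repeatedly adding the inexact binary64 value of 0.1 drifts away from the mathematically exact total.

total = 0.0
for _ in range(1000):
    total += 0.1           # each addition rounds to the nearest binary64 value

print(total)               # 99.9999999999986, not exactly 100.0
print(total == 100.0)      # False
print(abs(total - 100.0))  # the accumulated error after 1000 additions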
When you convert .1 or 1/10 to base 2 (binary), you get a repeating pattern after the decimal point, just like trying to represent 1/3 in base 10. The value is not exact, and therefore you can't do exact math with it using normal floating point methods.
Floating point rounding error. 0.1 cannot be represented as exactly in base-2 as in base-10 due to the missing prime factor of 5. Just as 1/3 takes an infinite number of digits to represent in decimal but is "0.1" in base-3, 0.1 takes an infinite number of digits in base-2 where it does not in base-10. And computers do not have an infinite amount of memory.
Most of the answers here address this question in very dry, technical terms. I'd like to address it in terms that normal human beings can understand.
Imagine that you are trying to slice up pizzas. You have a robotic pizza cutter that can cut pizza slices exactly in half. It can halve a whole pizza, or it can halve an existing slice, but in any case, the halving is always exact.
That pizza cutter has very fine movements, and if you start with a whole pizza, then halve that, and continue halving the smallest slice each time, you can do the halving 53 times before the slice is too small for even its high-precision abilities. At that point, you can no longer halve that very thin slice, but must either include or exclude it as is.
Now, how would you piece together all the slices in such a way that adds up to one-tenth (0.1) or one-fifth (0.2) of a pizza? Really think about it, and try working it out. You can even try to use a real pizza, if you have a mythical precision pizza cutter at hand. 🙂
Most experienced programmers, of course, know the real answer, which is that there is no way to piece together an exact tenth or fifth of the pizza using those slices, no matter how finely you slice them. You can do a pretty good approximation, and if you add up the approximation of 0.1 with the approximation of 0.2, you get a pretty good approximation of 0.3, but it is still just that, an approximation.
For double-precision numbers (the precision that allows you to halve your pizza 53 times), the numbers immediately less than and greater than 0.1 are 0.09999999999999999167332731531132594682276248931884765625 and 0.1000000000000000055511151231257827021181583404541015625. The latter is slightly closer to 0.1 than the former, so a numeric parser will, given an input of 0.1, favour the latter.
(The difference between those two numbers is the "smallest slice" that we must decide to either include, which introduces an upward bias, or exclude, which introduces a downward bias. The technical term for that smallest slice is an ulp.)
In the case of 0.2, the numbers are all the same, just scaled up by a factor of 2. Again, we favour the value that is slightly higher than 0.2.
Notice that in both cases, the approximations for 0.1 and 0.2 have a slight upward bias. If we add enough of these biases in, they will push the number further and further away from what we want, and in fact, in the case of 0.1 + 0.2, the bias is high enough that the resulting number is no longer the closest number to 0.3.
In particular, 0.1 + 0.2 is really 0.1000000000000000055511151231257827021181583404541015625 + 0.200000000000000011102230246251565404236316680908203125 = 0.3000000000000000444089209850062616169452667236328125, whereas the number closest to 0.3 is actually 0.299999999999999988897769753748434595763683319091796875.
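The "smallest slice" (the ulp) and the two neighbouring doubles can be inspected directly; a sketch using Python's math module (3.9 or later):

import math

print(math.ulp(0.1))           # 1.3877787807814457e-17, the size of the smallest slice around 0.1
print(math.nextafter(0.1, 0))  # 0.09999999999999999, the double just below 0.1
print(math.nextafter(0.1, 1))  # 0.10000000000000002, the double just above the stored 0.1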
P.S. Some programming languages also provide pizza cutters that can split slices into exact tenths. Although such pizza cutters are uncommon, if you do have access to one, you should use it when it is important to be able to get exactly one-tenth or one-fifth of a slice.
(Originally posted on Quora.)
In addition to the other correct answers, you may want to consider scaling your values to avoid problems with floating-point arithmetic.
For example:
var result = 1.0 + 2.0; // result === 3.0 returns true
... instead of:
var result = 0.1 + 0.2; // result === 0.3 returns false
The expression 0.1 + 0.2 === 0.3 returns false in JavaScript, but fortunately integer arithmetic in floating point is exact, so decimal representation errors can be avoided by scaling.
As a practical example, to avoid floating-point problems where accuracy is paramount, it is recommended¹ to handle money as an integer representing the number of cents: 2550 cents instead of 25.50 dollars.
¹ Douglas Crockford: JavaScript: The Good Parts, Appendix A - Awful Parts (page 105).
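A minimal sketch of the cents idea in Python (the variable names are only for illustration):

price_in_cents = 2550                            # $25.50 stored as an exact integer
tax_in_cents = 255                               # $2.55
total_in_cents = price_in_cents + tax_in_cents   # exact integer arithmetic: 2805

print(total_in_cents / 100)   # 28.05, converted to dollars only for display
print(0.1 + 0.2 == 0.3)       # False: the same kind of sum done with binary floats
print(10 + 20 == 30)          # True: the sum done in integer cents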
My answer is quite long, so I've split it into three sections. Since the question is about floating point math, I've put the emphasis on what the machine actually does. I've also made it specific to double (64-bit) precision, but the argument applies equally to any floating point arithmetic.
Preamble
An IEEE 754 double-precision binary floating-point format (binary64) number represents a number of the form

value = (-1)^s × (1.m51 m50 ... m2 m1 m0)₂ × 2^(e - 1023)

using 64 bits:
- The first bit is the sign bit: 1 if the number is negative, 0 otherwise¹.
- The next 11 bits are the exponent, which is offset by 1023. In other words, after reading the exponent bits from a double-precision number, 1023 must be subtracted to obtain the power of two.
- The remaining 52 bits are the significand (or mantissa). In the mantissa, an "implied" 1. is always omitted², since the most significant bit of any binary value is 1.
1 – IEEE 754 allows for the concept of a signed zero: +0 and -0 are treated differently; 1 / (+0) is positive infinity and 1 / (-0) is negative infinity. For zero values, the mantissa and exponent bits are all zero. Note: zero values (+0 and -0) are explicitly not classed as denormal².
2 – This is not the case for denormal numbers, which have an offset exponent of zero (and an implied 0.). The range of denormal double-precision numbers is dmin ≤ |x| ≤ dmax, where dmin (the smallest representable nonzero number) is 2^(-1023 - 51) (≈ 4.94 × 10⁻³²⁴) and dmax (the largest denormal number, whose mantissa consists entirely of 1s) is 2^(-1023 + 1) - 2^(-1023 - 51) (≈ 2.225 × 10⁻³⁰⁸).
Turning a double precision number into binary
Many online converters exist to convert a double precision floating point number to binary (e.g. at binaryconvert.com), but here is some sample C# code to obtain the IEEE 754 representation for a double precision number (I separate the three parts with colons (:)):
public static string BinaryRepresentation(double value)
{
    long valueInLongType = BitConverter.DoubleToInt64Bits(value);
    string bits = Convert.ToString(valueInLongType, 2);
    string leadingZeros = new string('0', 64 - bits.Length);
    string binaryRepresentation = leadingZeros + bits;
    string sign = binaryRepresentation[0].ToString();
    string exponent = binaryRepresentation.Substring(1, 11);
    string mantissa = binaryRepresentation.Substring(12);
    return string.Format("{0}:{1}:{2}", sign, exponent, mantissa);
}
Getting to the point: the original question
(Skip to the bottom for the TL;DR version)
Cato Johnston (the question asker) asked why 0.1 + 0.2 != 0.3.
Written in binary (with the three parts separated by colons), the IEEE 754 representations of the values are:
0.1 => 0:01111111011:1001100110011001100110011001100110011001100110011010 0.2 => 0:01111111100:1001100110011001100110011001100110011001100110011010
Note that the mantissas are composed of recurring digits of 0011. This is key to why there is any error in the calculations - 0.1, 0.2 and 0.3 cannot be represented in binary precisely in a finite number of binary bits, any more than 1/9, 1/3 or 1/7 can be represented precisely in decimal digits.
Converting the exponents to decimal, removing the offset, and re-adding the implied 1 (in square brackets), 0.1 and 0.2 are:
0.1 = 2^-4 * [1].1001100110011001100110011001100110011001100110011010 0.2 = 2^-3 * [1].1001100110011001100110011001100110011001100110011010
To add the two numbers, the exponents need to be the same, i.e.:
0.1 = 2^-3 * 0.1100110011001100110011001100110011001100110011001101(0) 0.2 = 2^-3 * 1.1001100110011001100110011001100110011001100110011010 sum = 2^-3 * 10.0110011001100110011001100110011001100110011001100111
Since the sum is not of the form 2^n * 1.{bbb}, we increase the exponent by one and shift the decimal (binary) point to get:
sum = 2^-2 * 1.0011001100110011001100110011001100110011001100110011(1)
There are now 53 bits in the mantissa (the 53rd is in square brackets in the line above). The default rounding mode for IEEE 754 is "Round to Nearest", i.e. if a number x falls between two values a and b, the value where the least significant bit is zero is chosen.
a = 2^-2 * 1.0011001100110011001100110011001100110011001100110011 x = 2^-2 * 1.0011001100110011001100110011001100110011001100110011(1) b = 2^-2 * 1.0011001100110011001100110011001100110011001100110100
Note that a and b differ only in the last bit; ...0011 + 1 = ...0100. In this case, the value with the least significant bit of zero is b, so the sum is:
sum = 2^-2 * 1.0011001100110011001100110011001100110011001100110100
TL; DR
Writing 0.1 + 0.2 in the IEEE 754 binary representation (with the three parts separated by colons) and comparing it to 0.3, this is (I've put the distinct bits in square brackets):
0.1 + 0.2 => 0:01111111101:0011001100110011001100110011001100110011001100110[100] 0.3 => 0:01111111101:0011001100110011001100110011001100110011001100110[011]
Converted back to decimal, these values are:
0.1 + 0.2 => 0.300000000000000044408920985006... 0.3 => 0.299999999999999988897769753748...
Compared with the original values, the difference is exactly 2⁻⁵⁴, which is ~5.5511151231258 × 10⁻¹⁷, insignificant for many applications.
Comparing the last few bits of a floating point number is inherently dangerous, as anyone who has read the famous "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (which covers all the major parts of this answer) will know.
Most calculators use additional guard digits to get around this problem, which is how 0.1 + 0.2 would give 0.3: the final few bits are rounded.
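A quick way to reproduce the bit-level comparison above without doing the binary arithmetic by hand is a sketch in Python; float.hex shows the stored significand and exponent of each binary64 value:

print((0.1 + 0.2).hex())  # 0x1.3333333333334p-2
print((0.3).hex())        # 0x1.3333333333333p-2  (differs only in the last hex digit)
print(0.1 + 0.2 - 0.3)    # 5.551115123125783e-17, i.e. 2^-54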
Floating point rounding error. From What Every Computer Scientist Should Know About Floating-Point Arithmetic:
Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.
A floating point number stored in a computer consists of two parts, an integer and an exponent that the base is taken to and multiplied by the integer part.
If the computer were working in base 10, 0.1 would be 1 x 10⁻¹, 0.2 would be 2 x 10⁻¹, and 0.3 would be 3 x 10⁻¹. Integer math is easy and exact, so adding 0.1 + 0.2 would obviously result in 0.3.
Computers don't usually work in base 10, they work in base 2. You can still get exact results for some values, for example 0.5
is 1 x 2⁻¹
and 0.25
is 1 x 2⁻²
, and adding them results in 3 x 2⁻²
, or 0.75
. Exactly.
The problem comes with numbers that can be represented exactly in base 10, but not in base 2. Those numbers need to be rounded to their closest equivalent. Assuming the very common IEEE 64-bit floating point format, the closest number to 0.1
is 3602879701896397 x 2⁻⁵⁵, and the closest number to 0.2
is 3602879701896397 x 2⁻⁵⁴; adding them together and rounding the result to the nearest representable double gives 2702159776422298 x 2⁻⁵³
, or an exact decimal value of 0.3000000000000000444089209850062616169452667236328125
. Floating point numbers are generally rounded for display.
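In Python you can ask a float for the exact fraction it stores; a sketch (as_integer_ratio returns the fraction in lowest terms, so the powers of two differ from the unreduced forms above):

print((0.1).as_integer_ratio())        # (3602879701896397, 36028797018963968), i.e. n / 2^55
print((0.2).as_integer_ratio())        # (3602879701896397, 18014398509481984), i.e. n / 2^54
print((0.1 + 0.2).as_integer_ratio())  # (1351079888211149, 4503599627370496), i.e. n / 2^52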
My workaround:
function add(a, b, precision) {
    var x = Math.pow(10, precision || 2);
    return (Math.round(a * x) + Math.round(b * x)) / x;
}
precision refers to the number of digits you want to preserve after the decimal point during addition.
No, not broken, but most decimal fractions must be approximated
Summary
Floating point arithmetic is exact, unfortunately, it doesn't match up well with our usual base-10 number representation, so it turns out we are often giving it input that is slightly off from what we wrote.
Even simple numbers like 0.01, 0.02, 0.03, 0.04 … 0.24 are not representable exactly as binary fractions, even if you had thousands of bits of precision in the mantissa, even if you had millions. If you count off in 0.01 increments, not until you get to 0.25 will you get the first fraction (in this sequence) representable in base 10 and base 2 . But if you tried that using FP, your 0.01 would have been slightly off, so the only way to add 25 of them up to a nice exact 0.25 would have required a long chain of causality involving guard bits and rounding. It's hard to predict so we throw up our hands and say "FP is inexact".
We constantly give the FP hardware something that seems simple in base 10 but is a repeating fraction in base 2.
How did this happen?
When we write in decimal, every fraction is a rational number of the form
x / (2ⁿ × 5ⁿ).
In binary, we only get the 2ⁿ term, that is:

x / 2ⁿ
So in decimal, we can't represent 1/3. Because base 10 includes 2 as a prime factor, every number we can write as a binary fraction also can be written as a base 10 fraction. However, hardly anything we write as a base 10 fraction is representable in binary. In the range from 0.01, 0.02, 0.03 ... 0.99, only three numbers can be represented in our FP format: 0.25, 0.50, and 0.75, because they are 1/4, 1/2, and 3/4, all numbers with a denominator using only the 2ⁿ term.
In base 10 we can't represent 1 / 3 . But in binary, we can't do 1 / 10 or 1 / 3 .
So while every binary fraction can be written in decimal, the reverse is not true. And in fact most decimal fractions repeat in binary.
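A brief sketch that checks the claim for the hundredths: only those k/100 whose value survives the round trip through a binary float unchanged are exactly representable.

from fractions import Fraction

exact = [k for k in range(1, 100)
         if Fraction(k, 100) == Fraction(k / 100)]  # Fraction(float) is the exact stored value
print([k / 100 for k in exact])                     # [0.25, 0.5, 0.75]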
Dealing with it
Developers are usually instructed to do < epsilon comparisons, better advice might be to round to integral values (in the C library: round() and roundf(), ie, stay in the FP format) and then compare. Rounding to a specific decimal fraction length solves most problems with output.
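For instance, a tolerance-based comparison and a round-then-compare both sidestep the issue; a sketch (the 1e-9 tolerance is an arbitrary choice for illustration):

a = 0.1 + 0.2

print(a == 0.3)                      # False: exact comparison of binary floats
print(abs(a - 0.3) < 1e-9)           # True: "< epsilon" style comparison
print(round(a, 2) == round(0.3, 2))  # True: compare after rounding to a fixed number of decimals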
Also, on real number-crunching problems (the problems that FP was invented for on early, frightfully expensive computers) the physical constants of the universe and all other measurements are only known to a relatively small number of significant figures, so the entire problem space was "inexact" anyway. FP "accuracy" isn't a problem in this kind of application.
The whole issue really arises when people try to use FP for bean counting. It does work for that, but only if you stick to integral values, which kind of defeats the point of using it. This is why we have all those decimal fraction software libraries.
I love the Pizza answer by Chris , because it describes the actual problem, not just the usual handwaving about "inaccuracy". If FP were simply "inaccurate", we could fix that and would have done it decades ago. The reason we haven't is because the FP format is compact and fast and it's the best way to crunch a lot of numbers. Also, it's a legacy from the space age and arms race and early attempts to solve big problems with very slow computers using small memory systems. (Sometimes, individual magnetic cores for 1-bit storage, but that's another story. )
Conclusion
If you are just counting beans at a bank, software solutions that use decimal string representations in the first place work perfectly well. But you can't do quantum chromodynamics or aerodynamics that way.
Some statistics related to this famous double precision question.
When adding all values (a + b) using a step of 0.1 (from 0.1 to 100) we have a ~15% chance of precision error. Note that the error could result in slightly bigger or smaller values. Here are some examples:
0.1 + 0.2 = 0.30000000000000004 (BIGGER) 0.1 + 0.7 = 0.7999999999999999 (SMALLER) ... 1.7 + 1.9 = 3.5999999999999996 (SMALLER) 1.7 + 2.2 = 3.9000000000000004 (BIGGER) ... 3.2 + 3.6 = 6.800000000000001 (BIGGER) 3.2 + 4.4 = 7.6000000000000005 (BIGGER)
When subtracting all values (a - b where a > b) using a step of 0.1 (from 100 to 0.1) we have a ~34% chance of precision error. Here are some examples:
0.6 - 0.2 = 0.39999999999999997 (SMALLER) 0.5 - 0.4 = 0.09999999999999998 (SMALLER) ... 2.1 - 0.2 = 1.9000000000000001 (BIGGER) 2.0 - 1.9 = 0.10000000000000009 (BIGGER) ... 100 - 99.9 = 0.09999999999999432 (SMALLER) 100 - 99.8 = 0.20000000000000284 (BIGGER)
*15% and 34% are indeed huge, so always use BigDecimal when precision is of big importance. With 2 decimal digits (step 0.01) the situation worsens a bit more (18% and 36%).
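A rough sketch of how such percentages can be explored (this reconstruction of the method is an assumption: it counts the pairs whose printed float sum differs from the exact decimal sum, and it takes a little while to run):

from decimal import Decimal

grid = [(i / 10, Decimal(i) / 10) for i in range(1, 1001)]   # 0.1 .. 100.0 in steps of 0.1
errors = 0
for a_float, a_exact in grid:
    for b_float, b_exact in grid:
        if Decimal(repr(a_float + b_float)) != a_exact + b_exact:
            errors += 1

print(100 * errors / len(grid) ** 2, "% of the sums print an inexact value")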
Did you try the duct tape solution?
Try to determine when errors occur and fix them with short if statements, it's not pretty but for some problems it is the only solution and this is one of them.
if( (n * 0.1) < 100.0 ) { return n * 0.1 - 0.000000000000001 ;} else { return n * 0.1 + 0.000000000000001 ;}
I had the same problem in a scientific simulation project in c#, and I can tell you that if you ignore the butterfly effect it's gonna turn to a big fat dragon and bite you in the a**
A lot of good answers have been posted, but I'd like to append one more.
Not all numbers can be represented via floats/doubles. For example, the number "0.2" will be represented as "0.200000003" in single precision under the IEEE 754 floating point standard.

Under the hood, the model used to store real numbers represents a float as a significand scaled by a power of the radix (sign × mantissa × FLT_RADIX^exponent).

Even though you can type 0.2 easily, FLT_RADIX and DBL_RADIX are 2, not 10, on a computer with an FPU that uses the "IEEE Standard for Binary Floating-Point Arithmetic (ISO/IEEE Std 754-1985)".
So it is a bit hard to represent such numbers exactly. Even if you specify this variable explicitly without any intermediate calculation.
Those weird numbers appear because computers use a binary (base 2) number system for calculation purposes, while we use decimal (base 10). The majority of fractional numbers cannot be represented precisely either in binary or in decimal or both. The result is a rounded number: it is stored precisely as a binary value, but it is not the number you wrote.
Can I just add: people always assume this to be a computer problem, but if you count with your hands (base 10), you can't get (1/3 + 1/3 = 2/3) = true unless you have infinity to add 0.333... to 0.333..., so just as with the 0.1 problem in base 2, you truncate it to 0.333 + 0.333 = 0.666 and probably round it to 0.667, which would also be technically inaccurate.
Count in ternary, and thirds are not a problem though – maybe some race with 15 fingers on each hand would ask why your decimal math was broken…
Given that nobody has mentioned this…
Some high level languages such as Python and Java come with tools to overcome binary floating point limitations. For example:
- Python's decimal module and Java's BigDecimal class, which represent numbers internally with decimal notation (as opposed to binary notation). Both have limited precision, so they are still error prone; however, they solve most common problems with binary floating point arithmetic. Decimals are very nice when dealing with money: ten cents plus twenty cents are always exactly thirty cents:
>>> 0.1 + 0.2 == 0.3 False >>> Decimal('0.1') + Decimal('0.2') == Decimal('0.3') True
Python's decimal module is based on IEEE standard 854-1987.
Python's
fractions
module and Apache Common'sBigFraction
class . Both represent rational numbers as(numerator, denominator)
pairs and they may give more accurate results than decimal floating point arithmetic.
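A minimal sketch of the fractions approach in Python (constructing the fractions from integers so that no binary rounding sneaks in):

from fractions import Fraction

a = Fraction(1, 10) + Fraction(2, 10)
print(a)                     # 3/10
print(a == Fraction(3, 10))  # True: exact rational arithmetic
print(float(a))              # 0.3, rounding happens only on the final conversion back to a float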
Neither of these solutions is perfect (especially if we look at performances, or if we require a very high precision), but still they solve a great number of problems with binary floating point arithmetic.
Many of this question's numerous duplicates ask about the effects of floating point rounding on specific numbers. In practice, it is easier to get a feeling for how it works by looking at exact results of calculations of interest rather than by just reading about it. Some languages provide ways of doing that – such as converting a float
or double
to BigDecimal
in Java.
Since this is a language-agnostic question, it needs language-agnostic tools, such as a Decimal to Floating-Point Converter .
Applying it to the numbers in the question, treated as doubles:
0.1 converts to 0.1000000000000000055511151231257827021181583404541015625,
0.2 converts to 0.200000000000000011102230246251565404236316680908203125,
0.3 converts to 0.299999999999999988897769753748434595763683319091796875, and
0.30000000000000004 converts to 0.3000000000000000444089209850062616169452667236328125.
Adding the first two numbers manually or in a decimal calculator such as Full Precision Calculator , shows the exact sum of the actual inputs is 0.3000000000000000166533453693773481063544750213623046875.
If it were rounded down to the equivalent of 0.3 the rounding error would be 0.0000000000000000277555756156289135105907917022705078125. Rounding up to the equivalent of 0.30000000000000004 also gives rounding error 0.0000000000000000277555756156289135105907917022705078125. The round-to-even tie breaker applies.
Returning to the floating point converter, the raw hexadecimal for 0.30000000000000004 is 3fd3333333333334, which ends in an even digit and therefore is the correct result.
The kind of floating-point math that can be implemented in a digital computer necessarily uses an approximation of the real numbers and operations on them. (The standard version runs to over fifty pages of documentation and has a committee to deal with its errata and further refinement.)
This approximation is a mixture of approximations of different kinds, each of which can either be ignored or carefully accounted for due to its specific manner of deviation from exactitude. It also involves a number of explicit exceptional cases at both the hardware and software levels that most people walk right past while pretending not to notice.
If you need infinite precision (using the number π, for example, instead of one of its many shorter stand-ins), you should write or use a symbolic math program instead.
But if you're okay with the idea that sometimes floating-point math is fuzzy in value and logic and errors can accumulate quickly, and you can write your requirements and tests to allow for that, then your code can frequently get by with what's in your FPU.
Just for fun, I played with the representation of floats, following the definitions from the Standard C99 and I wrote the code below.
The code prints the binary representation of floats in 3 separated groups
SIGN EXPONENT FRACTION
and after that it prints a sum that, when summed with enough precision, will show the value that really exists in hardware.
So when you write float x = 999..., the compiler will transform that number into a bit representation, printed by the function xx, such that the sum printed by the function yy is equal to the given number.
In reality, this sum is only an approximation. For the number 999,999,999, the compiler will insert into the bit representation of the float the number 1,000,000,000.
After the code I attach a console session, in which I compute the sum of terms for both constants (minus PI and 999999999) that really exists in hardware, inserted there by the compiler.
#include <stdio.h>
#include <limits.h>

void xx(float *x)
{
    unsigned char i = sizeof(*x)*CHAR_BIT - 1;
    do {
        switch (i) {
        case 31: printf("sign:"); break;
        case 30: printf("exponent:"); break;
        case 23: printf("fraction:"); break;
        }
        char b = (*(unsigned long long*)x & ((unsigned long long)1 << i)) != 0;
        printf("%d ", b);
    } while (i--);
    printf("\n");
}

void yy(float a)
{
    int sign = !(*(unsigned long long*)&a & ((unsigned long long)1 << 31));
    int fraction = ((1 << 23) - 1) & (*(int*)&a);
    int exponent = (255 & ((*(int*)&a) >> 23)) - 127;

    printf(sign ? "positive" " ( 1+" : "negative" " ( 1+");
    unsigned int i = 1 << 22;
    unsigned int j = 1;
    do {
        char b = (fraction & i) != 0;
        b && (printf("1/(%d) %c", 1 << j, (fraction & (i - 1)) ? '+' : ')'), 0);
    } while (j++, i >>= 1);
    printf("*2^%d", exponent);
    printf("\n");
}

void main()
{
    float x = -3.14;
    float y = 999999999;

    printf("%lu\n", sizeof(x));
    xx(&x);
    xx(&y);
    yy(x);
    yy(y);
}
Here is a console session in which I compute the real value of the float that exists in hardware. I used bc to print the sum of terms outputted by the main program. One can insert that sum in python repl or something similar also.
-- .../terra1/stub @ qemacs fc -- .../terra1/stub @ gcc fc -- .../terra1/stub @ ./a.out sign:1 exponent:1 0 0 0 0 0 0 fraction:0 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 1 sign:0 exponent:1 0 0 1 1 1 0 fraction:0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0 negative ( 1+1/(2) +1/(16) +1/(256) +1/(512) +1/(1024) +1/(2048) +1/(8192) +1/(32768) +1/(65536) +1/(131072) +1/(4194304) +1/(8388608) )*2^1 positive ( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29 -- .../terra1/stub @ bc scale=15 ( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29 999999999.999999446351872
That's it. The value of 999999999 is in fact
999999999.999999446351872
You can also check with bc
that -3.14 is also perturbed. Do not forget to set a scale
factor in bc
.
The displayed sum is what is inside the hardware. The value you obtain by computing it depends on the scale you set. I set the scale factor to 15. Mathematically, with infinite precision, it seems to be 1,000,000,000.