Speed comparison with Project Euler: C vs Python vs Erlang vs Haskell
I took Project Euler's problem #12 as a programming exercise and compared my (certainly not optimal) implementations in C, Python, Erlang and Haskell. To get longer execution times, I search for the first triangle number with more than 1000 divisors instead of the 500 stated in the original problem.
The results are as follows:
C:
lorenzo@enzo:~/erlang$ gcc -lm -o euler12.bin euler12.c
lorenzo@enzo:~/erlang$ time ./euler12.bin
842161320

real    0m11.074s
user    0m11.070s
sys     0m0.000s
Python:
lorenzo@enzo:~/erlang$ time ./euler12.py
842161320

real    1m16.632s
user    1m16.370s
sys     0m0.250s
Python with PyPy:
lorenzo@enzo:~/Downloads/pypy-c-jit-43780-b590cf6de419-linux64/bin$ time ./pypy /home/lorenzo/erlang/euler12.py
842161320

real    0m13.082s
user    0m13.050s
sys     0m0.020s
Erlang:
lorenzo@enzo:~/erlang$ erlc euler12.erl
lorenzo@enzo:~/erlang$ time erl -s euler12 solve
Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.4  (abort with ^G)
1> 842161320

real    0m48.259s
user    0m48.070s
sys     0m0.020s
Haskell:
lorenzo@enzo:~/erlang$ ghc euler12.hs -o euler12.hsx
[1 of 1] Compiling Main             ( euler12.hs, euler12.o )
Linking euler12.hsx ...
lorenzo@enzo:~/erlang$ time ./euler12.hsx
842161320

real    2m37.326s
user    2m37.240s
sys     0m0.080s
Summary:
- C: 100%
- Python: 692% (118% with PyPy)
- Erlang: 436% (135% thanks to RichardC)
- Haskell: 1421%
I assume that C has a big advantage here because it uses long for the calculations rather than arbitrary-length integers. It also doesn't need to load a runtime first (do the others?).
Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary-length integers, or don't they as long as the values are less than MAXINT?
Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes, or is it my implementation? (The latter is quite probable, as Haskell is a book with seven seals to me.)
Question 3: Can you offer some hints on how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.
EDIT:
Question 4: Do my functional implementations permit LCO (last call optimization, a.k.a. tail recursion elimination) and hence avoid adding unnecessary frames onto the call stack?
I really tried to implement the same algorithm as similarly as possible in the four languages, although I have to admit that my Haskell and Erlang knowledge is very limited.
Source code used:
#include <stdio.h>
#include <math.h>

int factorCount (long n)
{
    double square = sqrt (n);
    int isquare = (int) square;
    int count = isquare == square ? -1 : 0;
    long candidate;
    for (candidate = 1; candidate <= isquare; candidate ++)
        if (0 == n % candidate) count += 2;
    return count;
}

int main ()
{
    long triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index ++;
        triangle += index;
    }
    printf ("%ld\n", triangle);
}
#! /usr/bin/env python3.2

import math

def factorCount (n):
    square = math.sqrt (n)
    isquare = int (square)
    count = -1 if isquare == square else 0
    for candidate in range (1, isquare + 1):
        if not n % candidate: count += 2
    return count

triangle = 1
index = 1
while factorCount (triangle) < 1001:
    index += 1
    triangle += index

print (triangle)
-module (euler12).
-compile (export_all).

factorCount (Number) -> factorCount (Number, math:sqrt (Number), 1, 0).

factorCount (_, Sqrt, Candidate, Count) when Candidate > Sqrt -> Count;

factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;

factorCount (Number, Sqrt, Candidate, Count) ->
    case Number rem Candidate of
        0 -> factorCount (Number, Sqrt, Candidate + 1, Count + 2);
        _ -> factorCount (Number, Sqrt, Candidate + 1, Count)
    end.

nextTriangle (Index, Triangle) ->
    Count = factorCount (Triangle),
    if
        Count > 1000 -> Triangle;
        true -> nextTriangle (Index + 1, Triangle + Index + 1)
    end.

solve () ->
    io:format ("~p~n", [nextTriangle (1, 1) ] ),
    halt (0).
factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' number sqrt candidate count
    | fromIntegral candidate > sqrt = count
    | number `mod` candidate == 0 = factorCount' number sqrt (candidate + 1) (count + 2)
    | otherwise = factorCount' number sqrt (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1
Using GHC 7.0.3, gcc 4.4.6, Linux 2.6.29 on an x86_64 Core2 Duo (2.5 GHz) machine, compiling with ghc -O2 -fllvm -fforce-recomp for Haskell and gcc -O3 -lm for C.
- Your C program runs in 8.4 seconds (faster than your run, probably because of -O3)
- Your Haskell solution runs in 36 seconds (due to the -O2 flag)
- Your factorCount' code isn't explicitly typed and is defaulting to Integer (thanks to Daniel for correcting my misdiagnosis here!). Giving an explicit type signature (which is standard practice anyway) using Int changes the time to 11.1 seconds
- In factorCount' you needlessly call fromIntegral. A fix results in no change though (the compiler is smart; lucky for you).
- You used mod where rem is faster and sufficient. This changes the time to 8.5 seconds.
- factorCount' is constantly applying two extra arguments that never change (number and sqrt). A worker/wrapper transformation gives us:
$ time ./so
842161320

real    0m7.954s
user    0m7.944s
sys     0m0.004s
That's right: 7.95 seconds, consistently half a second faster than the C solution. Without the -fllvm flag I still get 8.182 seconds, so the NCG backend is doing well in this case too.
Conclusion: Haskell is awesome.
Resulting code:
factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' :: Int -> Int -> Int -> Int -> Int
factorCount' number sqrt candidate0 count0 = go candidate0 count0
  where
    go candidate count
        | candidate > sqrt = count
        | number `rem` candidate == 0 = go (candidate + 1) (count + 2)
        | otherwise = go (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1
EDIT: So now that we've explored that, let's address the questions.
Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary-length integers, or don't they as long as the values are less than MAXINT?
In Haskell, using Integer is slower than Int, but how much slower depends on the computations performed. Luckily (for 64-bit machines) Int is sufficient. For portability's sake you should probably rewrite my code to use Int64 or Word64 (C is not the only language with a long).
Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes, or is it my implementation? (The latter is quite probable, as Haskell is a book with seven seals to me.)
Question 3: Can you offer some hints on how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.
That's what I answered above. The answer was:
- 0) use optimization via -O2
- 1) use fast (notably: unboxable) types when possible
- 2) rem not mod (a frequently forgotten optimization), and
- 3) worker/wrapper transformation (perhaps the most common optimization).
Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?
Yes, that wasn't the issue. Good work, and glad you considered this.
There are some problems with the Erlang implementation. As a baseline for the following, my measured execution time for your unmodified Erlang program was 47.6 seconds, compared to 12.7 seconds for the C code.
The first thing you should do if you want to run computationally intensive Erlang code is to use native code. Compiling with erlc +native euler12 got the time down to 41.3 seconds. This is, however, a much lower speedup than expected from native compilation on this kind of code (only 15%), and the problem is your use of -compile(export_all). This is useful for experimentation, but the fact that all functions are potentially reachable from the outside causes the native compiler to be very conservative. (The normal BEAM emulator is not much affected.) Replacing this declaration with -export([solve/0]). gives a much better speedup: 31.5 seconds (almost 35% from the baseline).
But the code itself has a problem: for every iteration in the factorCount loop, you perform this test:
factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;
The C code doesn't do this. In general, it can be tricky to make a fair comparison between different implementations of the same code, in particular if the algorithm is numerical, because you need to be sure that they are actually doing the same thing. A slight rounding error in one implementation due to some type conversion somewhere may cause it to do many more iterations than the other, even though both eventually reach the same result.
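As an aside (my own sketch, not part of this answer): one way to make the C side robust against this class of boundary issue is to correct the floating-point square-root estimate with integer arithmetic before using it as the loop bound.

/* Hypothetical helper, not taken from any of the programs above: an
 * integer square root that cannot be thrown off by floating-point
 * rounding, obtained by correcting the initial estimate. */
#include <math.h>

static long isqrt_checked (long n)
{
    long r = (long) sqrt ((double) n);   /* initial floating-point guess         */
    while (r > 0 && r * r > n) r--;      /* correct an estimate that is too high */
    while ((r + 1) * (r + 1) <= n) r++;  /* correct an estimate that is too low  */
    return r;                            /* now r*r <= n < (r+1)*(r+1)           */
}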
To eliminate this possible source of error (and get rid of the extra test in each iteration), I rewrote the factorCount function as follows, closely modelled on the C code:
factorCount (N) ->
    Sqrt = math:sqrt (N),
    ISqrt = trunc(Sqrt),
    if ISqrt == Sqrt -> factorCount (N, ISqrt, 1, -1);
       true          -> factorCount (N, ISqrt, 1, 0)
    end.

factorCount (_N, ISqrt, Candidate, Count) when Candidate > ISqrt -> Count;
factorCount ( N, ISqrt, Candidate, Count) ->
    case N rem Candidate of
        0 -> factorCount (N, ISqrt, Candidate + 1, Count + 2);
        _ -> factorCount (N, ISqrt, Candidate + 1, Count)
    end.
This rewrite, with no export_all and with native compilation, gave me the following run times:
$ erlc +native euler12.erl
$ time erl -noshell -s euler12 solve
842161320

real    0m19.468s
user    0m19.450s
sys     0m0.010s
which is not too bad compared to the C code:
$ time ./a.out
842161320

real    0m12.755s
user    0m12.730s
sys     0m0.020s
Considering that Erlang is not at all geared towards writing numerical code, being only 50% slower than C on a program like this is pretty good.
Finally, regarding your questions:
Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary-length integers, or don't they as long as the values are less than MAXINT?
Yes, somewhat. In Erlang there is no way of saying "use 32/64-bit arithmetic with wrap-around", so unless the compiler can prove some bounds on your integers (and it usually can't), it must check all computations to see if they fit in a single tagged word or if it has to turn them into heap-allocated bignums. Even if no bignums are ever used in practice at run time, these checks still have to be performed. On the other hand, that means you know the algorithm will never fail because of an unexpected integer wraparound if you suddenly give it larger inputs than before.
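To make the wraparound contrast concrete, here is a small C sketch (my addition, not part of the answer): a 32-bit accumulator silently wraps where Erlang would transparently switch to a bignum.

/* Illustration only: the 100000th triangle number is 5000050000, which no
 * longer fits in 32 bits. uint32_t is used so the wraparound is well defined
 * (arithmetic modulo 2^32). */
#include <stdio.h>
#include <stdint.h>

int main (void)
{
    uint32_t triangle32 = 0;
    uint64_t triangle64 = 0;
    uint64_t index;
    for (index = 1; index <= 100000; index++) {
        triangle32 += (uint32_t) index;  /* wraps: 5000050000 mod 2^32 = 705082704 */
        triangle64 += index;             /* stays exact                            */
    }
    printf ("32-bit: %llu\n64-bit: %llu\n",
            (unsigned long long) triangle32, (unsigned long long) triangle64);
    return 0;
}

With checked arithmetic you pay a little on every operation; with fixed-width arithmetic you pay nothing until the day the numbers outgrow the type.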
Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?
Yes, your Erlang code is correct with respect to last-call optimization.
Regarding Python optimization: in addition to using PyPy (for pretty impressive speedups with zero change to your code), you could use PyPy's translation toolchain to compile an RPython-compliant version, or Cython to build an extension module. Both were faster than the C version in my testing, with the Cython module almost twice as fast. For reference, I include the C and PyPy benchmark results as well:
C (compiled with gcc -O3 -lm):
% time ./euler12-c
842161320
./euler12-c  11.95s user 0.00s system 99% cpu 11.959 total
PyPy 1.5
% time pypy euler12.py
842161320
pypy euler12.py  16.44s user 0.01s system 99% cpu 16.449 total
RPython (using the latest PyPy revision, c2f583445aee):
% time ./euler12-rpython-c
842161320
./euler12-rpy-c  10.54s user 0.00s system 99% cpu 10.540 total
Cython 0.15
% time python euler12-cython.py
842161320
python euler12-cython.py  6.27s user 0.00s system 99% cpu 6.274 total
The RPython version has a few key changes. To translate into a standalone program you need to define your target, which in this case is the main function. It is expected to accept sys.argv as its only argument and is required to return an int. You can translate it using translate.py, % translate.py euler12-rpython.py, which translates to C and compiles it for you.
# euler12-rpython.py

import math, sys

def factorCount(n):
    square = math.sqrt(n)
    isquare = int(square)
    count = -1 if isquare == square else 0
    for candidate in xrange(1, isquare + 1):
        if not n % candidate:
            count += 2
    return count

def main(argv):
    triangle = 1
    index = 1
    while factorCount(triangle) < 1001:
        index += 1
        triangle += index
    print triangle
    return 0

if __name__ == '__main__':
    main(sys.argv)

def target(*args):
    return main, None
The Cython version was rewritten as an extension module _euler12.pyx, which I import and call from a normal python file. _euler12.pyx is essentially the same as your version, with some additional static type declarations. The setup.py builds the extension with python setup.py build_ext --inplace.
# _euler12.pyx
from libc.math cimport sqrt

cdef int factorCount(int n):
    cdef int candidate, isquare, count
    cdef double square
    square = sqrt(n)
    isquare = int(square)
    count = -1 if isquare == square else 0
    for candidate in range(1, isquare + 1):
        if not n % candidate:
            count += 2
    return count

cpdef main():
    cdef int triangle = 1, index = 1
    while factorCount(triangle) < 1001:
        index += 1
        triangle += index
    print triangle

# euler12-cython.py
import _euler12
_euler12.main()

# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("_euler12", ["_euler12.pyx"])]

setup(
  name = 'Euler12-Cython',
  cmdclass = {'build_ext': build_ext},
  ext_modules = ext_modules
)
I honestly have very little experience with either RPython or Cython, and was pleasantly surprised by the results. If you are using CPython, writing your CPU-intensive bits of code in a Cython extension module looks like a really easy way to optimize your program.
Question 3: Can you offer some hints on how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.
The C implementation is suboptimal (as hinted at by Thomas M. DuBuisson): it uses 64-bit integers (i.e. the long datatype). I'll investigate the assembly listing later, but as an educated guess, there are some memory accesses going on in the compiled code, which makes using 64-bit integers significantly slower. It's either that or the generated code (be it the fact that you can fit fewer 64-bit ints in an SSE register, or that rounding a double to a 64-bit integer is slower).
Here is the modified code (simply replace long with int; I also explicitly inlined factorCount, although I do not think that is necessary with gcc -O3):
#include <stdio.h>
#include <math.h>

static inline int factorCount(int n)
{
    double square = sqrt (n);
    int isquare = (int)square;
    int count = isquare == square ? -1 : 0;
    int candidate;
    for (candidate = 1; candidate <= isquare; candidate ++)
        if (0 == n % candidate) count += 2;
    return count;
}

int main ()
{
    int triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index++;
        triangle += index;
    }
    printf ("%d\n", triangle);
}
Running and timing it gives:
$ gcc -O3 -lm -o euler12 euler12.c; time ./euler12
842161320
./euler12  2.95s user 0.00s system 99% cpu 2.956 total
For reference, the Haskell implementation by Thomas in the earlier answer gives:
$ ghc -O2 -fllvm -fforce-recomp euler12.hs; time ./euler12                                      [9:40]
[1 of 1] Compiling Main             ( euler12.hs, euler12.o )
Linking euler12 ...
842161320
./euler12  9.43s user 0.13s system 99% cpu 9.602 total
Conclusion: taking nothing away from ghc, which is a great compiler, gcc normally generates faster code.
Take a look at this blog. Over the past year or so the author has done a few Project Euler problems in both Haskell and Python, and he has generally found Haskell to be much faster. I think that between those languages it has more to do with your fluency and coding style.
When it comes to Python speed, you're using the wrong implementation! Try PyPy; for things like this you'll find it to be much, much faster.
Your Haskell implementation could be greatly sped up by using some functions from Haskell packages. In this case I used primes, which can be installed with 'cabal install primes' ;)
import Data.Numbers.Primes
import Data.List

triangleNumbers = scanl1 (+) [1..]
nDivisors n = product $ map ((+1) . length) (group (primeFactors n))
answer = head $ filter ((> 500) . nDivisors) triangleNumbers

main :: IO ()
main = putStrLn $ "First triangle number to have over 500 divisors: "
    ++ (show answer)
Timings:
Your original program:
PS> measure-command { bin\012_slow.exe }

TotalSeconds      : 16.3807409
TotalMilliseconds : 16380.7409
Improved implementation:
PS> measure-command { bin\012.exe }

TotalSeconds      : 0.0383436
TotalMilliseconds : 38.3436
As you can see, this one runs in 38 milliseconds on the same machine where yours ran in 16 seconds :)
Compilation commands:
ghc -O2 012.hs -o bin\012.exe
ghc -O2 012_slow.hs -o bin\012_slow.exe
Just for fun, below is a more "native" Haskell implementation:
import Control.Applicative
import Control.Monad
import Data.Either
import Math.NumberTheory.Powers.Squares

isInt :: RealFrac c => c -> Bool
isInt = (==) <$> id <*> fromInteger . round

intSqrt :: (Integral a) => a -> Int
--intSqrt = fromIntegral . floor . sqrt . fromIntegral
intSqrt = fromIntegral . integerSquareRoot'

factorize :: Int -> [Int]
factorize 1 = []
factorize n = first : factorize (quot n first)
  where first = (!! 0) $ [a | a <- [2..intSqrt n], rem n a == 0] ++ [n]

factorize2 :: Int -> [(Int,Int)]
factorize2 = foldl (\ls@((val,freq):xs) y -> if val == y then (val,freq+1):xs else (y,1):ls) [(0,0)] . factorize

numDivisors :: Int -> Int
numDivisors = foldl (\acc (_,y) -> acc * (y+1)) 1 <$> factorize2

nextTriangleNumber :: (Int,Int) -> (Int,Int)
nextTriangleNumber (n,acc) = (n+1,acc+n+1)

forward :: Int -> (Int, Int) -> Either (Int, Int) (Int, Int)
forward k val@(n,acc) = if numDivisors acc > k then Left val else Right (nextTriangleNumber val)

problem12 :: Int -> (Int, Int)
problem12 n = (!!0) . lefts . scanl (>>=) (forward n (1,1)) . repeat . forward $ n

main = do
  let (n,val) = problem12 1000
  print val
Compiled with ghc -O3, this runs consistently in 0.55-0.58 seconds on my machine (a 1.73 GHz Core i7).
A more efficient factorCount function for the C version:
int factorCount (int n)
{
    int count = 1;
    int candidate,tmpCount;
    while (n % 2 == 0) {
        count++;
        n /= 2;
    }
    for (candidate = 3; candidate < n && candidate * candidate < n; candidate += 2)
        if (n % candidate == 0) {
            tmpCount = 1;
            do {
                tmpCount++;
                n /= candidate;
            } while (n % candidate == 0);
            count *= tmpCount;
        }
    if (n > 1)
        count *= 2;
    return count;
}
Changing the longs to ints in main and compiling with gcc -O3 -lm, this runs consistently in 0.31-0.35 seconds.
Both can be made to run even faster if you take advantage of the fact that the n-th triangle number = n*(n+1)/2, and that n and (n+1) have completely different prime factorizations, so the number of factors of each half can be multiplied to find the number of factors of the whole. The following:
int main ()
{
    int triangle = 0,count1,count2 = 1;
    do {
        count1 = count2;
        count2 = ++triangle % 2 == 0 ?
            factorCount(triangle+1) :
            factorCount((triangle+1)/2);
    } while (count1*count2 < 1001);
    printf ("%lld\n", ((long long)triangle)*(triangle+1)/2);
}
reduces the C code's run time to 0.17-0.19 seconds, and it can handle much larger searches: more than 10000 factors takes about 43 seconds on my machine. I leave a similar Haskell speedup to the interested reader.
Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary-length integers, or don't they as long as the values are less than MAXINT?
That's unlikely. I cannot say much about Erlang and Haskell (well, maybe a bit about Haskell below), but I can point at other bottlenecks in Python. Every time the program tries to execute an operation on some values in Python, it has to verify whether the values are of the proper type, and that costs a bit of time. Your factorCount function also allocates a list with range(1, isquare + 1) many times over, and runtime, malloc-style memory allocation is way slower than iterating over a range with a counter as you do in C. Notably, factorCount() is called many times, so a lot of lists get allocated. Also, let's not forget that Python is interpreted and the CPython interpreter has no great focus on optimization.
EDIT: oh, well, I notice that you are using Python 3, so range() does not return a list but a generator. In this case my point about allocating lists is half wrong: the function just allocates range objects, which are inefficient nonetheless, but not as inefficient as allocating a list with a lot of items.
Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes, or is it my implementation? (The latter is quite probable, as Haskell is a book with seven seals to me.)
Are you using Hugs? Hugs is a considerably slow interpreter. If you are using it, maybe you can get a better time with GHC, but I am only speculating; the kind of stuff a good Haskell compiler does under the hood is pretty fascinating and way beyond my comprehension :)
Question 3: Can you offer some hints on how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.
I'd say you are playing an unfair game. The best part of knowing various languages is to use them in their most different ways :) But I digress; I just don't have any recommendation for this point. Sorry, I hope someone can help you in this case :)
Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?
As far as I remember, you just need to make sure that your recursive call is the last command before returning a value. In other words, a function like the one below can use such an optimization:
def factorial(n, acc=1):
    if n > 1:
        acc = acc * n
        n = n - 1
        return factorial(n, acc)
    else:
        return acc
However, you would not have such an optimization if your function were like the one below, because there is an operation (a multiplication) after the recursive call:
def factorial2(n):
    if n > 1:
        f = factorial2(n-1)
        return f*n
    else:
        return 1
I separated the operations into some local variables to make it clear which operations are executed. However, it is more usual to see these functions written as below, and they are equivalent for the point I am making:
def factorial(n, acc=1):
    if n > 1:
        return factorial(n-1, acc*n)
    else:
        return acc

def factorial2(n):
    if n > 1:
        return n*factorial(n-1)
    else:
        return 1
Note that it is up to the compiler/interpreter to decide whether it will do tail recursion. For example, the Python interpreter does not do it, if I remember well (I used Python in my examples only because of its fluent syntax). Anyway, if you find strange functions with two parameters where one of them has a name such as acc, accumulator, etc., now you know why people do that :)
With Haskell, you really don't need to think explicitly about recursion.
factorCount number = foldr factorCount' 0 [1..isquare] -
                     (fromEnum $ square == fromIntegral isquare)
    where
      square = sqrt $ fromIntegral number
      isquare = floor square
      factorCount' candidate
        | number `rem` candidate == 0 = (2 +)
        | otherwise = id

triangles :: [Int]
triangles = scanl1 (+) [1,2..]

main = print . head $ dropWhile ((< 1001) . factorCount) triangles
In the code above I have replaced the explicit recursion in @Thomas' answer with common list operations. The code still does exactly the same thing without us worrying about tail recursion. It runs (~7.49s), about 6% slower than the version in @Thomas' answer (~7.04s) on my machine with GHC 7.6.2, while the C version from @Raedwulf runs in ~3.15s. Seems GHC has improved over the year.
PS. I know this is an old question, and I stumbled upon it from a Google search (I forget what I was searching for now...). I just wanted to comment on the question about LCO and express my feelings about Haskell in general. I wanted to comment on the top answer, but comments do not allow code blocks.
Looking at your Erlang implementation: the timing includes the start-up of the entire virtual machine, running your program, and halting the virtual machine. I am pretty sure that setting up and halting the Erlang VM takes some time.
If the timing were done within the Erlang virtual machine itself, the results would be different, as in that case we would measure only the actual time of the program in question. Otherwise, I believe that the total time taken to start and load the Erlang VM, plus the time to halt it (as you do in your program), is included in the total time that your timing method reports while the program is producing its output. Consider using Erlang's own timing, timer:tc/1, timer:tc/2 or timer:tc/3, which is what we use when we want to time our programs within the virtual machine itself. That way, the results from Erlang would exclude the time taken to start and stop/kill/halt the virtual machine. That is my reasoning; think about it, and then try your benchmark again.
I actually suggest that we try to time the program (for languages that have a runtime) within the runtime of those languages in order to get a precise value. C, for example, has no overhead of starting and shutting down a runtime system, unlike Erlang, Python and Haskell (98% sure of this; I stand to be corrected). So (based on this reasoning) I conclude by saying that this benchmark was not precise/fair enough for languages running on top of a runtime system. Let's do it again with these changes.
EDIT: besides, even if all the languages had runtime systems, the overhead of starting and halting each would differ. So I suggest we time from within the runtime systems (for the languages for which this applies). The Erlang VM is known to have considerable overhead at start-up!
Some more numbers and explanations for the C version. Apparently no one did it during all those years. Remember to upvote this answer so it can get on top for everyone to see and learn.
Step One: Benchmark of the author's programs
Laptop Specifications:
- CPU i3 M380 (931 MHz – maximum battery saving mode)
- 4GB memory
- Win7 64 bits
- Microsoft Visual Studio 2012 Ultimate
- Cygwin with gcc 4.9.3
- Python 2.7.10
Commands:
compiling on VS x64 command prompt > `for /f %f in ('dir /b *.c') do cl /O2 /Ot /Ox %f -o %f_x64_vs2012.exe`
compiling on cygwin with gcc x64   > `for f in ./*.c; do gcc -m64 -O3 $f -o ${f}_x64_gcc.exe ; done`
time (unix tools) using cygwin     > `for f in ./*.exe; do echo "----------"; echo $f ; time $f ; done`
----------
$ time python ./original.py
real 2m17.748s   user 2m15.783s   sys 0m0.093s
----------
$ time ./original_x86_vs2012.exe
real 0m8.377s    user 0m0.015s    sys 0m0.000s
----------
$ time ./original_x64_vs2012.exe
real 0m8.408s    user 0m0.000s    sys 0m0.015s
----------
$ time ./original_x64_gcc.exe
real 0m20.951s   user 0m20.732s   sys 0m0.030s
Filenames are: integertype_architecture_compiler.exe
- integertype is the same as the original program for now (more on that later)
- architecture is x86 or x64 depending on the compiler settings
- compiler is gcc or vs2012
Step Two: Investigate, Improve and Benchmark Again
VS is 250% faster than gcc. The two compilers should give a similar speed. Obviously, something is wrong with the code or the compiler options. Let's investigate!
The first point of interest is the integer types. Conversions can be expensive and consistency is important for better code generation & optimizations. All integers should be the same type.
It's a mixed mess of int and long right now. We're going to improve that. What type to use? The fastest. Gotta benchmark them all!
----------
$ time ./int_x86_vs2012.exe
real 0m8.440s    user 0m0.016s    sys 0m0.015s
----------
$ time ./int_x64_vs2012.exe
real 0m8.408s    user 0m0.016s    sys 0m0.015s
----------
$ time ./int32_x86_vs2012.exe
real 0m8.408s    user 0m0.000s    sys 0m0.015s
----------
$ time ./int32_x64_vs2012.exe
real 0m8.362s    user 0m0.000s    sys 0m0.015s
----------
$ time ./int64_x86_vs2012.exe
real 0m18.112s   user 0m0.000s    sys 0m0.015s
----------
$ time ./int64_x64_vs2012.exe
real 0m18.611s   user 0m0.000s    sys 0m0.015s
----------
$ time ./long_x86_vs2012.exe
real 0m8.393s    user 0m0.015s    sys 0m0.000s
----------
$ time ./long_x64_vs2012.exe
real 0m8.440s    user 0m0.000s    sys 0m0.015s
----------
$ time ./uint32_x86_vs2012.exe
real 0m8.362s    user 0m0.000s    sys 0m0.015s
----------
$ time ./uint32_x64_vs2012.exe
real 0m8.393s    user 0m0.015s    sys 0m0.015s
----------
$ time ./uint64_x86_vs2012.exe
real 0m15.428s   user 0m0.000s    sys 0m0.015s
----------
$ time ./uint64_x64_vs2012.exe
real 0m15.725s   user 0m0.015s    sys 0m0.015s
----------
$ time ./int_x64_gcc.exe
real 0m8.531s    user 0m8.329s    sys 0m0.015s
----------
$ time ./int32_x64_gcc.exe
real 0m8.471s    user 0m8.345s    sys 0m0.000s
----------
$ time ./int64_x64_gcc.exe
real 0m20.264s   user 0m20.186s   sys 0m0.015s
----------
$ time ./long_x64_gcc.exe
real 0m20.935s   user 0m20.809s   sys 0m0.015s
----------
$ time ./uint32_x64_gcc.exe
real 0m8.393s    user 0m8.346s    sys 0m0.015s
----------
$ time ./uint64_x64_gcc.exe
real 0m16.973s   user 0m16.879s   sys 0m0.030s
The integer types tested are int, long, int32_t, uint32_t, int64_t and uint64_t from #include <stdint.h>.
There are LOTS of integer types in C, plus some signed/unsigned to play with, plus the choice to compile as x86 or x64 (not to be confused with the actual integer size). That is a lot of versions to compile and run ^^
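One way those variants might have been generated without maintaining eighteen separate source files (my own sketch; the macro name and file layout are assumptions, not the answer's actual setup) is to parameterize the integer type and pick it on the compiler command line:

/* euler12_any.c -- hypothetical parameterized variant of the original program.
 * Select the integer type at compile time, e.g.:
 *   gcc -m64 -O3 -DEULER_INT=uint64_t euler12_any.c -o uint64_x64_gcc.exe */
#include <stdio.h>
#include <math.h>
#include <stdint.h>

#ifndef EULER_INT
#define EULER_INT int32_t            /* int, long, int64_t, uint32_t, ... */
#endif
typedef EULER_INT num_t;

static int factorCount (num_t n)
{
    double square = sqrt ((double) n);
    num_t isquare = (num_t) square;
    int count = ((double) isquare == square) ? -1 : 0;
    num_t candidate;
    for (candidate = 1; candidate <= isquare; candidate++)
        if (n % candidate == 0) count += 2;
    return count;
}

int main (void)
{
    num_t triangle = 1;
    num_t index = 1;
    while (factorCount (triangle) < 1001) {
        index++;
        triangle += index;
    }
    printf ("%lld\n", (long long) triangle);
    return 0;
}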
Step Three: Understanding the Numbers
Definitive conclusions:
- 32 bits integers are ~200% faster than 64 bits equivalents
- unsigned 64 bits integers are 25% faster than signed 64 bits (Unfortunately, I have no explanation for that)
Trick question: "What are the sizes of int and long in C?"
The right answer is: The size of int and long in C are not well-defined!
From the C spec:
int is at least 16 bits (32 bits in practice on modern platforms)
long is at least an int
From the gcc man page (-m32 and -m64 flags):
The 32-bit environment sets int, long and pointer to 32 bits and generates code that runs on any i386 system.
The 64-bit environment sets int to 32 bits and long and pointer to 64 bits and generates code for AMD's x86-64 architecture.
From MSDN documentation (Data Type Ranges) https://msdn.microsoft.com/en-us/library/s3f49ktz%28v=vs.110%29.aspx :
int, 4 bytes, also known as signed
long, 4 bytes, also known as long int and signed long int
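A quick way to see what your own compiler and target actually pick (a trivial check of my own, not from the answer):

/* Prints the integer sizes for the current compiler/architecture, which is
 * exactly what differs between the x86 and x64 builds above. Casts to
 * unsigned keep the printf portable to older MSVC versions without %zu. */
#include <stdio.h>
#include <stdint.h>

int main (void)
{
    printf ("int       : %u bytes\n", (unsigned) sizeof (int));
    printf ("long      : %u bytes\n", (unsigned) sizeof (long));
    printf ("long long : %u bytes\n", (unsigned) sizeof (long long));
    printf ("int32_t   : %u bytes\n", (unsigned) sizeof (int32_t));
    printf ("int64_t   : %u bytes\n", (unsigned) sizeof (int64_t));
    printf ("pointer   : %u bytes\n", (unsigned) sizeof (void *));
    return 0;
}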
To Conclude This: Lessons Learned
- 32 bits integers are faster than 64 bits integers.
- Standard integer types are not well defined in C nor C++; they vary depending on compilers and architectures. When you need consistency and predictability, use the uint32_t integer family from #include <stdint.h>.
- Speed issue solved. All the other languages are back hundreds of percent behind. C & C++ win again! They always do. The next improvement will be multithreading using OpenMP :D
Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?
Question one can be answered in the negative for Erlang. The last question is answered by using Erlang appropriately, as in:
http://bredsaal.dk/learning-erlang-using-projecteuler-net
Since it's faster than your initial C example, I would guess there are numerous problems as others have already covered in detail.
This Erlang module executes on a cheap netbook in about 5 seconds … It uses the network threads model in erlang and, as such demonstrates how to take advantage of the event model. It could be distributed over many nodes. And it's fast. Not my code.
-module(p12dist).
-author("Jannich Brendle, jannich@bredsaal.dk, http://blog.bredsaal.dk").
-compile(export_all).

server() ->
  server(1).

server(Number) ->
  receive {getwork, Worker_PID} -> Worker_PID ! {work,Number,Number+100},
  server(Number+101);
  {result,T} -> io:format("The result is: ~w.~n", [T]);
  _ -> server(Number)
  end.

worker(Server_PID) ->
  Server_PID ! {getwork, self()},
  receive {work,Start,End} -> solve(Start,End,Server_PID)
  end,
  worker(Server_PID).

start() ->
  Server_PID = spawn(p12dist, server, []),
  spawn(p12dist, worker, [Server_PID]),
  spawn(p12dist, worker, [Server_PID]),
  spawn(p12dist, worker, [Server_PID]),
  spawn(p12dist, worker, [Server_PID]).

solve(N,End,_) when N =:= End -> no_solution;

solve(N,End,Server_PID) ->
  T=round(N*(N+1)/2),
  case (divisor(T,round(math:sqrt(T))) > 500) of
    true -> Server_PID ! {result,T};
    false -> solve(N+1,End,Server_PID)
  end.

divisors(N) ->
  divisor(N,round(math:sqrt(N))).

divisor(_,0) -> 1;
divisor(N,I) ->
  case (N rem I) =:= 0 of
    true -> 2+divisor(N,I-1);
    false -> divisor(N,I-1)
  end.
The test below took place on an: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
~$ time erl -noshell -s p12dist start
The result is: 76576500.
^C

BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
a

real    0m5.510s
user    0m5.836s
sys     0m0.152s
C++11, < 20ms for me – Run it here
I understand that you want tips to help improve your language specific knowledge, but since that has been well covered here, I thought I would add some context for people who may have looked at the mathematica comment on your question, etc, and wondered why this code was so much slower.
This answer is mainly to provide context to hopefully help people evaluate the code in your question / other answers more easily.
This code uses only a couple of (uglyish) optimisations, unrelated to the language used, based on:
- every triangle number is of the form n(n+1)/2
- n and n+1 are coprime
- the number of divisors is a multiplicative function
#include <iostream>
#include <cmath>
#include <tuple>
#include <chrono>

using namespace std;

// Calculates the divisors of an integer by determining its prime factorisation.
int get_divisors(long long n)
{
    int divisors_count = 1;

    for(long long i = 2; i <= sqrt(n); /* empty */)
    {
        int divisions = 0;
        while(n % i == 0)
        {
            n /= i;
            divisions++;
        }

        divisors_count *= (divisions + 1);

        //here, we try to iterate more efficiently by skipping
        //obvious non-primes like 4, 6, etc
        if(i == 2)
            i++;
        else
            i += 2;
    }

    if(n != 1) //n is a prime
        return divisors_count * 2;
    else
        return divisors_count;
}

long long euler12()
{
    //n and n + 1
    long long n, n_p_1;

    n = 1; n_p_1 = 2;

    // divisors_x will store either the divisors of x or x/2
    // (the later iff x is divisible by two)
    long long divisors_n = 1;
    long long divisors_n_p_1 = 2;

    for(;;)
    {
        /* This loop has been unwound, so two iterations are completed at a time
         * n and n + 1 have no prime factors in common and therefore we can
         * calculate their divisors separately */

        long long total_divisors;                 //the divisors of the triangle number
                                                  // n(n+1)/2

        //the first (unwound) iteration
        divisors_n_p_1 = get_divisors(n_p_1 / 2); //here n+1 is even and we

        total_divisors = divisors_n * divisors_n_p_1;

        if(total_divisors > 1000)
            break;

        //move n and n+1 forward
        n = n_p_1;
        n_p_1 = n + 1;

        //fix the divisors
        divisors_n = divisors_n_p_1;
        divisors_n_p_1 = get_divisors(n_p_1);     //n_p_1 is now odd!

        //now the second (unwound) iteration
        total_divisors = divisors_n * divisors_n_p_1;

        if(total_divisors > 1000)
            break;

        //move n and n+1 forward
        n = n_p_1;
        n_p_1 = n + 1;

        //fix the divisors
        divisors_n = divisors_n_p_1;
        divisors_n_p_1 = get_divisors(n_p_1 / 2); //n_p_1 is now even!
    }

    return (n * n_p_1) / 2;
}

int main()
{
    for(int i = 0; i < 1000; i++)
    {
        using namespace std::chrono;
        auto start = high_resolution_clock::now();
        auto result = euler12();
        auto end = high_resolution_clock::now();

        double time_elapsed = duration_cast<milliseconds>(end - start).count();

        cout << result << " " << time_elapsed << '\n';
    }
    return 0;
}
That takes around 19ms on average for my desktop and 80ms for my laptop, a far cry from most of the other code I've seen here. And there are, no doubt, many optimisations still available.
Trying GO:
package main

import "fmt"
import "math"

func main() {
    var n, m, c int
    for i := 1; ; i++ {
        n, m, c = i * (i + 1) / 2, int(math.Sqrt(float64(n))), 0
        for f := 1; f < m; f++ {
            if n % f == 0 {
                c++
            }
        }
        c *= 2
        if m * m == n {
            c++
        }
        if c > 1001 {
            fmt.Println(n)
            break
        }
    }
}
I get:
original c version: 9.1690 100%
go: 8.2520 111%
But using:
package main

import (
    "math"
    "fmt"
)

// Sieve of Eratosthenes
func PrimesBelow(limit int) []int {
    switch {
    case limit < 2:
        return []int{}
    case limit == 2:
        return []int{2}
    }
    sievebound := (limit - 1) / 2
    sieve := make([]bool, sievebound+1)
    crosslimit := int(math.Sqrt(float64(limit))-1) / 2
    for i := 1; i <= crosslimit; i++ {
        if !sieve[i] {
            for j := 2 * i * (i + 1); j <= sievebound; j += 2*i + 1 {
                sieve[j] = true
            }
        }
    }
    plimit := int(1.3*float64(limit)) / int(math.Log(float64(limit)))
    primes := make([]int, plimit)
    p := 1
    primes[0] = 2
    for i := 1; i <= sievebound; i++ {
        if !sieve[i] {
            primes[p] = 2*i + 1
            p++
            if p >= plimit {
                break
            }
        }
    }
    last := len(primes) - 1
    for i := last; i > 0; i-- {
        if primes[i] != 0 {
            break
        }
        last = i
    }
    return primes[0:last]
}

func main() {
    fmt.Println(p12())
}

// Requires PrimesBelow from utils.go
func p12() int {
    n, dn, cnt := 3, 2, 0
    primearray := PrimesBelow(1000000)
    for cnt <= 1001 {
        n++
        n1 := n
        if n1%2 == 0 {
            n1 /= 2
        }
        dn1 := 1
        for i := 0; i < len(primearray); i++ {
            if primearray[i]*primearray[i] > n1 {
                dn1 *= 2
                break
            }
            exponent := 1
            for n1%primearray[i] == 0 {
                exponent++
                n1 /= primearray[i]
            }
            if exponent > 1 {
                dn1 *= exponent
            }
            if n1 == 1 {
                break
            }
        }
        cnt = dn * dn1
        dn = dn1
    }
    return n * (n - 1) / 2
}
I get:
original c version: 9.1690 100%
thaumkid's c version: 0.1060 8650%
first go version: 8.2520 111%
second go version: 0.0230 39865%
I also tried Python3.6 and pypy3.3-5.5-alpha:
original c version: 8.629 100%
thaumkid's c version: 0.109 7916%
Python3.6: 54.795 16%
pypy3.3-5.5-alpha: 13.291 65%
and then with the following code I got:
original c version: 8.629 100%
thaumkid's c version: 0.109 8650%
Python3.6: 1.489 580%
pypy3.3-5.5-alpha: 0.582 1483%
def D(N):
    if N == 1: return 1
    sqrtN = int(N ** 0.5)
    nf = 1
    for d in range(2, sqrtN + 1):
        if N % d == 0:
            nf = nf + 1
    return 2 * nf - (1 if sqrtN**2 == N else 0)

L = 1000
Dt, n = 0, 0

while Dt <= L:
    t = n * (n + 1) // 2
    Dt = D(n/2)*D(n+1) if n%2 == 0 else D(n)*D((n+1)/2)
    n = n + 1

print (t)
Change: case (divisor(T,round(math:sqrt(T))) > 500) of
To: case (divisor(T,round(math:sqrt(T))) > 1000) of
This will produce the correct answer for the Erlang multi-process example.
I made the assumption that the number of factors is only large if the numbers involved have many small factors. So I used thaumkid's excellent algorithm, but first computed an approximation to the factor count that is never too small. It's quite simple: check for prime factors up to 29, then examine the remaining number and calculate an upper bound for the number of factors. If that upper bound is high enough, calculate the exact number of factors.
The code below doesn't need this assumption for correctness, but to be fast. It seems to work; only about one in 100,000 numbers gives an estimate that is high enough to require a full check.
Here's the code:
// Return at least the number of factors of n.
static uint64_t approxfactorcount (uint64_t n)
{
    uint64_t count = 1, add;

#define CHECK(d)                            \
        do {                                \
            if (n % d == 0) {               \
                add = count;                \
                do { n /= d; count += add; }\
                while (n % d == 0);         \
            }                               \
        } while (0)

    CHECK ( 2); CHECK ( 3); CHECK ( 5); CHECK ( 7); CHECK (11); CHECK (13);
    CHECK (17); CHECK (19); CHECK (23); CHECK (29);
    if (n == 1) return count;
    if (n < 1ull * 31 * 31) return count * 2;
    if (n < 1ull * 31 * 31 * 37) return count * 4;
    if (n < 1ull * 31 * 31 * 37 * 37) return count * 8;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41) return count * 16;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43) return count * 32;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47) return count * 64;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53) return count * 128;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59) return count * 256;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61) return count * 512;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61 * 67) return count * 1024;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61 * 67 * 71) return count * 2048;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61 * 67 * 71 * 73) return count * 4096;
    return count * 1000000;
}

// Return the number of factors of n.
static uint64_t factorcount (uint64_t n)
{
    uint64_t count = 1, add;

    CHECK (2);
    CHECK (3);

    uint64_t d = 5, inc = 2;
    for (; d*d <= n; d += inc, inc = (6 - inc))
        CHECK (d);

    if (n > 1) count *= 2; // n must be a prime number
    return count;
}

// Prints triangular numbers with record numbers of factors.
static void printrecordnumbers (uint64_t limit)
{
    uint64_t record = 30000;

    uint64_t count1, factor1;
    uint64_t count2 = 1, factor2 = 1;

    for (uint64_t n = 1; n <= limit; ++n)
    {
        factor1 = factor2;
        count1 = count2;

        factor2 = n + 1; if (factor2 % 2 == 0) factor2 /= 2;
        count2 = approxfactorcount (factor2);

        if (count1 * count2 > record)
        {
            uint64_t factors = factorcount (factor1) * factorcount (factor2);
            if (factors > record)
            {
                printf ("%lluth triangular number = %llu has %llu factors\n",
                        n, factor1 * factor2, factors);
                record = factors;
            }
        }
    }
}
This finds the 14,753,024th triangular with 13824 factors in about 0.7 seconds, the 879,207,615th triangular number with 61,440 factors in 34 seconds, the 12,524,486,975th triangular number with 138,240 factors in 10 minutes 5 seconds, and the 26,467,792,064th triangular number with 172,032 factors in 21 minutes 25 seconds (2.4GHz Core2 Duo), so this code takes only 116 processor cycles per number on average. The last triangular number itself is larger than 2^68, so
I modified "Jannich Brendle" version to 1000 instead 500. And list the result of euler12.bin, euler12.erl, p12dist.erl. Both erl codes use '+native' to compile.
zhengs-MacBook-Pro:workspace zhengzhibin$ time erl -noshell -s p12dist start
The result is: 842161320.

real    0m3.879s
user    0m14.553s
sys     0m0.314s
zhengs-MacBook-Pro:workspace zhengzhibin$ time erl -noshell -s euler12 solve
842161320

real    0m10.125s
user    0m10.078s
sys     0m0.046s
zhengs-MacBook-Pro:workspace zhengzhibin$ time ./euler12.bin
842161320

real    0m5.370s
user    0m5.328s
sys     0m0.004s
zhengs-MacBook-Pro:workspace zhengzhibin$
#include <stdio.h>
#include <math.h>

int factorCount (long n)
{
    double square = sqrt (n);
    int isquare = (int) square+1;
    long candidate = 2;
    int count = 1;
    while(candidate <= isquare && candidate <= n){
        int c = 1;
        while (n % candidate == 0) {
            c++;
            n /= candidate;
        }
        count *= c;
        candidate++;
    }
    return count;
}

int main ()
{
    long triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index ++;
        triangle += index;
    }
    printf ("%ld\n", triangle);
}
gcc -lm -Ofast euler.c
time ./a.out
2.79s user 0.00s system 99% cpu 2.794 total