Fastest way to find lines of one text file in another, much larger text file in Bash
I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion. file1.txt has a single field f1 per line, while file2.txt has 3 fields, f1 through f3, delimited by |.
I want to find all lines from file2.txt where f1 of file1.txt matches f2 of file2.txt (or anywhere on the line, if we don't want to spend the extra time splitting the values of file2.txt).
file1.txt (about 14K lines, not sorted):
foo1
foo2
...
bar1
bar2
...
file2.txt (about 2 billion lines, not sorted):
date1|foo1|number1
date2|foo2|number2
date3|foo3|number3
...
date1|bar1|number1
date2|bar2|number2
date3|bar3|number3
Expected output:
date1|foo1|number1
date2|foo2|number2
date1|bar1|number1
date2|bar2|number2
...
Here is what I have tried so far, and it seems to take several hours to run:
fgrep -F -f file1.txt file2.txt > file.matched
I'd like to know if there is a better and faster way to do this with common Unix commands or with a small script.
A Perl solution. [See the Update and the Note below.]
Use a hash for the first file's lines. As you read the big file line by line, extract the field either with a regex (capture the first pattern between the |s) or with split (get the second word), and print the line if it exists in the hash. They likely differ a bit in speed (time them). The defined check isn't needed with the regex, while for split use // (defined-or), which short-circuits.
use warnings;
use strict;

# If 'prog smallfile bigfile' is the preferred use
die "Usage: $0 smallfile bigfile" if @ARGV != 2;
my ($smallfile, $bigfile) = @ARGV;

open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";
my %word = map { chomp; $_ => 1 } <$fh>;

open $fh, '<', $bigfile or die "Can't open $bigfile: $!";
while (<$fh>)
{
    exists $word{ (/\|([^|]+)/)[0] } && print;

    # Or
    #exists $word{ (split /\|/)[1] // '' } && print;
}
close $fh;
Avoiding the if branch and using short-circuiting is faster, but only by very little. Over billions of lines these tweaks add up, but not to much. It may (or may not) be a touch faster to read the small file line by line instead of in list context as above, but this should not be noticeable.
Update   Writing to STDOUT saves two operations, and I repeatedly time it to be a little faster than writing to a file. Such usage is also consistent with most UNIX tools, so I changed to writing to STDOUT. Next, the exists test is not needed and dropping it spares an operation. However, I consistently get slightly better runtimes with it, and it also conveys the purpose better. Altogether, I am leaving it in. Thanks to ikegami for comments.
Note   The commented-out version is about 50% faster than the other one, by my benchmark. These are both given because they differ: one finds the first match and the other the second field. I keep the more general one as the default, since the question is ambiguous about that.
Some comparisons (benchmarks) [updated for writing to STDOUT, see the 'Update' above]
There is an extensive analysis in the answer by Håkon Hægland, with a single run of most of the solutions. Here is a different take, benchmarking the two solutions above, the OP's own answer, and the posted fgrep one, which is expected to be fast and is used in the question and in many answers.
I build the test data in the following way. A handful of lines of roughly the lengths shown above are made with random words, for both files, so that they match in the second field. Then I pad this 'seed' with data samples of lines that won't match, to mimic the ratio between sizes and matches quoted by the OP: for 14K lines in the small file there are 1.3M lines in the big file, yielding 126K matches. These samples are then written out repeatedly to build full data files of the OP's sizes, shuffled each time using List::Util.
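For illustration, a rough sketch of that kind of generator is shown below. This is not the script actually used; the word lengths, file names, and proportions are assumptions chosen to approximate the ratios described above.

use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical sketch only: 14K random 6-letter "words" for the small file
my @words = map { join '', map { ('a'..'z')[rand 26] } 1 .. 6 } 1 .. 14_000;

open my $fh, '>', 'file1.txt' or die $!;
print $fh "$_\n" for @words;
close $fh;

# Seed for the big file: matching lines padded with non-matching fillers,
# aiming at roughly 126K matches per 1.3M lines
my @seed;
push @seed, "date|$_|number"          for @words[0 .. 12_599];
push @seed, "date|filler${_}x|number" for 1 .. 117_400;

open $fh, '>', 'file2.txt' or die $!;
for my $copy (1 .. 10) {
    print $fh "$_\n" for shuffle @seed;   # write the shuffled seed repeatedly
}
close $fh;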
All runs compared below produce 106_120 matches for the above file sizes (checked with diff), so the matching frequency is close enough. They are benchmarked by calling the complete programs using my $res = timethese(60 ...); the results of cmpthese($res) are:
        Rate  regex  cfor  split  fgrep
regex  1.05/s    --  -23%   -35%   -44%
cfor   1.36/s   30%    --   -16%   -28%
split  1.62/s   54%   19%     --   -14%
fgrep  1.89/s   80%   39%    17%     --
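For reference, a harness along those lines — timethese over the complete programs, followed by cmpthese — might look roughly like the sketch below. The script names and shell invocations are placeholders, not the code actually used.

use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Hypothetical harness: each sub runs one complete program over the same
# generated data files (names here are made up for illustration).
my $res = timethese(60, {
    'regex' => sub { system('./match_regex.pl file1.txt file2.txt > /dev/null') },
    'split' => sub { system('./match_split.pl file1.txt file2.txt > /dev/null') },
    'cfor'  => sub { system('./match_cfor.pl  file1.txt file2.txt > /dev/null') },
    'fgrep' => sub { system('fgrep -f file1.txt file2.txt > /dev/null') },
});
cmpthese($res);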
That the optimized C program fgrep comes out on top is not surprising. The lag of 'regex' behind 'split' is likely due to the overhead of starting the engine for a small match, many times over. This may vary across Perl versions, given the evolving regex engine optimizations. I include the answer of @codeforester ('cfor') since it was claimed to be the fastest; its 20% lag behind the very similar 'split' is likely due to scattered small inefficiencies (see the comments below that answer).
None of this is drastically different, though there are certainly differences across hardware and software and in the details of the data. I ran this on different Perls and machines, and the notable difference is that in some cases fgrep was indeed an order of magnitude faster.
The OP's experience of fgrep being very slow is surprising. Given their quoted run times, orders of magnitude slower than the above, I'd guess that there is an old system to 'blame'.
Even though this is entirely I/O bound, there are concurrency benefits to be had from putting it on multiple cores, and I'd expect a decent speedup, up to a factor of a few.
I have attempted to do a comparison of some of the methods presented here.
First, I created a Perl script to generate the input files file1.txt and file2.txt. In order to compare some of the solutions, I made sure that the words from file1.txt could only appear in the second field of file2.txt. Also, to be able to use the join solution suggested by @GeorgeVasiliou, I sorted file1.txt and file2.txt. Currently, I generate the input files based on only 75 random words (taken from https://www.randomlists.com/random-words). Only 5 of these 75 words are used in file1.txt; the remaining seventy words are used to fill up the fields of file2.txt. It might be necessary to increase the number of words substantially to get realistic results (according to the OP, the original file1.txt contained 14000 words). In the tests below I used a file2.txt with 1000000 (1 million) lines. The script also generates the file regexp1.txt required by the grep solution of @BOC.
gen_input_files.pl :
#! /usr/bin/env perl use feature qw(say); use strict; use warnings; use Data::Printer; use Getopt::Long; GetOptions ("num_lines=i" => \my $nlines ) or die("Error in command line arguments\n"); # Generated random words from site: https://www.randomlists.com/random-words my $word_filename = 'words.txt'; # 75 random words my $num_match_words = 5; my $num_file2_lines = $nlines || 1_000_000; my $file2_words_per_line = 3; my $file2_match_field_no = 2; my $file1_filename = 'file1.txt'; my $file2_filename = 'file2.txt'; my $file1_regex_fn = 'regexp1.txt'; say "generating $num_file2_lines lines.."; my ( $words1, $words2 ) = get_words( $word_filename, $num_match_words ); write_file1( $file1_filename, $words2 ); write_file2( $file2_filename, $words1, $words2, $num_file2_lines, $file2_words_per_line, $file2_match_field_no ); write_BOC_regexp_file( $file1_regex_fn, $words2 ); sub write_BOC_regexp_file { my ( $fn, $words ) = @_; open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!"; print $fh '\\|' . (join "|", @$words) . '\\|'; close $fh; } sub write_file2 { my ( $fn, $words1, $words2, $nlines, $words_per_line, $field_no ) = @_; my $nwords1 = scalar @$words1; my $nwords2 = scalar @$words2; my @lines; for (1..$nlines) { my @words_line; my $key; for (1..$words_per_line) { my $word; if ( $_ != $field_no ) { my $index = int (rand $nwords1); $word = @{ $words1 }[$index]; } else { my $index = int (rand($nwords1 + $nwords2) ); if ( $index < $nwords2 ) { $word = @{ $words2 }[$index]; } else { $word = @{ $words1 }[$index - $nwords2]; } $key = $word; } push @words_line, $word; } push @lines, [$key, (join "|", @words_line)]; } @lines = map { $_->[1] } sort { $a->[0] cmp $b->[0] } @lines; open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!"; print $fh (join "\n", @lines); close $fh; } sub write_file1 { my ( $fn, $words ) = @_; open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!"; print $fh (join "\n", sort @$words); close $fh; } sub get_words { my ( $fn, $N ) = @_; open( my $fh, '<', $fn ) or die "Could not open file '$fn': $!"; my @words = map {chomp $_; $_} <$fh>; close $fh; my @words1 = @words[$N..$#words]; my @words2 = @words[0..($N - 1)]; return ( \@words1, \@words2 ); }
Next, I created a sub folder solutions with all the test cases:
$ tree solutions/
solutions/
├── BOC1
│   ├── out.txt
│   └── run.sh
├── BOC2
│   ├── out.txt
│   └── run.sh
├── codeforester
│   ├── out.txt
│   ├── run.pl
│   └── run.sh
[...]
Here the file out.txt is the output from the greps for each solution. The script run.sh runs the solution for the given test case.
Notes on the different solutions

- BOC1: first solution suggested by @BOC: grep -E -f regexp1.txt file2.txt
- BOC2: second solution suggested by @BOC: LC_ALL=C grep -E -f regexp1.txt file2.txt
- codeforester: the accepted Perl solution by @codeforester (see source)
- codeforester_orig: the original solution by @codeforester: fgrep -f file1.txt file2.txt
- dawg: Python solution using a dictionary and split lines, provided by @dawg (see source)
- gregory1: solution using GNU Parallel suggested by @gregory: parallel -k --pipepart -a file2.txt --block "$block_size" fgrep -F -f file1.txt — see the notes below on how to choose $block_size.
- hakon1: Perl solution provided by @HåkonHægland (see source). This solution requires compiling the C extension the first time the code is run. It does not require recompilation when file1.txt or file2.txt change. Note: the time used to compile the C extension at the initial run is not included in the run times presented below.
- ikegami: solution using an assembled regexp together with grep -P, as given by @ikegami. Note: the assembled regexp was written to a separate file regexp_ikegami.txt, so the runtime of generating the regexp is not included in the comparison below. This is the code used:
  regexp=$(< "regexp_ikegami.txt")
  grep -P "$regexp" file2.txt
- inian1: first solution by @Inian using match(): awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (match($0,i)) {print; break} }' file1.txt FS='|' file2.txt
- inian2: second solution by @Inian using index(): awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (index($0,i)) {print; break} }' file1.txt FS='|' file2.txt
- inian3: third solution by @Inian checking only the $2 field: awk 'FNR==NR{ hash[$1]; next } $2 in hash' file1.txt FS='|' file2.txt
- inian4: fourth solution by @Inian (basically the same as codeforester_orig but with LC_ALL): LC_ALL=C fgrep -f file1.txt file2.txt
- inian5: fifth solution by @Inian (same as inian1 but with LC_ALL): LC_ALL=C awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (match($0,i)) {print; break} }' file1.txt FS='|' file2.txt
- inian6: same as inian3 but with LC_ALL=C. Thanks to @GeorgeVasiliou for the suggestion.
- jjoao: compiled flex-generated C code provided by @JJoao (see source). Note: the executable must be recompiled each time file1.txt changes. The time used to compile the executable is not included in the run times presented below.
- oliv: Python script provided by @oliv (see source)
- Vasiliou: using join as suggested by @GeorgeVasiliou: join --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 file1.txt file2.txt
- Vasiliou2: same as Vasiliou, but with LC_ALL=C
- zdim: Perl script provided by @zdim (see source). Note: this uses the regexp search version (rather than the split-line solution).
- zdim2: same as zdim, except that it uses split instead of a regexp search for the field in file2.txt.
Notes

- I experimented a bit with GNU Parallel (see the gregory1 solution above) to determine the optimal block size for my CPU. I have 4 cores, and currently it seems that the optimal choice is to divide the file (file2.txt) into 4 equal-sized chunks and run a single job on each of the 4 processors. More testing may be needed here. So for the first test case, where file2.txt is 20M, I set $block_size to 5M (see the gregory1 solution above), while for the more realistic case presented below, where file2.txt is 268M, a $block_size of 67M was used.

- The solutions BOC1, BOC2, codeforester_orig, inian1, inian4, inian5, and gregory1 all used loose matching. This means that the words from file1.txt did not have to match exactly in field #2 of file2.txt; a match anywhere on the line was accepted. Since this behavior made it harder to compare them with the other methods, some modified methods were also introduced. The first two, called BOC1B and BOC2B, used a modified regexp1.txt file. The lines of the original regexp1.txt were of the form \|foo1|foo2|...|fooN\|, which would match the words at any field boundary. The modified file, regexp1b.txt, instead anchors the match to field #2 exclusively, using the form ^[^|]*\|foo1|foo2|...|fooN\|.

  The remaining modified methods, codeforester_origB, inian1B, inian4B, inian5B, and gregory1B, used a modified file1.txt. Instead of a literal word per line, the modified file file1b.txt used one regex per line, of the form:

  ^[^|]*\|word1\|
  ^[^|]*\|word2\|
  ^[^|]*\|word3\|
  [...]

  In addition, fgrep -f was replaced by grep -E -f for these methods.
Running the tests
Here is the script used for running all the tests. It uses the Bash time command to record the time spent for each script. Note that the time command returns three different times, called real, user, and sys. First I used user + sys, but I realized that this was incorrect when using the GNU parallel command, so the times reported below are now the real part returned by time. See this question for more information about the different times returned by time.
The first test was run with a file1.txt containing 5 lines and a file2.txt containing 1000000 lines. Here are the first 52 lines of the run_all.pl script; the rest of the script is available here.
run_all.pl
#! /usr/bin/env perl use feature qw(say); use strict; use warnings; use Cwd; use Getopt::Long; use Data::Printer; use FGB::Common; use List::Util qw(max shuffle); use Number::Bytes::Human qw(format_bytes); use Sys::Info; GetOptions ( "verbose" => \my $verbose, "check" => \my $check, "single-case=s" => \my $case, "expected=i" => \my $expected_no_lines, ) or die("Error in command line arguments\n"); my $test_dir = 'solutions'; my $output_file = 'out.txt'; my $wc_expected = $expected_no_lines; # expected number of output lines my $tests = get_test_names( $test_dir, $case ); my $file2_size = get_file2_size(); my $num_cpus = Sys::Info->new()->device( CPU => () )->count; chdir $test_dir; my $cmd = 'run.sh'; my @times; for my $case (@$tests) { my $savedir = getcwd(); chdir $case; say "Running '$case'.."; my $arg = get_cmd_args( $case, $file2_size, $num_cpus ); my $output = `bash -c "{ time -p $cmd $arg; } 2>&1"`; my ($user, $sys, $real ) = get_run_times( $output ); print_timings( $user, $sys, $real ) if $verbose; check_output_is_ok( $output_file, $wc_expected, $verbose, $check ); print "\n" if $verbose; push @times, $real; #push @times, $user + $sys; # this is wrong when using Gnu parallel chdir $savedir; } say "Done.\n"; print_summary( $tests, \@times );
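The helper subs called above (get_cmd_args, get_run_times, and so on) are part of the remainder of the script, which is not shown. A hypothetical minimal version of get_run_times, parsing the POSIX-format output of time -p, might look like this:

# Hypothetical reconstruction, not the actual helper from the script:
# `time -p` prints "real", "user" and "sys" lines in POSIX format.
sub get_run_times {
    my ( $output ) = @_;
    my ($real) = $output =~ /^real\s+(\S+)/m;
    my ($user) = $output =~ /^user\s+(\S+)/m;
    my ($sys)  = $output =~ /^sys\s+(\S+)/m;
    return ( $user, $sys, $real );
}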
Results
Here is the output from running the tests:
$ run_all.pl --verbose
Running 'inian3'..
..finished in 0.45 seconds ( user: 0.44, sys: 0.00 )
..no of output lines: 66711

Running 'inian2'..
..finished in 0.73 seconds ( user: 0.73, sys: 0.00 )
..no of output lines: 66711

Running 'Vasiliou'..
..finished in 0.09 seconds ( user: 0.08, sys: 0.00 )
..no of output lines: 66711

Running 'codeforester_orig'..
..finished in 0.05 seconds ( user: 0.05, sys: 0.00 )
..no of output lines: 66711

Running 'codeforester'..
..finished in 0.45 seconds ( user: 0.44, sys: 0.01 )
..no of output lines: 66711

[...]
Summary
[The results obtained by @Vasiliou are shown in the middle column.]
                               |Vasiliou
My Benchmark                   |Results  | Details
-------------------------------|---------|----------------------
inian4             : 0.04s     |0.22s    | LC_ALL fgrep -f [loose]
codeforester_orig  : 0.05s     |         | fgrep -f [loose]
Vasiliou2          : 0.06s     |0.16s    | [LC_ALL join [requires sorted files]]
BOC1               : 0.06s     |         | grep -E [loose]
BOC2               : 0.07s     |15s      | LC_ALL grep -E [loose]
BOC2B              : 0.07s     |         | LC_ALL grep -E [strict]
inian4B            : 0.08s     |         | LC_ALL grep -E -f [strict]
Vasiliou           : 0.08s     |0.23s    | [join [requires sorted files]]
gregory1B          : 0.08s     |         | [parallel + grep -E -f [strict]]
ikegami            : 0.1s      |         | grep -P
gregory1           : 0.11s     |0.5s     | [parallel + fgrep -f [loose]]
hakon1             : 0.14s     |         | [perl + c]
BOC1B              : 0.14s     |         | grep -E [strict]
jjoao              : 0.21s     |         | [compiled flex generated c code]
inian6             : 0.26s     |0.7s     | [LC_ALL awk + split+dict]
codeforester_origB : 0.28s     |         | grep -E -f [strict]
dawg               : 0.35s     |         | [python + split+dict]
inian3             : 0.44s     |1.1s     | [awk + split+dict]
zdim2              : 0.4s      |         | [perl + split+dict]
codeforester       : 0.45s     |         | [perl + split+dict]
oliv               : 0.5s      |         | [python + compiled regex + re.search()]
zdim               : 0.61s     |         | [perl + regexp+dict]
inian2             : 0.73s     |1.7s     | [awk + index($0,i)]
inian5             : 18.12s    |         | [LC_ALL awk + match($0,i) [loose]]
inian1             : 19.46s    |         | [awk + match($0,i) [loose]]
inian5B            : 42.27s    |         | [LC_ALL awk + match($0,i) [strict]]
inian1B            : 85.67s    |         | [awk + match($0,i) [strict]]

Vasiliou Results : 2 X CPU Intel 2 Duo T6570 @ 2.10GHz - 2Gb RAM - Debian Testing 64bit - kernel 4.9.0.1 - no cpu freq scaling.
A more realistic test case
I then created a more realistic case with file1.txt having 100 words and file2.txt having 10 million lines (268Mb file size). I extracted 1000 random words from the dictionary at /usr/share/dict/american-english using shuf -n1000 /usr/share/dict/american-english > words.txt, then extracted 100 of these words into file1.txt and constructed file2.txt in the same way as described for the first test case. Note that the dictionary file is UTF-8 encoded, and I stripped away all non-ASCII characters from words.txt.
I then ran the test without the three slowest methods from the previous case, i.e. inian1, inian2, and inian5 were left out. Here are the new results:
gregory1           : 0.86s  | [parallel + fgrep -f [loose]]
Vasiliou2          : 0.94s  | [LC_ALL join [requires sorted files]]
inian4B            : 1.12s  | LC_ALL grep -E -f [strict]
BOC2B              : 1.13s  | LC_ALL grep -E [strict]
BOC2               : 1.15s  | LC_ALL grep -E [loose]
BOC1               : 1.18s  | grep -E [loose]
ikegami            : 1.33s  | grep -P
Vasiliou           : 1.37s  | [join [requires sorted files]]
hakon1             : 1.44s  | [perl + c]
inian4             : 2.18s  | LC_ALL fgrep -f [loose]
codeforester_orig  : 2.2s   | fgrep -f [loose]
inian6             : 2.82s  | [LC_ALL awk + split+dict]
jjoao              : 3.09s  | [compiled flex generated c code]
dawg               : 3.54s  | [python + split+dict]
zdim2              : 4.21s  | [perl + split+dict]
codeforester       : 4.67s  | [perl + split+dict]
inian3             : 5.52s  | [awk + split+dict]
zdim               : 6.55s  | [perl + regexp+dict]
gregory1B          : 45.36s | [parallel + grep -E -f [strict]]
oliv               : 60.35s | [python + compiled regex + re.search()]
BOC1B              : 74.71s | grep -E [strict]
codeforester_origB : 75.52s | grep -E -f [strict]
Note
The grep-based solutions were looking for a match anywhere on the whole line, so in this case they contained some false matches: the methods codeforester_orig, BOC1, BOC2, gregory1, inian4, and oliv extracted 1,087,609 lines out of the 10,000,000, whereas the other methods extracted the correct 997,993 lines from file2.txt.
Notes

- I tested this on my Ubuntu 16.10 laptop (Intel Core i7-7500U CPU @ 2.70GHz)
- The whole benchmark study is available here.
Have you tried Awk to speed things up a bit:
awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
(or) using Awk's index() function, as suggested by Benjamin W. below:
awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (index($0,i)) {print; break}}' file1.txt FS='|' file2.txt
(or) a more direct regex match, as suggested by Ed Morton in the comments:
awk 'FNR==NR{hash[$1]; next}{for (i in hash) if ($0~i) {print; break}}' file1.txt FS='|' file2.txt
is all you need. I'm guessing this will be faster, but I'm not exactly sure for files with a million+ entries. The problem here is with the possibility of a match anywhere along the line. Had the match been in any particular column (e.g. $2 alone), a faster approach could be
awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt
You could also speed things up by playing with the locale set on your system. Paraphrasing from this wonderful answer by Stéphane Chazelas on the subject, you can speed things up quite a bit by setting and passing the locale LC_ALL=C to the command being run locally.
On any GNU-based system, these are the defaults for locale:
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
With the single variable LC_ALL, you can set all the LC_ variables at once to a specified locale:
$ LC_ALL=C locale
LANG=en_US.UTF-8
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=C
So what does this impact?
Simply put, when using locale C it defaults to the server's base Unix/Linux language of ASCII. Basically, when you grep something, by default your locale is internationalized and set to UTF-8, which can represent every character in the Unicode character set to help display any of the world's writing systems, currently more than 110,000 unique characters, whereas with ASCII each character is encoded in a single byte sequence and its character set comprises no more than 128 unique characters.
So when using grep on a file encoded in the UTF-8 character set, it needs to match each character against any of the hundred thousand unique characters, but just 128 in ASCII. So use your fgrep as:
LC_ALL=C fgrep -F -f file1.txt file2.txt
Also, the same can be adapted to Awk, since it uses a regex match with its match($0,i) call; setting the C locale could speed up the string match there as well.
LC_ALL=C awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
Assumptions: 1. You want to run this search on just your local workstation. 2. You have multiple cores/CPUs to take advantage of a parallel search.
parallel --pipepart -a file2.txt --block 10M fgrep -F -f file1.txt
Some further tweaks depending on the context: A. Disable NLS with LANG=C (this is mentioned already in another answer). B. Set a maximum number of matches with the -m flag.
Note: I'm guessing that file2 is ~4GB and the 10M block size is OK, but you may need to optimize the block size to get the fastest run.
This Perl script (a) generates a regex pattern:
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Assemble qw( );

chomp( my @ids = <> );
my $ra = Regexp::Assemble->new();
$ra->add(quotemeta($_)) for @ids;
print("^[^|]*\\|(?:" . (re::regexp_pattern($ra->re()))[0] . ")\\|");
Here's how it can be used:
$ LC_ALL=C grep -P "$( a file1.txt )" file2.txt
date1|foo1|number1
date2|foo2|number2
date1|bar1|number1
date2|bar2|number2
Note that the script uses Regexp::Assemble, so you may need to install it.
sudo su cpan Regexp::Assemble
Notes:

- Unlike the solutions named BOC1, BOC2, codeforester_orig, gregory1, inian2, inian4 and oliv, my solution correctly handles

  file1.txt
  foo1

  file2.txt
  date1|foo12|number5

- Mine should be better than the similar solution by @BOC because the pattern is optimized to reduce backtracking. (Mine also works if there are more than three fields in file2.txt, whereas the linked solution can fail.)
- I don't know how it compares to the split+dictionary solutions.
A small piece of Perl code solved the problem. This is the approach taken:

- store the lines of file1.txt in a hash
- read file2.txt line by line, parse and extract the second field
- check if the extracted field is in the hash; if so, print the line
Here is the code:
#!/usr/bin/perl -w

use strict;
if (scalar(@ARGV) != 2) {
  printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
  exit(2);
}

my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);

open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file)     || die "Can't open $big_file: "   . $!;

# store contents of small file in a hash
while (<$small_fp>) {
  chomp;
  $small_hash{$_} = undef;
}
close($small_fp);

# loop through big file and find matches
while (<$big_fp>) {
  # no need for chomp
  $field = (split(/\|/, $_))[1];
  if (defined($field) && exists($small_hash{$field})) {
    printf("%s", $_);
  }
}

close($big_fp);
exit(0);
I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time output for the same:
real    0m11.694s
user    0m11.507s
sys     0m0.174s
I ran @Inian's awk code:
awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
It was way slower than the Perl solution, since it loops 14K times for each line of file2.txt – which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took:
awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
 input record number 675280, file file2.txt
 source line number 1

real    55m5.539s
user    54m53.080s
sys     0m5.095s
Using @Inian's other awk solution, which eliminates the looping issue:
time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out

real    0m39.966s
user    0m37.916s
sys     0m0.743s

time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out

real    0m41.057s
user    0m38.475s
sys     0m0.904s
awk is very impressive here, given that we didn't have to write an entire program to do it.
I ran @oliv's Python code as well. It took about 15 hours to complete the job, and it looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output:

real    895m14.862s
user    806m59.219s
sys     1m12.147s
I tried to follow the suggestion to use parallel. However, it failed with an fgrep: memory exhausted error, even with very small block sizes.
What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours and it produced about 100K matches. I wish fgrep had an option to force the content of -f file to be kept in a hash, just like what the Perl code did.
I didn't check the join approach – I didn't want the additional overhead of sorting the files. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.
Thanks everyone for your attention and responses.
Here is a Python solution using sets — roughly equivalent to a Perl key only hash or awk array in concept.
#!/usr/bin/python

import sys

with open(sys.argv[1]) as f:
    tgt={e.rstrip() for e in f}

with open(sys.argv[2]) as f:
    for line in f:
        cells=line.split("|")
        if cells[1] in tgt:
            print line.rstrip()
When I run this on files of similar size, it runs in about 8 seconds.
Same speed as:
$ awk 'FNR==NR{arr[$1]; next} $2 in arr{print $0}' FS="|" /tmp/f1 /tmp/f2
Both the Python and awk solutions here do a full string match only; not a partial regex-style match.
Since the awk solution is fast and POSIX compliant, that is the better answer.
Here is a Perl solution that uses Inline::C to speed up the search for matching fields in the large file:
use strict;
use warnings;
use Inline C => './search.c';

my $smallfile = 'file1.txt';
my $bigfile   = 'file2.txt';

open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";
my %word = map { chomp; $_ => 1 } <$fh>;
search( $bigfile, \%word );
The search() subroutine is implemented in pure C using perlapi to look up keys in the small file dictionary %words:
search.c :
#include <stdio.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> #define BLOCK_SIZE 8192 /* how much to read from file each time */ static char read_buf[BLOCK_SIZE + 1]; /* reads a block from file, returns -1 on error, 0 on EOF, else returns chars read, pointer to buf, and pointer to end of buf */ size_t read_block( int fd, char **ret_buf, char **end_buf ) { int ret; char *buf = read_buf; size_t len = BLOCK_SIZE; while (len != 0 && (ret = read(fd, buf, len)) != 0) { if (ret == -1) { if (errno == EINTR) continue; perror( "read" ); return ret; } len -= ret; buf += ret; } *end_buf = buf; *ret_buf = read_buf; return (size_t) (*end_buf - *ret_buf); } /* updates the line buffer with the char pointed to by cur, also updates cur */ int update_line_buffer( char **cur, char **line, size_t *llen, size_t max_line_len ) { if ( *llen > max_line_len ) { fprintf( stderr, "Too long line. Maximimum allowed line length is %ld\n", max_line_len ); return 0; } **line = **cur; (*line)++; (*llen)++; (*cur)++; return 1; } /* search for first pipe on a line (or next line if this is empty), assume line ptr points to beginning of line buffer. return 1 on success Return 0 if pipe could not be found for some reason, or if line buffer length was exceeded */ int search_field_start( int fd, char **cur, char **end_buf, char **line, size_t *llen, size_t max_line_len ) { char *line_start = *line; while (1) { if ( *cur >= *end_buf ) { size_t res = read_block( fd, cur, end_buf ); if (res <= 0) return 0; } if ( **cur == '|' ) break; /* Currently we just ignore malformed lines ( lines that do not have a pipe, and empty lines in the input */ if ( **cur == '\n' ) { *line = line_start; *llen = 0; (*cur)++; } else { if (! update_line_buffer( cur, line, llen, max_line_len ) ) return 0; } } return 1; } /* assume cur points at starting pipe of field return -1 on read error, return 0 if field len was too large for buffer or line buffer length exceed, else return 1 and field, and length of field */ int copy_field( int fd, char **cur, char **end_buf, char *field, size_t *flen, char **line, size_t *llen, size_t max_field_len, size_t max_line_len ) { *flen = 0; while( 1 ) { if (! update_line_buffer( cur, line, llen, max_line_len ) ) return 0; if ( *cur >= *end_buf ) { size_t res = read_block( fd, cur, end_buf ); if (res <= 0) return -1; } if ( **cur == '|' ) break; if ( *flen > max_field_len ) { printf( "Field width too large. Maximum allowed field width: %ld\n", max_field_len ); return 0; } *field++ = **cur; (*flen)++; } /* It is really not necessary to null-terminate the field since we return length of field and also field could contain internal null characters as well */ //*field = '\0'; return 1; } /* search to beginning of next line, return 0 on error, else return 1 */ int search_eol( int fd, char **cur, char **end_buf, char **line, size_t *llen, size_t max_line_len) { while (1) { if ( *cur >= *end_buf ) { size_t res = read_block( fd, cur, end_buf ); if (res <= 0) return 0; } if ( !update_line_buffer( cur, line, llen, max_line_len ) ) return 0; if ( *(*cur-1) == '\n' ) { break; } } //**line = '\0'; // not necessary return 1; } #define MAX_FIELD_LEN 80 /* max number of characters allowed in a field */ #define MAX_LINE_LEN 80 /* max number of characters allowed on a line */ /* Get next field ( ie field #2 on a line). Fields are separated by pipes '|' in the input file. Also get the line of the field. Return 0 on error, on success: Move internal pointer to beginning of next line return 1 and the field. 
*/ size_t get_field_and_line_fast( int fd, char *field, size_t *flen, char *line, size_t *llen ) { static char *cur = NULL; static char *end_buf = NULL; size_t res; if (cur == NULL) { res = read_block( fd, &cur, &end_buf ); if ( res <= 0 ) return 0; } *llen = 0; if ( !search_field_start( fd, &cur, &end_buf, &line, llen, MAX_LINE_LEN )) return 0; if ( (res = copy_field( fd, &cur, &end_buf, field, flen, &line, llen, MAX_FIELD_LEN, MAX_LINE_LEN ) ) <= 0) return 0; if ( !search_eol( fd, &cur, &end_buf, &line, llen, MAX_LINE_LEN ) ) return 0; return 1; } void search( char *filename, SV *href) { if( !SvROK( href ) || ( SvTYPE( SvRV( href ) ) != SVt_PVHV ) ) { croak( "Not a hash reference" ); } int fd = open (filename, O_RDONLY); if (fd == -1) { croak( "Could not open file '%s'", filename ); } char field[MAX_FIELD_LEN+1]; char line[MAX_LINE_LEN+1]; size_t flen, llen; HV *hash = (HV *)SvRV( href ); while ( get_field_and_line_fast( fd, field, &flen, line, &llen ) ) { if( hv_exists( hash, field, flen ) ) fwrite( line, sizeof(char), llen, stdout); } if (close(fd) == -1) croak( "Close failed" ); }
Tests indicate that it is approximately 3 times faster than the fastest pure Perl solution (see method zdim2 in my other answer) presented here.
A possible way is to use python:
$ cat test.py
import sys,re

with open(sys.argv[1], "r") as f1:
    patterns = f1.read().splitlines() # read pattern from file1 without the trailing newline

m = re.compile("|".join(patterns))    # create the regex

with open(sys.argv[2], "r") as f2:
    for line in f2:
        if m.search(line) :
            print line,               # print line from file2 if this one matches the regex
and use it like this:
python test.py file1.txt file2.txt
Can you give a try to join? Files must be sorted though…
$ cat d.txt
bar1
bar2
foo1
foo2

$ cat e.txt
date1|bar1|number1
date2|bar2|number2
date3|bar3|number3
date1|foo1|number1
date2|foo2|number2
date3|foo3|number3

$ join --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 d.txt e.txt
date1|bar1|number1
date2|bar2|number2
date1|foo1|number1
date2|foo2|number2
Small Update:
By using LC_ALL=C in front of join, things really speed up, as can be seen in the benchmark of Håkon Hægland.
PS1: I have my doubts whether join can be faster than grep -f …
You could also use Perl to do this job:
Please note that this will hog memory, so your machine/server had better have some.
Sample Data:
%_STATION@gaurav * /root/ga/pl> head file1.txt file2.txt
==> file1.txt <==
foo1
foo2
...
bar1
bar2
...

==> file2.txt <==
date1|foo1|number1
date2|foo2|number2
date3|foo3|number3
...
date1|bar1|number1
date2|bar2|number2
date3|bar3|number3
%_STATION@gaurav * /root/ga/study/pl>
Script Output: The script will produce the final output in a file named output_comp.
%_STATION@gaurav * /root/ga/pl> ./comp.pl file1.txt file2.txt ; cat output_comp
date1|bar1|number1
date2|bar2|number2
date2|foo2|number2
date1|foo1|number1
%_STATION@gaurav * /root/ga/pl>
Script:
%_STATION@gaurav * /root/ga/pl> cat comp.pl
#!/usr/bin/perl
use strict ;
use warnings ;
use Data::Dumper ;

my ($file1,$file2) = @ARGV ;
my $output = "output_comp" ;
my %hash ;    # This will store main comparison data.
my %tmp ;     # This will store already selected results, to be skipped.
(scalar @ARGV != 2 ? (print "Need 2 files!\n") : ()) ? exit 1 : () ;

# Read all files at once and use their name as the key.
for (@ARGV) {
  open FH, "<$_" or die "Cannot open $_\n" ;
  while (my $line = <FH>) {chomp $line ;$hash{$_}{$line} = "$line"}
  close FH ;
}

# Now we churn through the data and compare to generate
# the sorted output in the output file.
open FH, ">>$output" or die "Cannot open outfile!\n" ;
foreach my $k1 (keys %{$hash{$file1}}){
  foreach my $k2 (keys %{$hash{$file2}}){
    if ($k1 =~ m/^.+?$k2.+?$/) {
      if (!defined $tmp{"$hash{$file2}{$k2}"}) {
        print FH "$hash{$file2}{$k2}\n" ;
        $tmp{"$hash{$file2}{$k2}"} = 1 ;
      }
    }
  }
}
close FH ;
%_STATION@gaurav * /root/ga/pl>
Thanks.
IMHO, grep is a good tool, highly optimised for a huge file2.txt, but maybe not for so many patterns to search for. I suggest combining all the strings of file1.txt into a single huge regexp like \|bar1|bar2|foo1|foo2\|
echo '\|'$(paste -s -d '|' file1.txt)'\|' > regexp1.txt

grep -E -f regexp1.txt file2.txt > file.matched
And of course LANG=C may help. Please give feedback or send your files so I can test myself.
I would use SQLite3 🙂 Maybe an in-memory database or whatever. Import the files and use an SQL query.
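For what it's worth, a hedged sketch of that idea in Perl, using DBI with DBD::SQLite (untested at this data size; the database file, table, and column names are made up): load both files, index the second field of the big table, and let a join do the lookups.

use strict;
use warnings;
use DBI;

# Sketch only: names are assumptions, not a tested setup for 2 billion rows.
my $dbh = DBI->connect( "dbi:SQLite:dbname=match.db", "", "",
                        { RaiseError => 1, AutoCommit => 0 } );

$dbh->do("CREATE TABLE small (f1 TEXT PRIMARY KEY)");
$dbh->do("CREATE TABLE big   (f1 TEXT, f2 TEXT, f3 TEXT)");

open my $fh, '<', 'file1.txt' or die $!;
my $ins_small = $dbh->prepare("INSERT OR IGNORE INTO small VALUES (?)");
while (<$fh>) { chomp; $ins_small->execute($_) }
close $fh;

open $fh, '<', 'file2.txt' or die $!;
my $ins_big = $dbh->prepare("INSERT INTO big VALUES (?,?,?)");
while (<$fh>) { chomp; $ins_big->execute( (split /\|/)[0, 1, 2] ) }
close $fh;

$dbh->do("CREATE INDEX idx_big_f2 ON big(f2)");
$dbh->commit;

# Print the matching lines of file2.txt
my $sth = $dbh->prepare(
    "SELECT b.f1 || '|' || b.f2 || '|' || b.f3
       FROM big b JOIN small s ON s.f1 = b.f2"
);
$sth->execute;
while ( my $row = $sth->fetchrow_arrayref ) {
    print "$row->[0]\n";
}

Using dbname=:memory: instead of a file would give the in-memory variant mentioned above, at the cost of holding everything in RAM.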
Using flex :
1: build the flex processor:
$ awk 'NR==1{ printf "%%%%\n\n.*\\|(%s",$0 } { printf "|%s",$0 } END { print ")\\|.*\\n ECHO;\n.*\\n ;\n%%\n" }' file1.txt > a.fl
2: compile it
$ flex -Ca -F a.fl ; cc -O lex.yy.c -lfl
3: and run
$ a.out < file2.txt > out
Compiling (cc …) is a slow process; this approach will pay off only for cases of a stable file1.txt.
(On my machine) Running a "search 100 in 10_000_000" test, this approach was 3 times faster than LC_ALL=C fgrep…
Though this thread is over, all grep-alike methods between two files are gathered in this post, so why not add this awk alternative, similar to (or even an improvement on) the bounty-winning awk solution by Inian:
awk 'NR==FNR{a[$0]=1;next}a[$2]' patterns.txt FS="|" datafile.txt >matches.txt # For matches restricted on Field2 of datafile
This is equivalent to Inian's awk $2 in hash solution, but it could be even faster due to the fact that we don't ask awk to check whether the whole hash array contains $2 of file2 – we just check whether a[$2] has a value or not.
While reading the first patterns file, apart from creating the hash array, we also assign a value.
If $2 of the datafile had been found before in the patterns file, then a[$2] has a value and the line will be printed because it is not null.
If a[$2] of the datafile returns no value (null), this is translated to false => no printing.
Extension to match any of the three fields of datafile:
awk 'NR==FNR{a[$0]=1;next}(a[$1] || a[$2] || a[$3])' patterns.txt FS="|" datafile.txt >matches.txt  # Printed if any of the three fields of datafile match the pattern.
In both cases, applying LC_ALL=C in front of awk seems to speed things up.
PS1: Of course this solution also has the pitfalls of all awk solutions. It is not pattern matching; it is direct/fixed matching between the two files, like most of the solutions in here.
PS2: In my poor-machine benchmark using the small benchmark files of Håkon Hægland, I get about 20% better performance compared to awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt
Setting the language etc. helps a little, perhaps.
Otherwise I cannot think of a magic solution to escape your basic issue: the data is not structured, so you will have a search that comes down to the number of lines in file1 multiplied by the number of lines in file2.
Putting the billion lines in a database and indexing it in a smart way is the only speedup I can think of. That index would have to be very smart, though…
The simple solution is: have enough memory to fit everything in. Otherwise there is not much more you can do about this…