Fastest way to find lines of one text file in another, much larger text file in Bash
I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion. file1.txt has a single field f1 per line, while file2.txt has 3 fields, f1 through f3, delimited by |.
I want to find all lines from file2.txt where f1 of file1.txt matches f2 of file2.txt (or anywhere on the line, if we don't want to spend the extra time splitting the values of file2.txt).
file1.txt (about 14K lines, not sorted):
foo1
foo2
...
bar1
bar2
...
file2.txt (about 2 billion lines, not sorted):
date1|foo1|number1
date2|foo2|number2
date3|foo3|number3
...
date1|bar1|number1
date2|bar2|number2
date3|bar3|number3
Expected output:
date1|foo1|number1
date2|foo2|number2
date1|bar1|number1
date2|bar2|number2
...
Here is what I have tried so far, and it seems to take several hours to run:
fgrep -F -f file1.txt file2.txt > file.matched
I'd like to know if there is a better and faster way to do this with common Unix commands or with a small script.
A Perl solution. [See the Update and the Note below.]
Use a hash for the first file's lines. As you read the big file line by line, extract the field either with a regex (capture the first pattern between the |s) or with split (get the second word), and print the line if it exists in the hash. They likely differ a bit in speed (time them). The defined check isn't needed with the regex, while for split use // (defined-or), which short-circuits.
use warnings;
use strict;

# If 'prog smallfile bigfile' is the preferred use
die "Usage: $0 smallfile bigfile" if @ARGV != 2;
my ($smallfile, $bigfile) = @ARGV;

open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";
my %word = map { chomp; $_ => 1 } <$fh>;

open $fh, '<', $bigfile or die "Can't open $bigfile: $!";
while (<$fh>)
{
    exists $word{ (/\|([^|]+)/)[0] } && print;

    # Or
    #exists $word{ (split /\|/)[1] // '' } && print;
}
close $fh;
Avoiding the if branch and using short-circuiting is faster, but only by very little. Over billions of lines these tweaks add up, but not to much. It may (or may not) be a touch faster to read the small file line by line instead of in list context as above, but this should not be noticeable.
Update   Writing to STDOUT saves two operations, and I repeatedly time it to be a little faster than writing to a file. Such usage is also consistent with most UNIX tools, so I changed to writing to STDOUT. Next, the exists test is not needed and dropping it spares an operation. However, I consistently get slightly better runtimes with it, and it also conveys the purpose better. Altogether, I am leaving it in. Thanks to ikegami for comments.
Note   The commented-out version is about 50% faster than the other one, by my benchmark. These are both given because they differ: one finds the first match and the other the second field. I keep the more general one as the default, since the question is ambiguous about that.
Some comparisons (benchmarks) [updated for writing to STDOUT, see the 'Update' above]
There is an extensive analysis in the answer by Håkon Hægland, with a single run of most of the solutions. Here is a different take, benchmarking the two solutions above, the OP's own answer, and the posted fgrep one, which is expected to be fast and is used in the question and in many answers.
I build the test data in the following way. A handful of lines of roughly the lengths shown above are made with random words, for both files, so that they match in the second field. Then I pad this 'seed' with data samples of lines that won't match, to mimic the ratio between sizes and matches quoted by the OP: for 14K lines in the small file there are 1.3M lines in the big file, yielding 126K matches. These samples are then written out repeatedly to build full data files of the OP's sizes, shuffled each time using List::Util.
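For illustration, a rough sketch of that kind of generator is shown below. This is not the script actually used; the word lengths, file names, and proportions are assumptions chosen to approximate the ratios described above.

use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical sketch only: 14K random 6-letter "words" for the small file
my @words = map { join '', map { ('a'..'z')[rand 26] } 1 .. 6 } 1 .. 14_000;

open my $fh, '>', 'file1.txt' or die $!;
print $fh "$_\n" for @words;
close $fh;

# Seed for the big file: matching lines padded with non-matching fillers,
# aiming at roughly 126K matches per 1.3M lines
my @seed;
push @seed, "date|$_|number"          for @words[0 .. 12_599];
push @seed, "date|filler${_}x|number" for 1 .. 117_400;

open $fh, '>', 'file2.txt' or die $!;
for my $copy (1 .. 10) {
    print $fh "$_\n" for shuffle @seed;   # write the shuffled seed repeatedly
}
close $fh;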
All runs compared below produce 106_120 matches for the above file sizes (checked with diff), so the matching frequency is close enough. They are benchmarked by calling the complete programs using my $res = timethese(60 ...); the results of cmpthese($res) are:
        Rate  regex  cfor  split  fgrep
regex  1.05/s    --  -23%   -35%   -44%
cfor   1.36/s   30%    --   -16%   -28%
split  1.62/s   54%   19%     --   -14%
fgrep  1.89/s   80%   39%    17%     --
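For reference, a harness along those lines — timethese over the complete programs, followed by cmpthese — might look roughly like the sketch below. The script names and shell invocations are placeholders, not the code actually used.

use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Hypothetical harness: each sub runs one complete program over the same
# generated data files (names here are made up for illustration).
my $res = timethese(60, {
    'regex' => sub { system('./match_regex.pl file1.txt file2.txt > /dev/null') },
    'split' => sub { system('./match_split.pl file1.txt file2.txt > /dev/null') },
    'cfor'  => sub { system('./match_cfor.pl  file1.txt file2.txt > /dev/null') },
    'fgrep' => sub { system('fgrep -f file1.txt file2.txt > /dev/null') },
});
cmpthese($res);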
That the optimized C program fgrep comes out on top is not surprising. The lag of 'regex' behind 'split' is likely due to the overhead of starting the engine for a small match, many times over. This may vary across Perl versions, given the evolving regex engine optimizations. I include the answer of @codeforester ('cfor') since it was claimed to be the fastest; its 20% lag behind the very similar 'split' is likely due to scattered small inefficiencies (see the comments below that answer).
None of this is drastically different, though there are certainly differences across hardware and software and in the details of the data. I ran this on different Perls and machines, and the notable difference is that in some cases fgrep was indeed an order of magnitude faster.
The OP's experience of fgrep being very slow is surprising. Given their quoted run times, orders of magnitude slower than the above, I'd guess that there is an old system to 'blame'.
Even though this is entirely I/O bound, there are concurrency benefits to be had from putting it on multiple cores, and I'd expect a decent speedup, up to a factor of a few.
I have attempted to do a comparison of some of the methods presented here.
First, I created a Perl script to generate the input files file1.txt and file2.txt. In order to compare some of the solutions, I made sure that the words from file1.txt could only appear in the second field of file2.txt. Also, to be able to use the join solution suggested by @GeorgeVasiliou, I sorted file1.txt and file2.txt. Currently, I generate the input files based on only 75 random words (taken from https://www.randomlists.com/random-words). Only 5 of these 75 words are used in file1.txt; the remaining seventy words are used to fill up the fields of file2.txt. It might be necessary to increase the number of words substantially to get realistic results (according to the OP, the original file1.txt contained 14000 words). In the tests below I used a file2.txt with 1000000 (1 million) lines. The script also generates the file regexp1.txt required by the grep solution of @BOC.
gen_input_files.pl :
#! /usr/bin/env perl use feature qw(say); use strict; use warnings; use Data::Printer; use Getopt::Long; GetOptions ("num_lines=i" => \my $nlines ) or die("Error in command line arguments\n"); # Generated random words from site: https://www.randomlists.com/random-words my $word_filename = 'words.txt'; # 75 random words my $num_match_words = 5; my $num_file2_lines = $nlines || 1_000_000; my $file2_words_per_line = 3; my $file2_match_field_no = 2; my $file1_filename = 'file1.txt'; my $file2_filename = 'file2.txt'; my $file1_regex_fn = 'regexp1.txt'; say "generating $num_file2_lines lines.."; my ( $words1, $words2 ) = get_words( $word_filename, $num_match_words ); write_file1( $file1_filename, $words2 ); write_file2( $file2_filename, $words1, $words2, $num_file2_lines, $file2_words_per_line, $file2_match_field_no ); write_BOC_regexp_file( $file1_regex_fn, $words2 ); sub write_BOC_regexp_file { my ( $fn, $words ) = @_; open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!"; print $fh '\\|' . (join "|", @$words) . '\\|'; close $fh; } sub write_file2 { my ( $fn, $words1, $words2, $nlines, $words_per_line, $field_no ) = @_; my $nwords1 = scalar @$words1; my $nwords2 = scalar @$words2; my @lines; for (1..$nlines) { my @words_line; my $key; for (1..$words_per_line) { my $word; if ( $_ != $field_no ) { my $index = int (rand $nwords1); $word = @{ $words1 }[$index]; } else { my $index = int (rand($nwords1 + $nwords2) ); if ( $index < $nwords2 ) { $word = @{ $words2 }[$index]; } else { $word = @{ $words1 }[$index - $nwords2]; } $key = $word; } push @words_line, $word; } push @lines, [$key, (join "|", @words_line)]; } @lines = map { $_->[1] } sort { $a->[0] cmp $b->[0] } @lines; open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!"; print $fh (join "\n", @lines); close $fh; } sub write_file1 { my ( $fn, $words ) = @_; open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!"; print $fh (join "\n", sort @$words); close $fh; } sub get_words { my ( $fn, $N ) = @_; open( my $fh, '<', $fn ) or die "Could not open file '$fn': $!"; my @words = map {chomp $_; $_} <$fh>; close $fh; my @words1 = @words[$N..$#words]; my @words2 = @words[0..($N - 1)]; return ( \@words1, \@words2 ); }
Next, I created a sub folder solutions with all the test cases:
$ tree solutions/
solutions/
├── BOC1
│   ├── out.txt
│   └── run.sh
├── BOC2
│   ├── out.txt
│   └── run.sh
├── codeforester
│   ├── out.txt
│   ├── run.pl
│   └── run.sh
[...]
Here the file out.txt is the output from the greps for each solution. The script run.sh runs the solution for the given test case.
Notes on the different solutions

- BOC1: first solution suggested by @BOC: grep -E -f regexp1.txt file2.txt
- BOC2: second solution suggested by @BOC: LC_ALL=C grep -E -f regexp1.txt file2.txt
- codeforester: the accepted Perl solution by @codeforester (see source)
- codeforester_orig: the original solution by @codeforester: fgrep -f file1.txt file2.txt
- dawg: Python solution using a dictionary and split lines, provided by @dawg (see source)
- gregory1: solution using GNU Parallel suggested by @gregory: parallel -k --pipepart -a file2.txt --block "$block_size" fgrep -F -f file1.txt — see the notes below on how to choose $block_size.
- hakon1: Perl solution provided by @HåkonHægland (see source). This solution requires compiling the C extension the first time the code is run. It does not require recompilation when file1.txt or file2.txt change. Note: the time used to compile the C extension at the initial run is not included in the run times presented below.
- ikegami: solution using an assembled regexp together with grep -P, as given by @ikegami. Note: the assembled regexp was written to a separate file regexp_ikegami.txt, so the runtime of generating the regexp is not included in the comparison below. This is the code used:
  regexp=$(< "regexp_ikegami.txt")
  grep -P "$regexp" file2.txt
- inian1: first solution by @Inian using match(): awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (match($0,i)) {print; break} }' file1.txt FS='|' file2.txt
- inian2: second solution by @Inian using index(): awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (index($0,i)) {print; break} }' file1.txt FS='|' file2.txt
- inian3: third solution by @Inian checking only the $2 field: awk 'FNR==NR{ hash[$1]; next } $2 in hash' file1.txt FS='|' file2.txt
- inian4: fourth solution by @Inian (basically the same as codeforester_orig but with LC_ALL): LC_ALL=C fgrep -f file1.txt file2.txt
- inian5: fifth solution by @Inian (same as inian1 but with LC_ALL): LC_ALL=C awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (match($0,i)) {print; break} }' file1.txt FS='|' file2.txt
- inian6: same as inian3 but with LC_ALL=C. Thanks to @GeorgeVasiliou for the suggestion.
- jjoao: compiled flex-generated C code provided by @JJoao (see source). Note: the executable must be recompiled each time file1.txt changes. The time used to compile the executable is not included in the run times presented below.
- oliv: Python script provided by @oliv (see source)
- Vasiliou: using join as suggested by @GeorgeVasiliou: join --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 file1.txt file2.txt
- Vasiliou2: same as Vasiliou, but with LC_ALL=C
- zdim: Perl script provided by @zdim (see source). Note: this uses the regexp search version (rather than the split-line solution).
- zdim2: same as zdim, except that it uses split instead of a regexp search for the field in file2.txt.
Notes

- I experimented a bit with GNU Parallel (see the gregory1 solution above) to determine the optimal block size for my CPU. I have 4 cores, and currently it seems that the optimal choice is to divide the file (file2.txt) into 4 equal-sized chunks and run a single job on each of the 4 processors. More testing may be needed here. So for the first test case, where file2.txt is 20M, I set $block_size to 5M (see the gregory1 solution above), while for the more realistic case presented below, where file2.txt is 268M, a $block_size of 67M was used.

- The solutions BOC1, BOC2, codeforester_orig, inian1, inian4, inian5, and gregory1 all used loose matching. This means that the words from file1.txt did not have to match exactly in field #2 of file2.txt; a match anywhere on the line was accepted. Since this behavior made it harder to compare them with the other methods, some modified methods were also introduced. The first two, called BOC1B and BOC2B, used a modified regexp1.txt file. The lines of the original regexp1.txt were of the form \|foo1|foo2|...|fooN\|, which would match the words at any field boundary. The modified file, regexp1b.txt, instead anchors the match to field #2 exclusively, using the form ^[^|]*\|foo1|foo2|...|fooN\|.

  The remaining modified methods, codeforester_origB, inian1B, inian4B, inian5B, and gregory1B, used a modified file1.txt. Instead of a literal word per line, the modified file file1b.txt used one regex per line, of the form:

  ^[^|]*\|word1\|
  ^[^|]*\|word2\|
  ^[^|]*\|word3\|
  [...]

  In addition, fgrep -f was replaced by grep -E -f for these methods.
Running the tests
Here is the script used for running all the tests. It uses the Bash time command to record the time spent for each script. Note that the time command returns three different times, called real, user, and sys. First I used user + sys, but I realized that this was incorrect when using the GNU parallel command, so the times reported below are now the real part returned by time. See this question for more information about the different times returned by time.
The first test was run with a file1.txt containing 5 lines and a file2.txt containing 1000000 lines. Here are the first 52 lines of the run_all.pl script; the rest of the script is available here.
run_all.pl
#! /usr/bin/env perl use feature qw(say); use strict; use warnings; use Cwd; use Getopt::Long; use Data::Printer; use FGB::Common; use List::Util qw(max shuffle); use Number::Bytes::Human qw(format_bytes); use Sys::Info; GetOptions ( "verbose" => \my $verbose, "check" => \my $check, "single-case=s" => \my $case, "expected=i" => \my $expected_no_lines, ) or die("Error in command line arguments\n"); my $test_dir = 'solutions'; my $output_file = 'out.txt'; my $wc_expected = $expected_no_lines; # expected number of output lines my $tests = get_test_names( $test_dir, $case ); my $file2_size = get_file2_size(); my $num_cpus = Sys::Info->new()->device( CPU => () )->count; chdir $test_dir; my $cmd = 'run.sh'; my @times; for my $case (@$tests) { my $savedir = getcwd(); chdir $case; say "Running '$case'.."; my $arg = get_cmd_args( $case, $file2_size, $num_cpus ); my $output = `bash -c "{ time -p $cmd $arg; } 2>&1"`; my ($user, $sys, $real ) = get_run_times( $output ); print_timings( $user, $sys, $real ) if $verbose; check_output_is_ok( $output_file, $wc_expected, $verbose, $check ); print "\n" if $verbose; push @times, $real; #push @times, $user + $sys; # this is wrong when using Gnu parallel chdir $savedir; } say "Done.\n"; print_summary( $tests, \@times );
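The helper subs called above (get_cmd_args, get_run_times, and so on) are part of the remainder of the script, which is not shown. A hypothetical minimal version of get_run_times, parsing the POSIX-format output of time -p, might look like this:

# Hypothetical reconstruction, not the actual helper from the script:
# `time -p` prints "real", "user" and "sys" lines in POSIX format.
sub get_run_times {
    my ( $output ) = @_;
    my ($real) = $output =~ /^real\s+(\S+)/m;
    my ($user) = $output =~ /^user\s+(\S+)/m;
    my ($sys)  = $output =~ /^sys\s+(\S+)/m;
    return ( $user, $sys, $real );
}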
Results
Here is the output from running the tests:
$ run_all.pl --verbose
Running 'inian3'..
..finished in 0.45 seconds ( user: 0.44, sys: 0.00 )
..no of output lines: 66711

Running 'inian2'..
..finished in 0.73 seconds ( user: 0.73, sys: 0.00 )
..no of output lines: 66711

Running 'Vasiliou'..
..finished in 0.09 seconds ( user: 0.08, sys: 0.00 )
..no of output lines: 66711

Running 'codeforester_orig'..
..finished in 0.05 seconds ( user: 0.05, sys: 0.00 )
..no of output lines: 66711

Running 'codeforester'..
..finished in 0.45 seconds ( user: 0.44, sys: 0.01 )
..no of output lines: 66711

[...]
Summary
[The results obtained by @Vasiliou are shown in the middle column.]
                               |Vasiliou
My Benchmark                   |Results  | Details
-------------------------------|---------|----------------------
inian4             : 0.04s     |0.22s    | LC_ALL fgrep -f [loose]
codeforester_orig  : 0.05s     |         | fgrep -f [loose]
Vasiliou2          : 0.06s     |0.16s    | [LC_ALL join [requires sorted files]]
BOC1               : 0.06s     |         | grep -E [loose]
BOC2               : 0.07s     |15s      | LC_ALL grep -E [loose]
BOC2B              : 0.07s     |         | LC_ALL grep -E [strict]
inian4B            : 0.08s     |         | LC_ALL grep -E -f [strict]
Vasiliou           : 0.08s     |0.23s    | [join [requires sorted files]]
gregory1B          : 0.08s     |         | [parallel + grep -E -f [strict]]
ikegami            : 0.1s      |         | grep -P
gregory1           : 0.11s     |0.5s     | [parallel + fgrep -f [loose]]
hakon1             : 0.14s     |         | [perl + c]
BOC1B              : 0.14s     |         | grep -E [strict]
jjoao              : 0.21s     |         | [compiled flex generated c code]
inian6             : 0.26s     |0.7s     | [LC_ALL awk + split+dict]
codeforester_origB : 0.28s     |         | grep -E -f [strict]
dawg               : 0.35s     |         | [python + split+dict]
inian3             : 0.44s     |1.1s     | [awk + split+dict]
zdim2              : 0.4s      |         | [perl + split+dict]
codeforester       : 0.45s     |         | [perl + split+dict]
oliv               : 0.5s      |         | [python + compiled regex + re.search()]
zdim               : 0.61s     |         | [perl + regexp+dict]
inian2             : 0.73s     |1.7s     | [awk + index($0,i)]
inian5             : 18.12s    |         | [LC_ALL awk + match($0,i) [loose]]
inian1             : 19.46s    |         | [awk + match($0,i) [loose]]
inian5B            : 42.27s    |         | [LC_ALL awk + match($0,i) [strict]]
inian1B            : 85.67s    |         | [awk + match($0,i) [strict]]

Vasiliou Results : 2 X CPU Intel 2 Duo T6570 @ 2.10GHz - 2Gb RAM - Debian Testing 64bit - kernel 4.9.0.1 - no cpu freq scaling.
A more realistic test case
I then created a more realistic case with file1.txt having 100 words and file2.txt having 10 million lines (268Mb file size). I extracted 1000 random words from the dictionary at /usr/share/dict/american-english using shuf -n1000 /usr/share/dict/american-english > words.txt, then extracted 100 of these words into file1.txt and constructed file2.txt in the same way as described for the first test case. Note that the dictionary file is UTF-8 encoded, and I stripped away all non-ASCII characters from words.txt.
I then ran the test without the three slowest methods from the previous case, i.e. inian1, inian2, and inian5 were left out. Here are the new results:
gregory1           : 0.86s  | [parallel + fgrep -f [loose]]
Vasiliou2          : 0.94s  | [LC_ALL join [requires sorted files]]
inian4B            : 1.12s  | LC_ALL grep -E -f [strict]
BOC2B              : 1.13s  | LC_ALL grep -E [strict]
BOC2               : 1.15s  | LC_ALL grep -E [loose]
BOC1               : 1.18s  | grep -E [loose]
ikegami            : 1.33s  | grep -P
Vasiliou           : 1.37s  | [join [requires sorted files]]
hakon1             : 1.44s  | [perl + c]
inian4             : 2.18s  | LC_ALL fgrep -f [loose]
codeforester_orig  : 2.2s   | fgrep -f [loose]
inian6             : 2.82s  | [LC_ALL awk + split+dict]
jjoao              : 3.09s  | [compiled flex generated c code]
dawg               : 3.54s  | [python + split+dict]
zdim2              : 4.21s  | [perl + split+dict]
codeforester       : 4.67s  | [perl + split+dict]
inian3             : 5.52s  | [awk + split+dict]
zdim               : 6.55s  | [perl + regexp+dict]
gregory1B          : 45.36s | [parallel + grep -E -f [strict]]
oliv               : 60.35s | [python + compiled regex + re.search()]
BOC1B              : 74.71s | grep -E [strict]
codeforester_origB : 75.52s | grep -E -f [strict]
Note
The grep-based solutions were looking for a match anywhere on the whole line, so in this case they contained some false matches: the methods codeforester_orig, BOC1, BOC2, gregory1, inian4, and oliv extracted 1,087,609 lines out of the 10,000,000, whereas the other methods extracted the correct 997,993 lines from file2.txt.
Notes

- I tested this on my Ubuntu 16.10 laptop (Intel Core i7-7500U CPU @ 2.70GHz)
- The whole benchmark study is available here.
Have you tried Awk to speed things up a bit:
awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
(or) using Awk's index() function, as suggested by Benjamin W. below:
awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (index($0,i)) {print; break}}' file1.txt FS='|' file2.txt
(or) a more direct regex match, as suggested by Ed Morton in the comments:
awk 'FNR==NR{hash[$1]; next}{for (i in hash) if ($0~i) {print; break}}' file1.txt FS='|' file2.txt
is all you need. I'm guessing this will be faster, but I'm not exactly sure for files with a million+ entries. The problem here is with the possibility of a match anywhere along the line. Had the match been in any particular column (e.g. $2 alone), a faster approach could be
awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt
You could also speed things up by playing with the locale set on your system. Paraphrasing from this wonderful answer by Stéphane Chazelas on the subject, you can speed things up quite a bit by setting and passing the locale LC_ALL=C to the command being run locally.
On any GNU-based system, these are the defaults for locale:
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
With the single variable LC_ALL, you can set all the LC_ variables at once to a specified locale:
$ LC_ALL=C locale
LANG=en_US.UTF-8
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=C
So what does this impact?
Simply put, when using locale C it defaults to the server's base Unix/Linux language of ASCII. Basically, when you grep something, by default your locale is internationalized and set to UTF-8, which can represent every character in the Unicode character set to help display any of the world's writing systems, currently more than 110,000 unique characters, whereas with ASCII each character is encoded in a single byte sequence and its character set comprises no more than 128 unique characters.
So when using grep on a file encoded in the UTF-8 character set, it needs to match each character against any of the hundred thousand unique characters, but just 128 in ASCII. So use your fgrep as:
LC_ALL=C fgrep -F -f file1.txt file2.txt
Also, the same can be adapted to Awk, since it uses a regex match with its match($0,i) call; setting the C locale could speed up the string match there as well.
LC_ALL=C awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
Assumptions: 1. You want to run this search on just your local workstation. 2. You have multiple cores/CPUs to take advantage of a parallel search.
parallel --pipepart -a file2.txt --block 10M fgrep -F -f file1.txt
Some further tweaks depending on the context: A. Disable NLS with LANG=C (this is mentioned already in another answer). B. Set a maximum number of matches with the -m flag.
Note: I'm guessing that file2 is ~4GB and the 10M block size is OK, but you may need to optimize the block size to get the fastest run.
This Perl script (a) generates a regex pattern:
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Assemble qw( );

chomp( my @ids = <> );
my $ra = Regexp::Assemble->new();
$ra->add(quotemeta($_)) for @ids;
print("^[^|]*\\|(?:" . (re::regexp_pattern($ra->re()))[0] . ")\\|");
Here's how it can be used:
$ LC_ALL=C grep -P "$( a file1.txt )" file2.txt
date1|foo1|number1
date2|foo2|number2
date1|bar1|number1
date2|bar2|number2
Note that the script uses Regexp::Assemble, so you may need to install it.
sudo su cpan Regexp::Assemble
Notes:

- Unlike the solutions named BOC1, BOC2, codeforester_orig, gregory1, inian2, inian4 and oliv, my solution correctly handles

  file1.txt
  foo1

  file2.txt
  date1|foo12|number5

- Mine should be better than the similar solution by @BOC because the pattern is optimized to reduce backtracking. (Mine also works if there are more than three fields in file2.txt, whereas the linked solution can fail.)
- I don't know how it compares to the split+dictionary solutions.
A small piece of Perl code solved the problem. This is the approach taken:

- store the lines of file1.txt in a hash
- read file2.txt line by line, parse and extract the second field
- check if the extracted field is in the hash; if so, print the line
Here is the code:
#!/usr/bin/perl -w

use strict;
if (scalar(@ARGV) != 2) {
  printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
  exit(2);
}

my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);

open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file)     || die "Can't open $big_file: "   . $!;

# store contents of small file in a hash
while (<$small_fp>) {
  chomp;
  $small_hash{$_} = undef;
}
close($small_fp);

# loop through big file and find matches
while (<$big_fp>) {
  # no need for chomp
  $field = (split(/\|/, $_))[1];
  if (defined($field) && exists($small_hash{$field})) {
    printf("%s", $_);
  }
}

close($big_fp);
exit(0);
I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time output for the same:
real    0m11.694s
user    0m11.507s
sys     0m0.174s
I ran @Inian's awk code:
awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
It was way slower than the Perl solution, since it loops 14K times for each line of file2.txt – which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took:
awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
 input record number 675280, file file2.txt
 source line number 1

real    55m5.539s
user    54m53.080s
sys     0m5.095s
Using @Inian's other awk solution, which eliminates the looping issue:
time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out

real    0m39.966s
user    0m37.916s
sys     0m0.743s

time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out

real    0m41.057s
user    0m38.475s
sys     0m0.904s
awk is very impressive here, given that we didn't have to write an entire program to do it.
I ran @oliv's Python code as well. It took about 15 hours to complete the job, and it looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output:

real    895m14.862s
user    806m59.219s
sys     1m12.147s
I tried to follow the suggestion to use parallel. However, it failed with an fgrep: memory exhausted error, even with very small block sizes.
What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours and it produced about 100K matches. I wish fgrep had an option to force the content of -f file to be kept in a hash, just like what the Perl code did.
I didn't check the join approach – I didn't want the additional overhead of sorting the files. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.
Thanks everyone for your attention and responses.
Here is a Python solution using sets — roughly equivalent to a Perl key only hash or awk array in concept.
#!/usr/bin/python

import sys

with open(sys.argv[1]) as f:
    tgt={e.rstrip() for e in f}

with open(sys.argv[2]) as f:
    for line in f:
        cells=line.split("|")
        if cells[1] in tgt:
            print line.rstrip()
When I run this on files of similar size, it runs in about 8 seconds.
Same speed as:
$ awk 'FNR==NR{arr[$1]; next} $2 in arr{print $0}' FS="|" /tmp/f1 /tmp/f2
Both the Python and awk solutions here do a full string match only; not a partial regex-style match.
Since the awk solution is fast and POSIX compliant, that is the better answer.
Here is a Perl solution that uses Inline::C to speed up the search for matching fields in the large file:
use strict;
use warnings;
use Inline C => './search.c';

my $smallfile = 'file1.txt';
my $bigfile   = 'file2.txt';

open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";
my %word = map { chomp; $_ => 1 } <$fh>;
search( $bigfile, \%word );
The search() subroutine is implemented in pure C using perlapi to look up keys in the small file dictionary %words:
search.c :
#include <stdio.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> #define BLOCK_SIZE 8192 /* how much to read from file each time */ static char read_buf[BLOCK_SIZE + 1]; /* reads a block from file, returns -1 on error, 0 on EOF, else returns chars read, pointer to buf, and pointer to end of buf */ size_t read_block( int fd, char **ret_buf, char **end_buf ) { int ret; char *buf = read_buf; size_t len = BLOCK_SIZE; while (len != 0 && (ret = read(fd, buf, len)) != 0) { if (ret == -1) { if (errno == EINTR) continue; perror( "read" ); return ret; } len -= ret; buf += ret; } *end_buf = buf; *ret_buf = read_buf; return (size_t) (*end_buf - *ret_buf); } /* updates the line buffer with the char pointed to by cur, also updates cur */ int update_line_buffer( char **cur, char **line, size_t *llen, size_t max_line_len ) { if ( *llen > max_line_len ) { fprintf( stderr, "Too long line. Maximimum allowed line length is %ld\n", max_line_len ); return 0; } **line = **cur; (*line)++; (*llen)++; (*cur)++; return 1; } /* search for first pipe on a line (or next line if this is empty), assume line ptr points to beginning of line buffer. return 1 on success Return 0 if pipe could not be found for some reason, or if line buffer length was exceeded */ int search_field_start( int fd, char **cur, char **end_buf, char **line, size_t *llen, size_t max_line_len ) { char *line_start = *line; while (1) { if ( *cur >= *end_buf ) { size_t res = read_block( fd, cur, end_buf ); if (res <= 0) return 0; } if ( **cur == '|' ) break; /* Currently we just ignore malformed lines ( lines that do not have a pipe, and empty lines in the input */ if ( **cur == '\n' ) { *line = line_start; *llen = 0; (*cur)++; } else { if (! update_line_buffer( cur, line, llen, max_line_len ) ) return 0; } } return 1; } /* assume cur points at starting pipe of field return -1 on read error, return 0 if field len was too large for buffer or line buffer length exceed, else return 1 and field, and length of field */ int copy_field( int fd, char **cur, char **end_buf, char *field, size_t *flen, char **line, size_t *llen, size_t max_field_len, size_t max_line_len ) { *flen = 0; while( 1 ) { if (! update_line_buffer( cur, line, llen, max_line_len ) ) return 0; if ( *cur >= *end_buf ) { size_t res = read_block( fd, cur, end_buf ); if (res <= 0) return -1; } if ( **cur == '|' ) break; if ( *flen > max_field_len ) { printf( "Field width too large. Maximum allowed field width: %ld\n", max_field_len ); return 0; } *field++ = **cur; (*flen)++; } /* It is really not necessary to null-terminate the field since we return length of field and also field could contain internal null characters as well */ //*field = '\0'; return 1; } /* search to beginning of next line, return 0 on error, else return 1 */ int search_eol( int fd, char **cur, char **end_buf, char **line, size_t *llen, size_t max_line_len) { while (1) { if ( *cur >= *end_buf ) { size_t res = read_block( fd, cur, end_buf ); if (res <= 0) return 0; } if ( !update_line_buffer( cur, line, llen, max_line_len ) ) return 0; if ( *(*cur-1) == '\n' ) { break; } } //**line = '\0'; // not necessary return 1; } #define MAX_FIELD_LEN 80 /* max number of characters allowed in a field */ #define MAX_LINE_LEN 80 /* max number of characters allowed on a line */ /* Get next field ( ie field #2 on a line). Fields are separated by pipes '|' in the input file. Also get the line of the field. Return 0 on error, on success: Move internal pointer to beginning of next line return 1 and the field. 
*/ size_t get_field_and_line_fast( int fd, char *field, size_t *flen, char *line, size_t *llen ) { static char *cur = NULL; static char *end_buf = NULL; size_t res; if (cur == NULL) { res = read_block( fd, &cur, &end_buf ); if ( res <= 0 ) return 0; } *llen = 0; if ( !search_field_start( fd, &cur, &end_buf, &line, llen, MAX_LINE_LEN )) return 0; if ( (res = copy_field( fd, &cur, &end_buf, field, flen, &line, llen, MAX_FIELD_LEN, MAX_LINE_LEN ) ) <= 0) return 0; if ( !search_eol( fd, &cur, &end_buf, &line, llen, MAX_LINE_LEN ) ) return 0; return 1; } void search( char *filename, SV *href) { if( !SvROK( href ) || ( SvTYPE( SvRV( href ) ) != SVt_PVHV ) ) { croak( "Not a hash reference" ); } int fd = open (filename, O_RDONLY); if (fd == -1) { croak( "Could not open file '%s'", filename ); } char field[MAX_FIELD_LEN+1]; char line[MAX_LINE_LEN+1]; size_t flen, llen; HV *hash = (HV *)SvRV( href ); while ( get_field_and_line_fast( fd, field, &flen, line, &llen ) ) { if( hv_exists( hash, field, flen ) ) fwrite( line, sizeof(char), llen, stdout); } if (close(fd) == -1) croak( "Close failed" ); }
Tests indicate that it is approximately 3 times faster than the fastest pure Perl solution (see method zdim2 in my other answer) presented here.
A possible way is to use python:
$ cat test.py
import sys,re

with open(sys.argv[1], "r") as f1:
    patterns = f1.read().splitlines() # read pattern from file1 without the trailing newline

m = re.compile("|".join(patterns))    # create the regex

with open(sys.argv[2], "r") as f2:
    for line in f2:
        if m.search(line) :
            print line,               # print line from file2 if this one matches the regex
and use it like this:
python test.py file1.txt file2.txt
Can you give a try to join? Files must be sorted though…
$ cat d.txt
bar1
bar2
foo1
foo2

$ cat e.txt
date1|bar1|number1
date2|bar2|number2
date3|bar3|number3
date1|foo1|number1
date2|foo2|number2
date3|foo3|number3

$ join --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 d.txt e.txt
date1|bar1|number1
date2|bar2|number2
date1|foo1|number1
date2|foo2|number2
Small Update:
By using LC_ALL=C in front of join, things really speed up, as can be seen in the benchmark of Håkon Hægland.
PS1: I have my doubts whether join can be faster than grep -f …
You could also use Perl to do this job:
Please note that this will hog memory, so your machine/server had better have some.
Sample Data:
%_STATION@gaurav * /root/ga/pl> head file1.txt file2.txt
==> file1.txt <==
foo1
foo2
...
bar1
bar2
...

==> file2.txt <==
date1|foo1|number1
date2|foo2|number2
date3|foo3|number3
...
date1|bar1|number1
date2|bar2|number2
date3|bar3|number3
%_STATION@gaurav * /root/ga/study/pl>
Script Output: The script will produce the final output in a file named output_comp.
%_STATION@gaurav * /root/ga/pl> ./comp.pl file1.txt file2.txt ; cat output_comp
date1|bar1|number1
date2|bar2|number2
date2|foo2|number2
date1|foo1|number1
%_STATION@gaurav * /root/ga/pl>
Script:
%_STATION@gaurav * /root/ga/pl> cat comp.pl
#!/usr/bin/perl
use strict ;
use warnings ;
use Data::Dumper ;

my ($file1,$file2) = @ARGV ;
my $output = "output_comp" ;
my %hash ;    # This will store main comparison data.
my %tmp ;     # This will store already selected results, to be skipped.
(scalar @ARGV != 2 ? (print "Need 2 files!\n") : ()) ? exit 1 : () ;

# Read all files at once and use their name as the key.
for (@ARGV) {
  open FH, "<$_" or die "Cannot open $_\n" ;
  while (my $line = <FH>) {chomp $line ;$hash{$_}{$line} = "$line"}
  close FH ;
}

# Now we churn through the data and compare to generate
# the sorted output in the output file.
open FH, ">>$output" or die "Cannot open outfile!\n" ;
foreach my $k1 (keys %{$hash{$file1}}){
  foreach my $k2 (keys %{$hash{$file2}}){
    if ($k1 =~ m/^.+?$k2.+?$/) {
      if (!defined $tmp{"$hash{$file2}{$k2}"}) {
        print FH "$hash{$file2}{$k2}\n" ;
        $tmp{"$hash{$file2}{$k2}"} = 1 ;
      }
    }
  }
}
close FH ;
%_STATION@gaurav * /root/ga/pl>
Thanks.
IMHO, grep is a good tool, highly optimised for a huge file2.txt, but maybe not for so many patterns to search for. I suggest combining all the strings of file1.txt into a single huge regexp like \|bar1|bar2|foo1|foo2\|
echo '\|'$(paste -s -d '|' file1.txt)'\|' > regexp1.txt

grep -E -f regexp1.txt file2.txt > file.matched
And of course LANG=C may help. Please give feedback or send your files so I can test myself.
I would use SQLite3 🙂 Maybe an in-memory database or whatever. Import the files and use an SQL query.
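For what it's worth, a hedged sketch of that idea in Perl, using DBI with DBD::SQLite (untested at this data size; the database file, table, and column names are made up): load both files, index the second field of the big table, and let a join do the lookups.

use strict;
use warnings;
use DBI;

# Sketch only: names are assumptions, not a tested setup for 2 billion rows.
my $dbh = DBI->connect( "dbi:SQLite:dbname=match.db", "", "",
                        { RaiseError => 1, AutoCommit => 0 } );

$dbh->do("CREATE TABLE small (f1 TEXT PRIMARY KEY)");
$dbh->do("CREATE TABLE big   (f1 TEXT, f2 TEXT, f3 TEXT)");

open my $fh, '<', 'file1.txt' or die $!;
my $ins_small = $dbh->prepare("INSERT OR IGNORE INTO small VALUES (?)");
while (<$fh>) { chomp; $ins_small->execute($_) }
close $fh;

open $fh, '<', 'file2.txt' or die $!;
my $ins_big = $dbh->prepare("INSERT INTO big VALUES (?,?,?)");
while (<$fh>) { chomp; $ins_big->execute( (split /\|/)[0, 1, 2] ) }
close $fh;

$dbh->do("CREATE INDEX idx_big_f2 ON big(f2)");
$dbh->commit;

# Print the matching lines of file2.txt
my $sth = $dbh->prepare(
    "SELECT b.f1 || '|' || b.f2 || '|' || b.f3
       FROM big b JOIN small s ON s.f1 = b.f2"
);
$sth->execute;
while ( my $row = $sth->fetchrow_arrayref ) {
    print "$row->[0]\n";
}

Using dbname=:memory: instead of a file would give the in-memory variant mentioned above, at the cost of holding everything in RAM.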
Using flex :
1: build the flex processor:
$ awk 'NR==1{ printf "%%%%\n\n.*\\|(%s",$0 } { printf "|%s",$0 } END { print ")\\|.*\\n ECHO;\n.*\\n ;\n%%\n" }' file1.txt > a.fl
2: compile it
$ flex -Ca -F a.fl ; cc -O lex.yy.c -lfl
3: and run
$ a.out < file2.txt > out
Compiling (cc …) is a slow process; this approach will pay off only for cases of a stable file1.txt.
(On my machine) Running a "search 100 in 10_000_000" test, this approach was 3 times faster than LC_ALL=C fgrep…
Though this thread is over, all grep-alike methods between two files are gathered in this post, so why not add this awk alternative, similar to (or even an improvement on) the bounty-winning awk solution by Inian:
awk 'NR==FNR{a[$0]=1;next}a[$2]' patterns.txt FS="|" datafile.txt >matches.txt # For matches restricted on Field2 of datafile
This is equivalent to Inian's awk $2 in hash solution, but it could be even faster due to the fact that we don't ask awk to check whether the whole hash array contains $2 of file2 – we just check whether a[$2] has a value or not.
While reading the first patterns file, apart from creating the hash array, we also assign a value.
If $2 of the datafile had been found before in the patterns file, then a[$2] has a value and the line will be printed because it is not null.
If a[$2] of the datafile returns no value (null), this is translated to false => no printing.
Extension to match any of the three fields of datafile:
awk 'NR==FNR{a[$0]=1;next}(a[$1] || a[$2] || a[$3])' patterns.txt FS="|" datafile.txt >matches.txt  # Printed if any of the three fields of datafile match the pattern.
In both cases, applying LC_ALL=C in front of awk seems to speed things up.
PS1: Of course this solution also has the pitfalls of all awk solutions. It is not pattern matching; it is direct/fixed matching between the two files, like most of the solutions in here.
PS2: In my poor-machine benchmark using the small benchmark files of Håkon Hægland, I get about 20% better performance compared to awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt
Setting the language etc. helps a little, perhaps.
Otherwise I cannot think of a magic solution to escape your basic issue: the data is not structured, so you will have a search that comes down to the number of lines in file1 multiplied by the number of lines in file2.
Putting the billion lines in a database and indexing it in a smart way is the only speedup I can think of. That index would have to be very smart, though…
The simple solution is: have enough memory to fit everything in. Otherwise there is not much more you can do about this…