在python大量stringreplace？

说我有一个string，看起来像这样：

str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

您会注意到string中存在和号的位置很多，后面跟着一个字符（例如“＆y”和“＆c”）。我需要用字典中的适当值replace这些字符，如下所示：

 dict = {"&y":"\033[0;30m", "&c":"\033[0;31m", "&b":"\033[0;32m", "&Y":"\033[0;33m", "&u":"\033[0;34m"}

什么是最快的方法来做到这一点？我可以手动查找所有＆符号，然后循环查看字典来更改它们，但是这似乎很慢。做一堆正则expression式replace似乎也很慢（我将有一个约30-40对在我的实际代码字典）。

任何build议表示赞赏，谢谢。

编辑：

正如在这个问题的评论中已经指出的那样，我的字典是在运行时定义的，在应用程序生命周期中决不会改变。这是一个ANSI转义序列的列表，并将有大约40个项目在其中。我的平均string长度将比较大约500个字符，但会有一些是5000个字符（虽然，这些将是罕见的）。我目前也在使用Python 2.6。

编辑＃2我接受Tor Valamos的答案是正确的，因为它不仅提供了一个有效的解决scheme（尽pipe这不是最好的解决scheme），而是考虑到所有其他方面，并做了大量的工作来比较所有的问题。这个答案是我在StackOverflow中遇到的最好，最有帮助的答案之一。荣誉给你。

 mydict = {"&y":"\033[0;30m", "&c":"\033[0;31m", "&b":"\033[0;32m", "&Y":"\033[0;33m", "&u":"\033[0;34m"} mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog" for k, v in mydict.iteritems(): mystr = mystr.replace(k, v) print mystr The ←[0;30mquick ←[0;31mbrown ←[0;32mfox ←[0;33mjumps over the ←[0;34mlazy dog

我冒昧地比较了几个解决scheme：

 mydict = dict([('&' + chr(i), str(i)) for i in list(range(65, 91)) + list(range(97, 123))]) # random inserts between keys from random import randint rawstr = ''.join(mydict.keys()) mystr = '' for i in range(0, len(rawstr), 2): mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars from time import time # How many times to run each solution rep = 10000 print 'Running %d times with string length %d and ' \ 'random inserts of lengths 0-20' % (rep, len(mystr)) # My solution t = time() for x in range(rep): for k, v in mydict.items(): mystr.replace(k, v) #print(mystr) print '%-30s' % 'Tor fixed & variable dict', time()-t from re import sub, compile, escape # Peter Hansen t = time() for x in range(rep): sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict print '%-30s' % 'Peter fixed & variable dict', time()-t # Claudiu def multiple_replace(dict, text): # Create a regular expression from the dictionary keys regex = compile("(%s)" % "|".join(map(escape, dict.keys()))) # For each match, look-up corresponding value in dictionary return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) t = time() for x in range(rep): multiple_replace(mydict, mystr) print '%-30s' % 'Claudio variable dict', time()-t # Claudiu - Precompiled regex = compile("(%s)" % "|".join(map(escape, mydict.keys()))) t = time() for x in range(rep): regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr) print '%-30s' % 'Claudio fixed dict', time()-t # Andrew Y - variable dict def mysubst(somestr, somedict): subs = somestr.split("&") return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:])) t = time() for x in range(rep): mysubst(mystr, mydict) print '%-30s' % 'Andrew Y variable dict', time()-t # Andrew Y - fixed def repl(s): return mydict["&"+s[0:1]] + s[1:] t = time() for x in range(rep): subs = mystr.split("&") res = subs[0] + "".join(map(repl, subs[1:])) print '%-30s' % 'Andrew Y fixed dict', time()-t

结果在Python 2.6中

 Running 10000 times with string length 490 and random inserts of lengths 0-20 Tor fixed & variable dict 1.04699993134 Peter fixed & variable dict 0.218999862671 Claudio variable dict 2.48400020599 Claudio fixed dict 0.0940001010895 Andrew Y variable dict 0.0309998989105 Andrew Y fixed dict 0.0310001373291

claudiu和安德鲁的解决scheme都保持为0，所以我不得不增加到10 000次。

我运行它在Python 3 （因为unicode）replace字符从39到1024（38是＆符号，所以我不想包括它）。 string长度可达10.000，包括长度为0-20的variables随机插入的约980个replace。从39到1024的unicode值会导致长度为1和2个字节的字符，这可能会影响某些解决scheme。

 mydict = dict([('&' + chr(i), str(i)) for i in range(39,1024)]) # random inserts between keys from random import randint rawstr = ''.join(mydict.keys()) mystr = '' for i in range(0, len(rawstr), 2): mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars from time import time # How many times to run each solution rep = 10000 print('Running %d times with string length %d and ' \ 'random inserts of lengths 0-20' % (rep, len(mystr))) # Tor Valamo - too long #t = time() #for x in range(rep): # for k, v in mydict.items(): # mystr.replace(k, v) #print('%-30s' % 'Tor fixed & variable dict', time()-t) from re import sub, compile, escape # Peter Hansen t = time() for x in range(rep): sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict print('%-30s' % 'Peter fixed & variable dict', time()-t) # Peter 2 def dictsub(m): return mydict[m.group()] t = time() for x in range(rep): sub(r'(&[a-zA-Z])', dictsub, mystr) print('%-30s' % 'Peter fixed dict', time()-t) # Claudiu - too long #def multiple_replace(dict, text): # # Create a regular expression from the dictionary keys # regex = compile("(%s)" % "|".join(map(escape, dict.keys()))) # # # For each match, look-up corresponding value in dictionary # return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) # #t = time() #for x in range(rep): # multiple_replace(mydict, mystr) #print('%-30s' % 'Claudio variable dict', time()-t) # Claudiu - Precompiled regex = compile("(%s)" % "|".join(map(escape, mydict.keys()))) t = time() for x in range(rep): regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr) print('%-30s' % 'Claudio fixed dict', time()-t) # Separate setup for Andrew and gnibbler optimized dict mydict = dict((k[1], v) for k, v in mydict.items()) # Andrew Y - variable dict def mysubst(somestr, somedict): subs = somestr.split("&") return subs[0] + "".join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:])) def mysubst2(somestr, somedict): subs = somestr.split("&") return subs[0].join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:])) t = time() for x in range(rep): mysubst(mystr, mydict) print('%-30s' % 'Andrew Y variable dict', time()-t) t = time() for x in range(rep): mysubst2(mystr, mydict) print('%-30s' % 'Andrew Y variable dict 2', time()-t) # Andrew Y - fixed def repl(s): return mydict[s[0:1]] + s[1:] t = time() for x in range(rep): subs = mystr.split("&") res = subs[0] + "".join(map(repl, subs[1:])) print('%-30s' % 'Andrew Y fixed dict', time()-t) # gnibbler t = time() for x in range(rep): myparts = mystr.split("&") myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]] "".join(myparts) print('%-30s' % 'gnibbler fixed & variable dict', time()-t)

结果：

 Running 10000 times with string length 9491 and random inserts of lengths 0-20 Tor fixed & variable dict 0.0 # disqualified 329 secs Peter fixed & variable dict 2.07799983025 Peter fixed dict 1.53100013733 Claudio variable dict 0.0 # disqualified, 37 secs Claudio fixed dict 1.5 Andrew Y variable dict 0.578000068665 Andrew Y variable dict 2 0.56299996376 Andrew Y fixed dict 0.56200003624 gnibbler fixed & variable dict 0.530999898911

（**注意，gnibbler的代码使用不同的字典，其中键没有包含'＆'，Andrew的代码也使用这个替代字典，但没有什么区别，可能只是加速的0.01倍。

试试这个，使用正则expression式replace和标准的string格式：

 # using your stated values for str and dict: >>> import re >>> str = re.sub(r'(&[a-zA-Z])', r'%(\1)s', str) >>> str % dict 'The \x1b[0;30mquick \x1b[0;31mbrown \x1b[0;32mfox \x1b[0;33mjumps over the \x1b[0;34mlazy dog'

re.sub（）调用replace所有的＆符号序列，后面跟着包含相同模式的模式％（..）s的单个字母。

％格式化利用了string格式化function，可以使字典指定replace，而不是更常见的位置参数。

另一种方法可以直接在re.sub中使用callback来实现：

 >>> import re >>> def dictsub(m): >>> return dict[m.group()] >>> str = re.sub(r'(&[a-zA-Z])', dictsub, str)

这一次我使用闭包从callback函数内引用字典。这种方法可以给你更多的灵活性。例如，你可以使用类似于dict.get(m.group(), '??')以避免引发exception，如果你有string无法识别的代码序列。

（顺便说一句，“dict”和“str”都是内置的函数，如果你在自己的代码中使用这些名字，你会遇到麻烦，以防万一你不知道。这个问题当然是这样的。）

编辑：我决定检查Tor的testing代码，并得出结论说，这是远远不具有代表性，实际上越野车。生成的string甚至没有和号（！）。下面的修改后的代码生成一个代表性的字典和string，类似于OP的示例input。

我也想validation每个algorithm的输出是相同的。下面是一个修改后的testing程序，只有Tor的，我的和Claudiu的代码 – 因为其他人打破了样本input。（我认为它们都很脆弱，除非字典映射基本上所有可能的＆符号序列，Tor的testing代码是这样做的。）这个正确的种子随机数生成器，所以每个运行是相同的。最后，我添加了一个小的变化，使用一个生成器，避免了一些函数调用的开销，为了一个小的性能改进。

 from time import time import string import random import re random.seed(1919096) # ensure consistent runs # build dictionary with 40 mappings, representative of original question mydict = dict(('&' + random.choice(string.letters), '\x1b[0;%sm' % (30+i)) for i in range(40)) # build simulated input, with mix of text, spaces, ampersands in reasonable proportions letters = string.letters + ' ' * 12 + '&' * 6 mystr = ''.join(random.choice(letters) for i in range(1000)) # How many times to run each solution rep = 10000 print('Running %d times with string length %d and %d ampersands' % (rep, len(mystr), mystr.count('&'))) # Tor Valamo # fixed from Tor's test, so it actually builds up the final string properly t = time() for x in range(rep): output = mystr for k, v in mydict.items(): output = output.replace(k, v) print('%-30s' % 'Tor fixed & variable dict', time() - t) # capture "known good" output as expected, to verify others expected = output # Peter Hansen # build charset to use in regex for safe dict lookup charset = ''.join(x[1] for x in mydict.keys()) # grab reference to method on regex, for speed patsub = re.compile(r'(&[%s])' % charset).sub t = time() for x in range(rep): output = patsub(r'%(\1)s', mystr) % mydict print('%-30s' % 'Peter fixed & variable dict', time()-t) assert output == expected # Peter 2 def dictsub(m): return mydict[m.group()] t = time() for x in range(rep): output = patsub(dictsub, mystr) print('%-30s' % 'Peter fixed dict', time() - t) assert output == expected # Peter 3 - freaky generator version, to avoid function call overhead def dictsub(d): m = yield None while 1: m = yield d[m.group()] dictsub = dictsub(mydict).send dictsub(None) # "prime" it t = time() for x in range(rep): output = patsub(dictsub, mystr) print('%-30s' % 'Peter generator', time() - t) assert output == expected # Claudiu - Precompiled regex_sub = re.compile("(%s)" % "|".join(mydict.keys())).sub t = time() for x in range(rep): output = regex_sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr) print('%-30s' % 'Claudio fixed dict', time() - t) assert output == expected

我之前忘记了包含基准testing结果：

    运行10000次，string长度为1000，96个符号
     （'Tor fixed＆variable dict'，2.9890000820159912）
     （'Peter fixed＆variable dict'，2.6659998893737793）
     （'Peter fixed dict'，1.0920000076293945）
     （'彼得发电机'，1.0460000038146973）
     （'Claudio fixed dict'，1.562000036239624）

另外，input和正确的输出片段：

 mystr = 'lTEQDMAPvksk k&z Txp vrnhQ GHaO&GNFY&&a...' mydict = {'&p': '\x1b[0;37m', '&q': '\x1b[0;66m', '&v': ...} output = 'lTEQDMAPvksk k←[0;57m Txp vrnhQ GHaO←[0;67mNFY&&a P...'

与我从Tor的testing代码输出中看到的结果比较：

 mystr = 'VVVVVVVPPPPPPPPPPPPPPPXXXXXXXXYYYFFFFFFFFFFFFEEEEEEEEEEE...' mydict = {'&p': '112', '&q': '113', '&r': '114', '&s': '115', ...} output = # same as mystr since there were no ampersands inside

如果你真的想深入研究这个话题，请看这个： http : //en.wikipedia.org/wiki/Aho-Corasick_algorithm

通过迭代字典并replacestring中的每个元素，明显的解决scheme需要O(n*m)时间，其中n是字典的大小，m是string的长度。

而Aho-Corasick-Algorithm在O(n+m+f)中查找字典的所有条目，其中f是find的元素的数量。

如果列表中的键的数量很大，并且string中的出现次数很less（大多数为零），则可以遍历string中＆符号的出现次数，并使用第一个string键入的字典子串的特征。我不经常在python中编写代码，所以风格可能有点偏离，但是这里是我的看法：

 str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog" dict = {"&y":"\033[0;30m", "&c":"\033[0;31m", "&b":"\033[0;32m", "&Y":"\033[0;33m", "&u":"\033[0;34m"} def rep(s): return dict["&"+s[0:1]] + s[1:] subs = str.split("&") res = subs[0] + "".join(map(rep, subs[1:])) print res

当然，有一个问题，当string本身有一个＆符号时，会发生什么情况？在完成这个过程之前，您需要以某种方式逃避它，然后在这个过程之后逃避。

当然，就性能问题而言，通常情况下，对典型（也是最坏情况）数据集中的各种方法进行计时并对其进行比较是一件好事。

编辑：把它放到一个单独的函数来处理任意字典：

 def mysubst(somestr, somedict): subs = somestr.split("&") return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))

编辑2：摆脱不必要的连接，在许多迭代中似乎仍比以前快一点。

 def mysubst(somestr, somedict): subs = somestr.split("&") return subs[0].join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))

这里是Python的C扩展方法

 const char *dvals[]={ //"0-64 "","","","","","","","","","", "","","","","","","","","","", "","","","","","","","","","", "","","","","","","","","","", "","","","","","","","","","", "","","","","","","","","","", "","","","","", //AZ "","","","","", "","","","","", "","","","","", "","","","","", "","","","","33", "", // "","","","","","", //az "","32","31","","", "","","","","", "","","","","", "","","","","", "34","","","","30", "" }; int dsub(char*d,char*s){ char *ofs=d; do{ if(*s=='&' && s[1]<='z' && *dvals[s[1]]){ //\033[0; *d++='\\',*d++='0',*d++='3',*d++='3',*d++='[',*d++='0',*d++=';'; //consider as fixed 2 digits *d++=dvals[s[1]][0]; *d++=dvals[s[1]][1]; *d++='m'; s++; //skip //non &,invalid, unused (&) ampersand sequences will go here. }else *d++=*s; }while(*s++); return d-ofs-1; }

我testing过的Python代码

 from mylib import * import time start=time.time() instr="The &yquick &cbrown &bfox &Yjumps over the &ulazy dog, skip &Unknown.\n"*100000 x=dsub(instr) end=time.time() print "time taken",end-start,",input str length",len(x) print "first few lines" print x[:1100]

结果

 time taken 0.140000104904 ,input str length 11000000 first few lines The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.

它假设能够在O（n）上运行，并且在My Mobile Celeron 1.6 GHz PC上只花费了160 ms（avg）来处理11 MB的string

它也会跳过未知字符，例如&Unknown将按原样返回

让我知道如果你有任何编译，错误等问题…

这似乎是你想做的 – 多个string一次使用RegExpsreplace。这是相关的代码：

 def multiple_replace(dict, text): # Create a regular expression from the dictionary keys regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys()))) # For each match, look-up corresponding value in dictionary return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) print multiple_replace(dict, str)

定义replace规则的一般解决scheme是使用正则expression式replace使用函数来提供映射（请参阅re.sub（））。

 import re str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog" dict = {"&y":"\033[0;30m", "&c":"\033[0;31m", "&b":"\033[0;32m", "&Y":"\033[0;33m", "&u":"\033[0;34m"} def programmaticReplacement( match ): return dict[ match.group( 1 ) ] colorstring = re.sub( '(\&.)', programmaticReplacement, str )

这对于不重要的替代（例如任何需要math运算来创build替代品）来说都是非常好的。

这是一个使用split / join的版本

 mydict = {"y":"\033[0;30m", "c":"\033[0;31m", "b":"\033[0;32m", "Y":"\033[0;33m", "u":"\033[0;34m"} mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog" myparts = mystr.split("&") myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]] print "".join(myparts)

如果有和代码无效的代码，你可以用它来保存它们

 myparts[1:]=[mydict.get(x[0],"&"+x[0])+x[1:] for x in myparts[1:]]

彼得·汉森（Peter Hansen）指出，当双重＆符号的时候，这个失败了。在这种情况下使用这个版本

 mystr = "The &yquick &cbrown &bfox &Yjumps over the &&ulazy dog" myparts = mystr.split("&") myparts[1:]=[mydict.get(x[:1],"&"+x[:1])+x[1:] for x in myparts[1:]] print "".join(myparts)

不知道这个解决scheme的速度，但你可以循环你的字典，并反复调用内置

str.replace(old, new)

如果原始string不太长，这可能会performance得相当好，但是当string变长时显然会受到影响。

在Python中进行大量replace的问题是string的不可变性：每次您将replacestring中的一个项目，然后整个新的string将一次又一次地从堆中重新分配。

所以，如果你想要最快的解决scheme，你需要使用可变容器（例如列表），或者在普通的C（或更好的Pyrex或Cython）中写这个机器。在任何情况下，我build议基于简单的有限状态机编写简单的parsing器，然后逐个inputstring的符号。

build议的解决scheme基于正则expression式以类似的方式工作，因为regexp在场景后面使用fsm。

由于有人提到使用一个简单的parsing器，我想我会做一个使用pyparsing。通过使用pyparsing的transformString方法，pyparsing在内部扫描源string，并构build匹配的文本和中介文本的列表。当所有完成时，transformString然后''.join的这个列表，所以没有性能问题，build立string的增量。（为ANSIreplacer定义的parsing操作将匹配的＆_字符转换为所需的转义序列，并将匹配的文本replace为parsing操作的输出。因为只有匹配的序列才能满足parsing器expression式，所以不需要parsing操作来处理未定义的＆_序列。）

FollowedBy（'＆'）并不是绝对必要的，但它通过validationparsing器实际上是否位于＆符之前，对所有标记选项进行更昂贵的检查之前，快捷地parsing了parsing过程。

 from pyparsing import FollowedBy, oneOf escLookup = {"&y":"\033[0;30m", "&c":"\033[0;31m", "&b":"\033[0;32m", "&Y":"\033[0;33m", "&u":"\033[0;34m"} # make a single expression that will look for a leading '&', then try to # match each of the escape expressions ANSIreplacer = FollowedBy('&') + oneOf(escLookup.keys()) # add a parse action that will replace the matched text with the # corresponding ANSI sequence ANSIreplacer.setParseAction(lambda toks: escLookup[toks[0]]) # now use the replacer to transform the test string; throw in some extra # ampersands to show what happens with non-matching sequences src = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog & &Zjumps back" out = ANSIreplacer.transformString(src) print repr(out)

打印：

 'The \x1b[0;30mquick \x1b[0;31mbrown \x1b[0;32mfox \x1b[0;33mjumps over the \x1b[0;34mlazy dog & &Zjumps back'

这肯定不会赢得任何表演比赛，但是如果你的标记开始变得更加复杂，那么拥有一个parsing器基础将使得它更容易扩展。

 >>> a=[] >>> str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog" >>> d={"&y":"\033[0;30m", ... "&c":"\033[0;31m", ... "&b":"\033[0;32m", ... "&Y":"\033[0;33m", ... "&u":"\033[0;34m"} >>> for item in str.split(): ... if item[:2] in d: ... a.append(d[item[:2]]+item[2:]) ... else: a.append(item) >>> print ' '.join(a)

尝试这个

tr.replace（ “＆Y”，字典[ “＆Y”]）

tr.replace（ “＆amp; C”，字典[ “＆amp; C”]）

tr.replace("&b",dict["&b"])

tr.replace("&Y",dict["&Y"])

tr.replace("&u",dict["&u"])

在python大量stringreplace？

Javareplace文本文件中的行

在Python 3中加速数以百万计的正则expression式replace

List中的元素用scalareplace

有没有一种有效的方法来在Ruby中执行数百个文本replace？

用Pythonreplace列表中的值

如何使用bash删除并replaceterminal中的最后一行？

重命名pandas列

Ruby – 用另一个stringreplace第一次出现的子string

在Visual Studio中多行查找和replace

正则expression式匹配/replaceJavaScript注释（多行和内联）