在python大量stringreplace?
说我有一个string,看起来像这样:
str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
您会注意到string中存在和号的位置很多,后面跟着一个字符(例如“&y”和“&c”)。 我需要用字典中的适当值replace这些字符,如下所示:
dict = {"&y":"\033[0;30m", "&c":"\033[0;31m", "&b":"\033[0;32m", "&Y":"\033[0;33m", "&u":"\033[0;34m"}
什么是最快的方法来做到这一点? 我可以手动查找所有&符号,然后循环查看字典来更改它们,但是这似乎很慢。 做一堆正则expression式replace似乎也很慢(我将有一个约30-40对在我的实际代码字典)。
任何build议表示赞赏,谢谢。
编辑:
正如在这个问题的评论中已经指出的那样,我的字典是在运行时定义的,在应用程序生命周期中决不会改变。 这是一个ANSI转义序列的列表,并将有大约40个项目在其中。 我的平均string长度将比较大约500个字符,但会有一些是5000个字符(虽然,这些将是罕见的)。 我目前也在使用Python 2.6。
编辑#2我接受Tor Valamos的答案是正确的,因为它不仅提供了一个有效的解决scheme(尽pipe这不是最好的解决scheme),而是考虑到所有其他方面,并做了大量的工作来比较所有的问题。 这个答案是我在StackOverflow中遇到的最好,最有帮助的答案之一。 荣誉给你。
mydict = {"&y":"\033[0;30m", "&c":"\033[0;31m", "&b":"\033[0;32m", "&Y":"\033[0;33m", "&u":"\033[0;34m"} mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog" for k, v in mydict.iteritems(): mystr = mystr.replace(k, v) print mystr The ←[0;30mquick ←[0;31mbrown ←[0;32mfox ←[0;33mjumps over the ←[0;34mlazy dog
我冒昧地比较了几个解决scheme:
mydict = dict([('&' + chr(i), str(i)) for i in list(range(65, 91)) + list(range(97, 123))]) # random inserts between keys from random import randint rawstr = ''.join(mydict.keys()) mystr = '' for i in range(0, len(rawstr), 2): mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars from time import time # How many times to run each solution rep = 10000 print 'Running %d times with string length %d and ' \ 'random inserts of lengths 0-20' % (rep, len(mystr)) # My solution t = time() for x in range(rep): for k, v in mydict.items(): mystr.replace(k, v) #print(mystr) print '%-30s' % 'Tor fixed & variable dict', time()-t from re import sub, compile, escape # Peter Hansen t = time() for x in range(rep): sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict print '%-30s' % 'Peter fixed & variable dict', time()-t # Claudiu def multiple_replace(dict, text): # Create a regular expression from the dictionary keys regex = compile("(%s)" % "|".join(map(escape, dict.keys()))) # For each match, look-up corresponding value in dictionary return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) t = time() for x in range(rep): multiple_replace(mydict, mystr) print '%-30s' % 'Claudio variable dict', time()-t # Claudiu - Precompiled regex = compile("(%s)" % "|".join(map(escape, mydict.keys()))) t = time() for x in range(rep): regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr) print '%-30s' % 'Claudio fixed dict', time()-t # Andrew Y - variable dict def mysubst(somestr, somedict): subs = somestr.split("&") return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:])) t = time() for x in range(rep): mysubst(mystr, mydict) print '%-30s' % 'Andrew Y variable dict', time()-t # Andrew Y - fixed def repl(s): return mydict["&"+s[0:1]] + s[1:] t = time() for x in range(rep): subs = mystr.split("&") res = subs[0] + "".join(map(repl, subs[1:])) print '%-30s' % 'Andrew Y fixed dict', time()-t
结果在Python 2.6中
Running 10000 times with string length 490 and random inserts of lengths 0-20 Tor fixed & variable dict 1.04699993134 Peter fixed & variable dict 0.218999862671 Claudio variable dict 2.48400020599 Claudio fixed dict 0.0940001010895 Andrew Y variable dict 0.0309998989105 Andrew Y fixed dict 0.0310001373291
claudiu和安德鲁的解决scheme都保持为0,所以我不得不增加到10 000次。
我运行它在Python 3 (因为unicode)replace字符从39到1024(38是&符号,所以我不想包括它)。 string长度可达10.000,包括长度为0-20的variables随机插入的约980个replace。 从39到1024的unicode值会导致长度为1和2个字节的字符,这可能会影响某些解决scheme。
mydict = dict([('&' + chr(i), str(i)) for i in range(39,1024)]) # random inserts between keys from random import randint rawstr = ''.join(mydict.keys()) mystr = '' for i in range(0, len(rawstr), 2): mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars from time import time # How many times to run each solution rep = 10000 print('Running %d times with string length %d and ' \ 'random inserts of lengths 0-20' % (rep, len(mystr))) # Tor Valamo - too long #t = time() #for x in range(rep): # for k, v in mydict.items(): # mystr.replace(k, v) #print('%-30s' % 'Tor fixed & variable dict', time()-t) from re import sub, compile, escape # Peter Hansen t = time() for x in range(rep): sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict print('%-30s' % 'Peter fixed & variable dict', time()-t) # Peter 2 def dictsub(m): return mydict[m.group()] t = time() for x in range(rep): sub(r'(&[a-zA-Z])', dictsub, mystr) print('%-30s' % 'Peter fixed dict', time()-t) # Claudiu - too long #def multiple_replace(dict, text): # # Create a regular expression from the dictionary keys # regex = compile("(%s)" % "|".join(map(escape, dict.keys()))) # # # For each match, look-up corresponding value in dictionary # return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) # #t = time() #for x in range(rep): # multiple_replace(mydict, mystr) #print('%-30s' % 'Claudio variable dict', time()-t) # Claudiu - Precompiled regex = compile("(%s)" % "|".join(map(escape, mydict.keys()))) t = time() for x in range(rep): regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr) print('%-30s' % 'Claudio fixed dict', time()-t) # Separate setup for Andrew and gnibbler optimized dict mydict = dict((k[1], v) for k, v in mydict.items()) # Andrew Y - variable dict def mysubst(somestr, somedict): subs = somestr.split("&") return subs[0] + "".join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:])) def mysubst2(somestr, somedict): subs = somestr.split("&") return subs[0].join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:])) t = time() for x in range(rep): mysubst(mystr, mydict) print('%-30s' % 'Andrew Y variable dict', time()-t) t = time() for x in range(rep): mysubst2(mystr, mydict) print('%-30s' % 'Andrew Y variable dict 2', time()-t) # Andrew Y - fixed def repl(s): return mydict[s[0:1]] + s[1:] t = time() for x in range(rep): subs = mystr.split("&") res = subs[0] + "".join(map(repl, subs[1:])) print('%-30s' % 'Andrew Y fixed dict', time()-t) # gnibbler t = time() for x in range(rep): myparts = mystr.split("&") myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]] "".join(myparts) print('%-30s' % 'gnibbler fixed & variable dict', time()-t)
结果:
Running 10000 times with string length 9491 and random inserts of lengths 0-20 Tor fixed & variable dict 0.0 # disqualified 329 secs Peter fixed & variable dict 2.07799983025 Peter fixed dict 1.53100013733 Claudio variable dict 0.0 # disqualified, 37 secs Claudio fixed dict 1.5 Andrew Y variable dict 0.578000068665 Andrew Y variable dict 2 0.56299996376 Andrew Y fixed dict 0.56200003624 gnibbler fixed & variable dict 0.530999898911
(**注意,gnibbler的代码使用不同的字典,其中键没有包含'&',Andrew的代码也使用这个替代字典,但没有什么区别,可能只是加速的0.01倍。
试试这个,使用正则expression式replace和标准的string格式:
# using your stated values for str and dict: >>> import re >>> str = re.sub(r'(&[a-zA-Z])', r'%(\1)s', str) >>> str % dict 'The \x1b[0;30mquick \x1b[0;31mbrown \x1b[0;32mfox \x1b[0;33mjumps over the \x1b[0;34mlazy dog'
re.sub()调用replace所有的&符号序列,后面跟着包含相同模式的模式%(..)s的单个字母。
%格式化利用了string格式化function,可以使字典指定replace,而不是更常见的位置参数。
另一种方法可以直接在re.sub中使用callback来实现:
>>> import re >>> def dictsub(m): >>> return dict[m.group()] >>> str = re.sub(r'(&[a-zA-Z])', dictsub, str)
这一次我使用闭包从callback函数内引用字典。 这种方法可以给你更多的灵活性。 例如,你可以使用类似于dict.get(m.group(), '??')
以避免引发exception,如果你有string无法识别的代码序列。
(顺便说一句,“dict”和“str”都是内置的函数,如果你在自己的代码中使用这些名字,你会遇到麻烦,以防万一你不知道。这个问题当然是这样的。)
编辑:我决定检查Tor的testing代码,并得出结论说,这是远远不具有代表性,实际上越野车。 生成的string甚至没有和号(!)。 下面的修改后的代码生成一个代表性的字典和string,类似于OP的示例input。
我也想validation每个algorithm的输出是相同的。 下面是一个修改后的testing程序,只有Tor的,我的和Claudiu的代码 – 因为其他人打破了样本input。 (我认为它们都很脆弱,除非字典映射基本上所有可能的&符号序列,Tor的testing代码是这样做的。)这个正确的种子随机数生成器,所以每个运行是相同的。 最后,我添加了一个小的变化,使用一个生成器,避免了一些函数调用的开销,为了一个小的性能改进。
from time import time import string import random import re random.seed(1919096) # ensure consistent runs # build dictionary with 40 mappings, representative of original question mydict = dict(('&' + random.choice(string.letters), '\x1b[0;%sm' % (30+i)) for i in range(40)) # build simulated input, with mix of text, spaces, ampersands in reasonable proportions letters = string.letters + ' ' * 12 + '&' * 6 mystr = ''.join(random.choice(letters) for i in range(1000)) # How many times to run each solution rep = 10000 print('Running %d times with string length %d and %d ampersands' % (rep, len(mystr), mystr.count('&'))) # Tor Valamo # fixed from Tor's test, so it actually builds up the final string properly t = time() for x in range(rep): output = mystr for k, v in mydict.items(): output = output.replace(k, v) print('%-30s' % 'Tor fixed & variable dict', time() - t) # capture "known good" output as expected, to verify others expected = output # Peter Hansen # build charset to use in regex for safe dict lookup charset = ''.join(x[1] for x in mydict.keys()) # grab reference to method on regex, for speed patsub = re.compile(r'(&[%s])' % charset).sub t = time() for x in range(rep): output = patsub(r'%(\1)s', mystr) % mydict print('%-30s' % 'Peter fixed & variable dict', time()-t) assert output == expected # Peter 2 def dictsub(m): return mydict[m.group()] t = time() for x in range(rep): output = patsub(dictsub, mystr) print('%-30s' % 'Peter fixed dict', time() - t) assert output == expected # Peter 3 - freaky generator version, to avoid function call overhead def dictsub(d): m = yield None while 1: m = yield d[m.group()] dictsub = dictsub(mydict).send dictsub(None) # "prime" it t = time() for x in range(rep): output = patsub(dictsub, mystr) print('%-30s' % 'Peter generator', time() - t) assert output == expected # Claudiu - Precompiled regex_sub = re.compile("(%s)" % "|".join(mydict.keys())).sub t = time() for x in range(rep): output = regex_sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr) print('%-30s' % 'Claudio fixed dict', time() - t) assert output == expected
我之前忘记了包含基准testing结果:
运行10000次,string长度为1000,96个符号 ('Tor fixed&variable dict',2.9890000820159912) ('Peter fixed&variable dict',2.6659998893737793) ('Peter fixed dict',1.0920000076293945) ('彼得发电机',1.0460000038146973) ('Claudio fixed dict',1.562000036239624)
另外,input和正确的输出片段:
mystr = 'lTEQDMAPvksk k&z Txp vrnhQ GHaO&GNFY&&a...' mydict = {'&p': '\x1b[0;37m', '&q': '\x1b[0;66m', '&v': ...} output = 'lTEQDMAPvksk k←[0;57m Txp vrnhQ GHaO←[0;67mNFY&&a P...'
与我从Tor的testing代码输出中看到的结果比较:
mystr = 'VVVVVVVPPPPPPPPPPPPPPPXXXXXXXXYYYFFFFFFFFFFFFEEEEEEEEEEE...' mydict = {'&p': '112', '&q': '113', '&r': '114', '&s': '115', ...} output = # same as mystr since there were no ampersands inside
如果你真的想深入研究这个话题,请看这个: http : //en.wikipedia.org/wiki/Aho-Corasick_algorithm
通过迭代字典并replacestring中的每个元素,明显的解决scheme需要O(n*m)
时间,其中n是字典的大小,m是string的长度。
而Aho-Corasick-Algorithm在O(n+m+f)
中查找字典的所有条目,其中f是find的元素的数量。
如果列表中的键的数量很大,并且string中的出现次数很less(大多数为零),则可以遍历string中&符号的出现次数,并使用第一个string键入的字典子串的特征。 我不经常在python中编写代码,所以风格可能有点偏离,但是这里是我的看法:
str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog" dict = {"&y":"\033[0;30m", "&c":"\033[0;31m", "&b":"\033[0;32m", "&Y":"\033[0;33m", "&u":"\033[0;34m"} def rep(s): return dict["&"+s[0:1]] + s[1:] subs = str.split("&") res = subs[0] + "".join(map(rep, subs[1:])) print res
当然,有一个问题,当string本身有一个&符号时,会发生什么情况?在完成这个过程之前,您需要以某种方式逃避它,然后在这个过程之后逃避。
当然,就性能问题而言,通常情况下,对典型(也是最坏情况)数据集中的各种方法进行计时并对其进行比较是一件好事。
编辑:把它放到一个单独的函数来处理任意字典:
def mysubst(somestr, somedict): subs = somestr.split("&") return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))
编辑2:摆脱不必要的连接,在许多迭代中似乎仍比以前快一点。
def mysubst(somestr, somedict): subs = somestr.split("&") return subs[0].join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))
这里是Python的C扩展方法
const char *dvals[]={ //"0-64 "","","","","","","","","","", "","","","","","","","","","", "","","","","","","","","","", "","","","","","","","","","", "","","","","","","","","","", "","","","","","","","","","", "","","","","", //AZ "","","","","", "","","","","", "","","","","", "","","","","", "","","","","33", "", // "","","","","","", //az "","32","31","","", "","","","","", "","","","","", "","","","","", "34","","","","30", "" }; int dsub(char*d,char*s){ char *ofs=d; do{ if(*s=='&' && s[1]<='z' && *dvals[s[1]]){ //\033[0; *d++='\\',*d++='0',*d++='3',*d++='3',*d++='[',*d++='0',*d++=';'; //consider as fixed 2 digits *d++=dvals[s[1]][0]; *d++=dvals[s[1]][1]; *d++='m'; s++; //skip //non &,invalid, unused (&) ampersand sequences will go here. }else *d++=*s; }while(*s++); return d-ofs-1; }
我testing过的Python代码
from mylib import * import time start=time.time() instr="The &yquick &cbrown &bfox &Yjumps over the &ulazy dog, skip &Unknown.\n"*100000 x=dsub(instr) end=time.time() print "time taken",end-start,",input str length",len(x) print "first few lines" print x[:1100]
结果
time taken 0.140000104904 ,input str length 11000000 first few lines The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown. The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
它假设能够在O(n)上运行,并且在My Mobile Celeron 1.6 GHz PC上只花费了160 ms(avg)来处理11 MB的string
它也会跳过未知字符,例如&Unknown
将按原样返回
让我知道如果你有任何编译,错误等问题…
这似乎是你想做的 – 多个string一次使用RegExpsreplace。 这是相关的代码:
def multiple_replace(dict, text): # Create a regular expression from the dictionary keys regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys()))) # For each match, look-up corresponding value in dictionary return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) print multiple_replace(dict, str)
定义replace规则的一般解决scheme是使用正则expression式replace使用函数来提供映射(请参阅re.sub() )。
import re str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog" dict = {"&y":"\033[0;30m", "&c":"\033[0;31m", "&b":"\033[0;32m", "&Y":"\033[0;33m", "&u":"\033[0;34m"} def programmaticReplacement( match ): return dict[ match.group( 1 ) ] colorstring = re.sub( '(\&.)', programmaticReplacement, str )
这对于不重要的替代(例如任何需要math运算来创build替代品)来说都是非常好的。
这是一个使用split / join的版本
mydict = {"y":"\033[0;30m", "c":"\033[0;31m", "b":"\033[0;32m", "Y":"\033[0;33m", "u":"\033[0;34m"} mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog" myparts = mystr.split("&") myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]] print "".join(myparts)
如果有和代码无效的代码,你可以用它来保存它们
myparts[1:]=[mydict.get(x[0],"&"+x[0])+x[1:] for x in myparts[1:]]
彼得·汉森(Peter Hansen)指出,当双重&符号的时候,这个失败了。 在这种情况下使用这个版本
mystr = "The &yquick &cbrown &bfox &Yjumps over the &&ulazy dog" myparts = mystr.split("&") myparts[1:]=[mydict.get(x[:1],"&"+x[:1])+x[1:] for x in myparts[1:]] print "".join(myparts)
不知道这个解决scheme的速度,但你可以循环你的字典,并反复调用内置
str.replace(old, new)
如果原始string不太长,这可能会performance得相当好,但是当string变长时显然会受到影响。
在Python中进行大量replace的问题是string的不可变性:每次您将replacestring中的一个项目,然后整个新的string将一次又一次地从堆中重新分配。
所以,如果你想要最快的解决scheme,你需要使用可变容器(例如列表),或者在普通的C(或更好的Pyrex或Cython)中写这个机器。 在任何情况下,我build议基于简单的有限状态机编写简单的parsing器,然后逐个inputstring的符号。
build议的解决scheme基于正则expression式以类似的方式工作,因为regexp在场景后面使用fsm。
由于有人提到使用一个简单的parsing器,我想我会做一个使用pyparsing。 通过使用pyparsing的transformString方法,pyparsing在内部扫描源string,并构build匹配的文本和中介文本的列表。 当所有完成时,transformString然后''.join的这个列表,所以没有性能问题,build立string的增量。 (为ANSIreplacer定义的parsing操作将匹配的&_字符转换为所需的转义序列,并将匹配的文本replace为parsing操作的输出。因为只有匹配的序列才能满足parsing器expression式,所以不需要parsing操作来处理未定义的&_序列。)
FollowedBy('&')并不是绝对必要的,但它通过validationparsing器实际上是否位于&符之前,对所有标记选项进行更昂贵的检查之前,快捷地parsing了parsing过程。
from pyparsing import FollowedBy, oneOf escLookup = {"&y":"\033[0;30m", "&c":"\033[0;31m", "&b":"\033[0;32m", "&Y":"\033[0;33m", "&u":"\033[0;34m"} # make a single expression that will look for a leading '&', then try to # match each of the escape expressions ANSIreplacer = FollowedBy('&') + oneOf(escLookup.keys()) # add a parse action that will replace the matched text with the # corresponding ANSI sequence ANSIreplacer.setParseAction(lambda toks: escLookup[toks[0]]) # now use the replacer to transform the test string; throw in some extra # ampersands to show what happens with non-matching sequences src = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog & &Zjumps back" out = ANSIreplacer.transformString(src) print repr(out)
打印:
'The \x1b[0;30mquick \x1b[0;31mbrown \x1b[0;32mfox \x1b[0;33mjumps over the \x1b[0;34mlazy dog & &Zjumps back'
这肯定不会赢得任何表演比赛,但是如果你的标记开始变得更加复杂,那么拥有一个parsing器基础将使得它更容易扩展。
>>> a=[] >>> str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog" >>> d={"&y":"\033[0;30m", ... "&c":"\033[0;31m", ... "&b":"\033[0;32m", ... "&Y":"\033[0;33m", ... "&u":"\033[0;34m"} >>> for item in str.split(): ... if item[:2] in d: ... a.append(d[item[:2]]+item[2:]) ... else: a.append(item) >>> print ' '.join(a)
尝试这个
tr.replace( “&Y”,字典[ “&Y”])
tr.replace( “&amp; C”,字典[ “&amp; C”])
tr.replace("&b",dict["&b"])
tr.replace("&Y",dict["&Y"])
tr.replace("&u",dict["&u"])