Is there a generator version of `string.split()` in Python?
`string.split()` returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?
It is quite likely that `re.finditer` uses fairly minimal memory overhead.
import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))
Demo:
>>> list(split_iter("A programmer's RegEx test."))
['A', "programmer's", 'RegEx', 'test']
Edit: I have just confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a very large string (around 1GB), then iterated through the iterable with a for loop (not a list comprehension, which would have generated extra memory). This did not result in a noticeable growth of memory (that is, if memory did grow, it was far less than the 1GB string).
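A minimal sketch of that kind of check (the names and sizes are illustrative, not the author's actual test script; resource.getrusage assumes a POSIX system):

import re
import resource

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

big = "some words here " * (64 * 1024 * 1024)    # roughly 1 GB of text
for token in split_iter(big):                    # plain for loop, no list built
    pass

# ru_maxrss is reported in kilobytes on Linux, bytes on macOS
print("peak RSS:", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)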
The most efficient way I can think of is to write one using the `offset` parameter of the `str.find()` method. This avoids lots of memory use, and avoids relying on the overhead of a regexp when it is not needed.

[edit 2016-8-2: updated this to optionally support regex separators]
import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)
    :param sep:
        separator to split on.
    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize
This can be used however you want…
>>> print list(isplit("abcb","b"))
['a','c','']
While there is a little cost of seeking within the string each time find() or slicing is performed, this should be minimal, since strings are represented as contiguous arrays in memory.
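As a rough way to sanity-check that claim, one could time a find() from an offset against a short slice. This micro-benchmark is my own sketch (the sizes are arbitrary), not part of the original answer:

import timeit

setup = "s = 'a,' * 100000"
# cost of searching from an offset vs. taking a short slice
print(timeit.timeit("s.find(',', 50000)", setup=setup, number=1000))
print(timeit.timeit("s[50000:50010]", setup=setup, number=1000))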
This is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.
import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')
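A quick usage sketch (my own example; the output is what the code above should produce):

>>> list(itersplit("Good evening, world!"))
['Good', 'evening,', 'world!']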
EDIT: corrected the handling of surrounding whitespace when no separator chars are given.
I do not see any obvious benefit to a generator version of split(). The generator object would have to contain the whole string to iterate over, so you are not going to save any memory by having a generator.
If you wanted to write one it would be pretty easy, though:
import string

def gsplit(s, sep=string.whitespace):
    word = []
    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)
    if word:
        yield "".join(word)
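A possible usage sketch (my own example; output inferred from the code above):

>>> list(gsplit("to be  or not\tto be"))
['to', 'be', 'or', 'not', 'to', 'be']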
Here is my implementation, which is much faster and more complete than the other answers here. It has 4 separate sub-functions for the different cases.

I will just copy the docstring of the main str_split function:
str_split(s, *delims, empty=None)
Splits the string `s` by the rest of the arguments, possibly omitting empty parts (the `empty` keyword argument is responsible for that). This is a generator function.
When only one delimiter is supplied, the string is simply split by it. `empty` is then `True` by default.
str_split('[]aaa[][]bb[c', '[]') -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False) -> 'aaa', 'bb[c'
When multiple delimiters are supplied, the string is split by the longest possible sequences of those delimiters by default, or, if `empty` is set to `True`, the empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.
str_split('aaa, bb : c;', ' ', ',', ':', ';') -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True) -> 'aaa', '', 'bb', '', '', 'c', ''
When no delimiters are supplied, `string.whitespace` is used, so the effect is the same as with `str.split()`, except that this function is a generator.
str_split('aaa\\t bb c \\n') -> 'aaa', 'bb', 'c'
import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i + 1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]

def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i + dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start != i:
                yield s[start:i]
            start = i + dlen
    except ValueError:
        pass
    if start < len(s):
        yield s[start:]

def str_split(s, *delims, empty=None):
    """\
    Split the string `s` by the rest of the arguments, possibly omitting
    empty parts (`empty` keyword argument is responsible for that).
    This is a generator function.

    When only one delimiter is supplied, the string is simply split by it.
    `empty` is then `True` by default.
        str_split('[]aaa[][]bb[c', '[]')
            -> '', 'aaa', '', 'bb[c'
        str_split('[]aaa[][]bb[c', '[]', empty=False)
            -> 'aaa', 'bb[c'

    When multiple delimiters are supplied, the string is split by longest
    possible sequences of those delimiters by default, or, if `empty` is set to
    `True`, empty strings between the delimiters are also included. Note that
    the delimiters in this case may only be single characters.
        str_split('aaa, bb : c;', ' ', ',', ':', ';')
            -> 'aaa', 'bb', 'c'
        str_split('aaa, bb : c;', *' ,:;', empty=True)
            -> 'aaa', '', 'bb', '', '', 'c', ''

    When no delimiters are supplied, `string.whitespace` is used, so the effect
    is the same as `str.split()`, except this function is a generator.
        str_split('aaa\\t  bb c \\n')
            -> 'aaa', 'bb', 'c'
    """
    if len(delims) == 1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims) == 0:
        delims = string.whitespace
    delims = set(delims) if len(delims) >= 4 else ''.join(delims)
    if any(len(d) > 1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)
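A quick usage sketch showing the generator nature (my own example, not from the original answer; output inferred from the code above):

>>> g = str_split('aaa, bb : c;', *' ,:;')
>>> next(g)
'aaa'
>>> list(g)
['bb', 'c']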
This function works in Python 3, and an easy, though quite ugly, fix can be applied to make it work in both the 2 and 3 versions. The first lines of the function should be changed to:
def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')
No, but it should be easy enough to write one using `itertools.takewhile()`.
EDIT:

Very simple, half-broken implementation:
import itertools
import string

def isplitwords(s):
    i = iter(s)
    while True:
        r = []
        for c in itertools.takewhile(lambda x: not x in string.whitespace, i):
            r.append(c)
        else:
            if r:
                yield ''.join(r)
                continue
            else:
                raise StopIteration()
def split_generator(f, s):
    """
    f is a string, s is the substring we split on.
    This produces a generator rather than a possibly
    memory intensive list.
    """
    i = 0
    j = 0
    while j < len(f):
        if i >= len(f):
            yield f[j:]
            j = i
        elif f[i] != s:
            i = i + 1
        else:
            yield [f[j:i]]
            j = i + 1
            i = i + 1
I wrote a version of @ninjagecko's answer that behaves more like string.split (i.e. whitespace delimited by default, and you can specify a delimiter).
import re

def isplit(string, delimiter=None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                         delimiter)

    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))
Here are the tests I used (in both Python 3 and Python 2):
# Wrapper to make it a list
def helper(*args, **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3, ", ";") == ["1", "2 ", "3, "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2 3 ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass
Python's regex module says that it does "the right thing" for Unicode whitespace, but I have not actually tested it.
Also available as a gist.
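One quick way to probe that Unicode behaviour yourself (my own hypothetical check, not part of the original answer; it assumes Python 3 str semantics):

# U+00A0 (no-break space) counts as whitespace for \s in Python 3 str patterns,
# so the isplit() defined above should split on it
print(list(isplit("a\u00a0b")))   # expected: ['a', 'b']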
If you would also like to be able to read an iterator (as well as return one), try this:
import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)
Usage
>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']
I did some performance testing of the various methods proposed (I won't repeat it here). Some results:
- str.split (default) = 0.3461570239996945
- manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
- re.finditer (ninjagecko's answer) = 0.698872097000276
- str.find (one of Eli Collins's answers) = 0.7230395330007013
- itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
- str.split(..., maxsplit=1) recursion = N/A†

† The recursion answers (string.split with maxsplit=1) fail to complete in a reasonable time. Given string.split's speed, they may work better on shorter strings, but then I cannot see a use case for short strings where memory is not an issue anyway.
Tested using `timeit` on:
the_text = "100 " * 9999 + "100" def test_function( method ): def fn( ): total = 0 for x in method( the_text ): total += int( x ) return total return fn
This raises another question as to why string.split is so much faster despite its memory usage.
You can build one quite easily yourself using str.split with a limit:
def isplit(s, sep=None):
    while s:
        parts = s.split(sep, 1)
        if len(parts) == 2:
            s = parts[1]
        else:
            s = ''
        yield parts[0]
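For example (my own usage sketch; output inferred from the code above):

>>> list(isplit("a b   c"))
['a', 'b', 'c']
>>> list(isplit("a,b,,c", ","))
['a', 'b', '', 'c']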
This way you do not need to replicate strip()'s functionality and behaviour (e.g. when sep=None), and it relies on the presumably fast native implementation. I assume that string.split stops scanning the string for separators once it has enough 'parts'.
As Glenn Maynard pointed out, this scales badly for large strings (O(n^2)), since each pass copies the remaining tail of the string. I have confirmed this via 'timeit' tests.
In my case, I needed (at least) to be able to use files as generators as well.

This is what I did when preparing to process some huge files with blocks of text separated by blank lines (the corner cases would need thorough testing if this is to be used in a production system):
from __future__ import print_function

def isplit(iterable, sep=None):
    r = ''
    for c in iterable:
        r += c
        if sep is None:
            if not c.strip():
                r = r[:-1]
                if r:
                    yield r
                r = ''
        elif r.endswith(sep):
            r = r[:-len(sep)]
            yield r
            r = ''
    if r:
        yield r


def read_blocks(filename):
    """read a file as a sequence of blocks separated by empty line"""
    with open(filename) as ifh:
        for block in isplit(ifh, '\n\n'):
            yield block.splitlines()


if __name__ == "__main__":
    for lineno, block in enumerate(read_blocks("logfile.txt"), 1):
        print(lineno, ':')
        print('\n'.join(block))
        print('-' * 40)

    print('Testing skip with None.')
    for word in isplit('\tTony \t Jarkko \n Veijalainen\n'):
        print(word)