Is there a generator version of `string.split()` in Python?
`string.split()` returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?
It is quite likely that `re.finditer` uses fairly minimal memory overhead.
import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))
Demo:
>>> list(split_iter("A programmer's RegEx test."))
['A', "programmer's", 'RegEx', 'test']
Edit: I have just confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a very large string (around 1GB), then iterated through the iterable with a for loop (not a list comprehension, which would have generated extra memory). This did not result in a noticeable growth of memory (that is, if memory did grow, it was far less than the 1GB string).
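A minimal sketch of that kind of check (the names and sizes are illustrative, not the author's actual test script; resource.getrusage assumes a POSIX system):

import re
import resource

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

big = "some words here " * (64 * 1024 * 1024)    # roughly 1 GB of text
for token in split_iter(big):                    # plain for loop, no list built
    pass

# ru_maxrss is reported in kilobytes on Linux, bytes on macOS
print("peak RSS:", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)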
The most efficient way I can think of is to write one using the `offset` parameter of the `str.find()` method. This avoids lots of memory use, and avoids relying on the overhead of a regexp when it is not needed.

[edit 2016-8-2: updated this to optionally support regex separators]
import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)
    :param sep:
        separator to split on.
    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize
This can be used however you want…
>>> print list(isplit("abcb","b"))
['a','c','']
While there is a little cost of seeking within the string each time find() or slicing is performed, this should be minimal, since strings are represented as contiguous arrays in memory.
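As a rough way to sanity-check that claim, one could time a find() from an offset against a short slice. This micro-benchmark is my own sketch (the sizes are arbitrary), not part of the original answer:

import timeit

setup = "s = 'a,' * 100000"
# cost of searching from an offset vs. taking a short slice
print(timeit.timeit("s.find(',', 50000)", setup=setup, number=1000))
print(timeit.timeit("s[50000:50010]", setup=setup, number=1000))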
This is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.
import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')
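A quick usage sketch (my own example; the output is what the code above should produce):

>>> list(itersplit("Good evening, world!"))
['Good', 'evening,', 'world!']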
EDIT: corrected the handling of surrounding whitespace when no separator chars are given.
I do not see any obvious benefit to a generator version of split(). The generator object would have to contain the whole string to iterate over, so you are not going to save any memory by having a generator.
If you wanted to write one it would be pretty easy, though:
import string

def gsplit(s, sep=string.whitespace):
    word = []
    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)
    if word:
        yield "".join(word)
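A possible usage sketch (my own example; output inferred from the code above):

>>> list(gsplit("to be  or not\tto be"))
['to', 'be', 'or', 'not', 'to', 'be']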
Here is my implementation, which is much faster and more complete than the other answers here. It has 4 separate sub-functions for the different cases.

I will just copy the docstring of the main str_split function:
str_split(s, *delims, empty=None)
Splits the string `s` by the rest of the arguments, possibly omitting empty parts (the `empty` keyword argument is responsible for that). This is a generator function.
When only one delimiter is supplied, the string is simply split by it. `empty` is then `True` by default.
str_split('[]aaa[][]bb[c', '[]') -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False) -> 'aaa', 'bb[c'
When multiple delimiters are supplied, the string is split by the longest possible sequences of those delimiters by default, or, if `empty` is set to `True`, the empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.
str_split('aaa, bb : c;', ' ', ',', ':', ';') -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True) -> 'aaa', '', 'bb', '', '', 'c', ''
When no delimiters are supplied, `string.whitespace` is used, so the effect is the same as with `str.split()`, except that this function is a generator.
str_split('aaa\\t bb c \\n') -> 'aaa', 'bb', 'c'
import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i + 1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]

def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i + dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start != i:
                yield s[start:i]
            start = i + dlen
    except ValueError:
        pass
    if start < len(s):
        yield s[start:]

def str_split(s, *delims, empty=None):
    """\
    Split the string `s` by the rest of the arguments, possibly omitting
    empty parts (`empty` keyword argument is responsible for that).
    This is a generator function.

    When only one delimiter is supplied, the string is simply split by it.
    `empty` is then `True` by default.
        str_split('[]aaa[][]bb[c', '[]')
            -> '', 'aaa', '', 'bb[c'
        str_split('[]aaa[][]bb[c', '[]', empty=False)
            -> 'aaa', 'bb[c'

    When multiple delimiters are supplied, the string is split by longest
    possible sequences of those delimiters by default, or, if `empty` is set to
    `True`, empty strings between the delimiters are also included. Note that
    the delimiters in this case may only be single characters.
        str_split('aaa, bb : c;', ' ', ',', ':', ';')
            -> 'aaa', 'bb', 'c'
        str_split('aaa, bb : c;', *' ,:;', empty=True)
            -> 'aaa', '', 'bb', '', '', 'c', ''

    When no delimiters are supplied, `string.whitespace` is used, so the effect
    is the same as `str.split()`, except this function is a generator.
        str_split('aaa\\t  bb c \\n')
            -> 'aaa', 'bb', 'c'
    """
    if len(delims) == 1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims) == 0:
        delims = string.whitespace
    delims = set(delims) if len(delims) >= 4 else ''.join(delims)
    if any(len(d) > 1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)
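A quick usage sketch showing the generator nature (my own example, not from the original answer; output inferred from the code above):

>>> g = str_split('aaa, bb : c;', *' ,:;')
>>> next(g)
'aaa'
>>> list(g)
['bb', 'c']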
This function works in Python 3, and an easy, though quite ugly, fix can be applied to make it work in both the 2 and 3 versions. The first lines of the function should be changed to:
def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')
No, but it should be easy enough to write one using `itertools.takewhile()`.
EDIT:

Very simple, half-broken implementation:
import itertools
import string

def isplitwords(s):
    i = iter(s)
    while True:
        r = []
        for c in itertools.takewhile(lambda x: not x in string.whitespace, i):
            r.append(c)
        else:
            if r:
                yield ''.join(r)
                continue
            else:
                raise StopIteration()
def split_generator(f, s):
    """
    f is a string, s is the substring we split on.
    This produces a generator rather than a possibly
    memory intensive list.
    """
    i = 0
    j = 0
    while j < len(f):
        if i >= len(f):
            yield f[j:]
            j = i
        elif f[i] != s:
            i = i + 1
        else:
            yield [f[j:i]]
            j = i + 1
            i = i + 1
I wrote a version of @ninjagecko's answer that behaves more like string.split (i.e. whitespace delimited by default, and you can specify a delimiter).
import re

def isplit(string, delimiter=None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                         delimiter)

    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))
Here are the tests I used (in both Python 3 and Python 2):
# Wrapper to make it a list
def helper(*args, **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3, ", ";") == ["1", "2 ", "3, "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2 3 ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass
Python's regex module says that it does "the right thing" for Unicode whitespace, but I have not actually tested it.
Also available as a gist.
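One quick way to probe that Unicode behaviour yourself (my own hypothetical check, not part of the original answer; it assumes Python 3 str semantics):

# U+00A0 (no-break space) counts as whitespace for \s in Python 3 str patterns,
# so the isplit() defined above should split on it
print(list(isplit("a\u00a0b")))   # expected: ['a', 'b']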
If you would also like to be able to read an iterator (as well as return one), try this:
import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)
Usage
>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']
I did some performance testing of the various methods proposed (I won't repeat it here). Some results:
- str.split (default) = 0.3461570239996945
- manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
- re.finditer (ninjagecko's answer) = 0.698872097000276
- str.find (one of Eli Collins's answers) = 0.7230395330007013
- itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
- str.split(..., maxsplit=1) recursion = N/A†

† The recursion answers (string.split with maxsplit=1) fail to complete in a reasonable time. Given string.split's speed, they may work better on shorter strings, but then I cannot see a use case for short strings where memory is not an issue anyway.
Tested using `timeit` on:
the_text = "100 " * 9999 + "100" def test_function( method ): def fn( ): total = 0 for x in method( the_text ): total += int( x ) return total return fn
This raises another question as to why string.split is so much faster despite its memory usage.
You can build one quite easily yourself using str.split with a limit:
def isplit(s, sep=None):
    while s:
        parts = s.split(sep, 1)
        if len(parts) == 2:
            s = parts[1]
        else:
            s = ''
        yield parts[0]
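For example (my own usage sketch; output inferred from the code above):

>>> list(isplit("a b   c"))
['a', 'b', 'c']
>>> list(isplit("a,b,,c", ","))
['a', 'b', '', 'c']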
This way you do not need to replicate strip()'s functionality and behaviour (e.g. when sep=None), and it relies on the presumably fast native implementation. I assume that string.split stops scanning the string for separators once it has enough 'parts'.
As Glenn Maynard pointed out, this scales badly for large strings (O(n^2)), since each pass copies the remaining tail of the string. I have confirmed this via 'timeit' tests.
In my case, I needed (at least) to be able to use files as generators as well.

This is what I did when preparing to process some huge files with blocks of text separated by blank lines (the corner cases would need thorough testing if this is to be used in a production system):
from __future__ import print_function

def isplit(iterable, sep=None):
    r = ''
    for c in iterable:
        r += c
        if sep is None:
            if not c.strip():
                r = r[:-1]
                if r:
                    yield r
                r = ''
        elif r.endswith(sep):
            r = r[:-len(sep)]
            yield r
            r = ''
    if r:
        yield r


def read_blocks(filename):
    """read a file as a sequence of blocks separated by empty line"""
    with open(filename) as ifh:
        for block in isplit(ifh, '\n\n'):
            yield block.splitlines()


if __name__ == "__main__":
    for lineno, block in enumerate(read_blocks("logfile.txt"), 1):
        print(lineno, ':')
        print('\n'.join(block))
        print('-' * 40)

    print('Testing skip with None.')
    for word in isplit('\tTony \t Jarkko \n Veijalainen\n'):
        print(word)