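Find the nth occurrence of a substring in a string

This seems like it should be pretty trivial, but I am new to Python and want to do it the most Pythonic way.

I want to find the nth occurrence of a substring in a string.

There has got to be something equivalent to what I want to do, which is:

    mystring.find("substring", 2nd)

How would you achieve this in Python?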

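Mark's iterative approach would be the usual way, I think.

Here's an alternative with string splitting, which can often be useful for find-related processing:

    def findnth(haystack, needle, n):
        parts = haystack.split(needle, n+1)
        if len(parts) <= n+1:
            return -1
        return len(haystack) - len(parts[-1]) - len(needle)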
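To see why the arithmetic in findnth() works, it can help to trace it on a small input. The following is only a sketch (Python 3; the sample string is made up for illustration). Note that n is zero-based in this function, so n=0 asks for the first occurrence.

```python
def findnth(haystack, needle, n):
    # Split at most n+1 times, so the last part is everything that
    # follows occurrence number n (counting from zero).
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1  # fewer than n+1 occurrences
    # Whatever is not in the tail, minus the needle itself, is the offset.
    return len(haystack) - len(parts[-1]) - len(needle)

text = "foo bar bar bar"               # illustrative input
print(text.split("bar", 2))            # ['foo ', ' ', ' bar']
print(findnth(text, "bar", 0))         # 4
print(findnth(text, "bar", 1))         # 8
print(findnth(text, "bar", 3))         # -1
```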

And here's a quick (and somewhat dirty, in that you have to choose some chaff that can't match the needle) one-liner:

    'foo bar bar bar'.replace('bar', 'XXX', 1).find('bar')

Here's a more Pythonic version of the straightforward iterative solution:

    def find_nth(haystack, needle, n):
        start = haystack.find(needle)
        while start >= 0 and n > 1:
            start = haystack.find(needle, start+len(needle))
            n -= 1
        return start

Example:

    >>> find_nth("foofoofoofoo", "foofoo", 2)
    6

If you want to find the nth overlapping occurrence of needle, you can increment by 1 instead of len(needle), like this:

    def find_nth_overlapping(haystack, needle, n):
        start = haystack.find(needle)
        while start >= 0 and n > 1:
            start = haystack.find(needle, start+1)
            n -= 1
        return start

Example:

    >>> find_nth_overlapping("foofoofoofoo", "foofoo", 2)
    3
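The two variants differ only in how far the search position advances after a hit, so they can share one loop. Here is a possible sketch in Python 3; the names iter_occurrences and nth_occurrence and the overlapping flag are made up for this illustration, and n is assumed to be 1 or greater.

```python
from itertools import islice

def iter_occurrences(haystack, needle, overlapping=False):
    """Yield the start index of each occurrence of needle, left to right."""
    # Advance by at least 1 so that an empty needle cannot loop forever.
    step = 1 if overlapping else max(len(needle), 1)
    start = haystack.find(needle)
    while start >= 0:
        yield start
        start = haystack.find(needle, start + step)

def nth_occurrence(haystack, needle, n, overlapping=False):
    """Return the index of the n-th occurrence (1-based), or -1 if absent."""
    hits = iter_occurrences(haystack, needle, overlapping)
    # islice skips the first n-1 hits lazily; -1 is the default if none is left.
    return next(islice(hits, n - 1, None), -1)

print(nth_occurrence("foofoofoofoo", "foofoo", 2))                    # 6
print(nth_occurrence("foofoofoofoo", "foofoo", 2, overlapping=True))  # 3
```

Because the generator is lazy, the search stops as soon as the requested occurrence has been found.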

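This is easier to read than Mark's version, and it doesn't require the extra memory of the splitting version or importing the regular expression module. It also adheres to a few of the rules in the Zen of Python, unlike the other approaches:

Simple is better than complex.
Flat is better than nested.
Readability counts.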

Understanding that regex is not always the best solution, I'd probably use one here:

    >>> import re
    >>> s = "ababdfegtduab"
    >>> [m.start() for m in re.finditer(r"ab",s)]
    [0, 2, 11]
    >>> [m.start() for m in re.finditer(r"ab",s)][2] #index 2 is third occurrence
    11

This will find the second occurrence of a substring in a string.

    def find_2nd(string, substring):
        return string.find(substring, string.find(substring) + 1)

I'm offering some benchmarking results comparing the most prominent approaches presented so far, namely @bobince's findnth() (based on str.split()) vs. @tgamblin's or @Mark Byers' find_nth() (based on str.find()). I will also compare with a C extension (_find_nth.so) to see how fast we can go. Here is find_nth.py:

    def findnth(haystack, needle, n):
        parts = haystack.split(needle, n+1)
        if len(parts) <= n+1:
            return -1
        return len(haystack) - len(parts[-1]) - len(needle)

    def find_nth(s, x, n=0, overlap=False):
        l = 1 if overlap else len(x)
        i = -l
        for c in xrange(n + 1):  # xrange: this benchmark was run under Python 2
            i = s.find(x, i + l)
            if i < 0:
                break
        return i

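Of course, performance matters most if the string is large, so suppose we want to find the 1000001st newline ('\n') in a 1.3 GB file called 'bigfile'. To save memory, we would like to work on an mmap.mmap object representation of the file:

    In [1]: import _find_nth, find_nth, mmap

    In [2]: f = open('bigfile', 'r')

    In [3]: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)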

There is already a first problem with findnth(), since mmap.mmap objects don't support split(). So we actually have to copy the whole file into memory:

    In [4]: %time s = mm[:]
    CPU times: user 813 ms, sys: 3.25 s, total: 4.06 s
    Wall time: 17.7 s

Ouch! Fortunately s still fits in the 4 GB of memory of my Macbook Air, so let's benchmark findnth():

    In [5]: %timeit find_nth.findnth(s, '\n', 1000000)
    1 loops, best of 3: 29.9 s per loop

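Clearly a terrible performance. Let's see how the approach based on str.find() does:

    In [6]: %timeit find_nth.find_nth(s, '\n', 1000000)
    1 loops, best of 3: 774 ms per loop

Much better! Clearly, findnth()'s problem is that it is forced to copy the string during split(), which is already the second time we copied the 1.3 GB of data around, after s = mm[:]. Here comes the second advantage of find_nth(): we can use it on mm directly, so that zero copies of the file are required:

    In [7]: %timeit find_nth.find_nth(mm, '\n', 1000000)
    1 loops, best of 3: 1.21 s per loop

There appears to be a small performance penalty when operating on mm rather than s, but this illustrates that find_nth() can get us an answer in 1.2 s, compared to findnth()'s total of 47 s.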
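Under Python 3, mmap objects work with bytes, so the same idea needs a file opened in binary mode and a bytes needle. The following is a minimal sketch under those assumptions; it uses a small temporary file as a hypothetical stand-in for 'bigfile', and range instead of xrange.

```python
import mmap
import os
import tempfile

def find_nth(s, x, n=0, overlap=False):
    """Index of occurrence number n (zero-based) of x in s, or -1."""
    step = 1 if overlap else len(x)
    i = -step
    for _ in range(n + 1):
        i = s.find(x, i + step)  # str, bytes and mmap all provide find()
        if i < 0:
            break
    return i

# Hypothetical small stand-in for the large file used in the benchmark.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line0\nline1\nline2\n")
    path = tmp.name

try:
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Search the mapping directly; no copy of the file is made.
            third_newline = find_nth(mm, b"\n", 2)
finally:
    os.remove(path)

print(third_newline)  # 17
```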

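I found no cases where the str.find() based approach was significantly worse than the str.split() based approach, so at this point I would argue that @tgamblin's or @Mark Byers' answer should be accepted instead of @bobince's.

In my testing, the version of find_nth() above was the fastest pure Python solution I could come up with (very similar to @Mark Byers' version). Let's see how much better we can do with a C extension module. Here is _find_nthmodule.c: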

    /* Written against the Python 2 C API (PyString_*, Py_InitModule). */
    #include <Python.h>
    #include <string.h>

    off_t _find_nth(const char *buf, size_t l, char c, int n) {
        off_t i;
        for (i = 0; i < l; ++i) {
            if (buf[i] == c && n-- == 0) {
                return i;
            }
        }
        return -1;
    }

    off_t _find_nth2(const char *buf, size_t l, char c, int n) {
        const char *b = buf - 1;
        do {
            /* search only the bytes remaining after the previous match */
            b = memchr(b + 1, c, l - (size_t)(b + 1 - buf));
            if (!b) return -1;
        } while (n--);
        return b - buf;
    }

    /* mmap_object is private in mmapmodule.c - replicate beginning here */
    typedef struct {
        PyObject_HEAD
        char *data;
        size_t size;
    } mmap_object;

    typedef struct {
        const char *s;
        size_t l;
        char c;
        int n;
    } params;

    int parse_args(PyObject *args, params *P) {
        PyObject *obj;
        const char *x;

        if (!PyArg_ParseTuple(args, "Osi", &obj, &x, &P->n)) {
            return 1;
        }
        PyTypeObject *type = Py_TYPE(obj);

        if (type == &PyString_Type) {
            P->s = PyString_AS_STRING(obj);
            P->l = PyString_GET_SIZE(obj);
        } else if (!strcmp(type->tp_name, "mmap.mmap")) {
            mmap_object *m_obj = (mmap_object*) obj;
            P->s = m_obj->data;
            P->l = m_obj->size;
        } else {
            PyErr_SetString(PyExc_TypeError, "Cannot obtain char * from argument 0");
            return 1;
        }
        P->c = x[0];
        return 0;
    }

    static PyObject* py_find_nth(PyObject *self, PyObject *args) {
        params P;
        if (!parse_args(args, &P)) {
            return Py_BuildValue("i", _find_nth(P.s, P.l, P.c, P.n));
        } else {
            return NULL;
        }
    }

    static PyObject* py_find_nth2(PyObject *self, PyObject *args) {
        params P;
        if (!parse_args(args, &P)) {
            return Py_BuildValue("i", _find_nth2(P.s, P.l, P.c, P.n));
        } else {
            return NULL;
        }
    }

    static PyMethodDef methods[] = {
        {"find_nth", py_find_nth, METH_VARARGS, ""},
        {"find_nth2", py_find_nth2, METH_VARARGS, ""},
        {0}
    };

    PyMODINIT_FUNC init_find_nth(void) {
        Py_InitModule("_find_nth", methods);
    }

Here is the setup.py file:

    from distutils.core import setup, Extension
    module = Extension('_find_nth', sources=['_find_nthmodule.c'])
    setup(ext_modules=[module])

Install as usual with python setup.py install. The C code has an advantage here, since it is limited to finding single characters, but let's see how fast this is:

    In [8]: %timeit _find_nth.find_nth(mm, '\n', 1000000)
    1 loops, best of 3: 218 ms per loop

    In [9]: %timeit _find_nth.find_nth(s, '\n', 1000000)
    1 loops, best of 3: 216 ms per loop

    In [10]: %timeit _find_nth.find_nth2(mm, '\n', 1000000)
    1 loops, best of 3: 307 ms per loop

    In [11]: %timeit _find_nth.find_nth2(s, '\n', 1000000)
    1 loops, best of 3: 304 ms per loop

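Clearly quite a bit faster still. Interestingly, there is no difference at the C level between the in-memory and the mmapped case. It is also interesting to see that _find_nth2(), which is based on string.h's memchr() library function, loses out against the straightforward implementation in _find_nth(): the additional "optimizations" in memchr() are apparently backfiring...

In conclusion, the implementation in findnth() (based on str.split()) is really a bad idea, since (a) it performs terribly for larger strings because of the required copying, and (b) it doesn't work on mmap.mmap objects at all. The implementation in find_nth() (based on str.find()) should be preferred in all circumstances (and should therefore be the accepted answer to this question).

There is still quite a bit of room for improvement, since the C extension ran almost a factor of 4 faster than the pure Python code, which suggests there might be a case for a dedicated Python library function.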
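For anyone who wants to repeat a scaled-down version of this comparison without a 1.3 GB file, something like the following might serve as a starting point. It is only a sketch in Python 3: the synthetic text, its size, and the name findnth_split are assumptions made here, and the timings will of course differ from the ones quoted above.

```python
import timeit

def findnth_split(haystack, needle, n):
    """str.split() based lookup (zero-based n)."""
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

def find_nth(s, x, n=0, overlap=False):
    """str.find() based lookup (zero-based n)."""
    step = 1 if overlap else len(x)
    i = -step
    for _ in range(n + 1):
        i = s.find(x, i + step)
        if i < 0:
            break
    return i

# Synthetic data: 200,000 identical lines (an arbitrary choice).
text = "some line of text\n" * 200_000
n = 100_000

# Both approaches should agree before their speed is compared.
assert findnth_split(text, "\n", n) == find_nth(text, "\n", n)

for func in (findnth_split, find_nth):
    seconds = min(timeit.repeat(lambda: func(text, "\n", n), number=1, repeat=3))
    print(f"{func.__name__}: {seconds:.4f} s")
```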

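I'd probably do something like this, using the find function that takes an index parameter:

    def find_nth(s, x, n):
        i = -len(x)  # so that the first search starts at index 0
        for _ in range(n):
            i = s.find(x, i + len(x))
            if i == -1:
                break
        return i

    print(find_nth('bananabanana', 'an', 3))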

It's not particularly Pythonic I guess, but it's simple. You could do it using recursion instead:

    def find_nth(s, x, n, i = 0):
        i = s.find(x, i)
        if n == 1 or i == -1:
            return i
        else:
            return find_nth(s, x, n - 1, i + len(x))

    print(find_nth('bananabanana', 'an', 3))

It's a functional way to solve it, but I don't know if that makes it more Pythonic.
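One practical caveat with the recursive form: every occurrence that is skipped adds a stack frame, so a large n can run into CPython's recursion limit, while a loop does not have that problem. A small sketch to illustrate this, assuming Python 3 (the function names and the test string are made up here):

```python
import sys

def find_nth_recursive(s, x, n, i=0):
    i = s.find(x, i)
    if n == 1 or i == -1:
        return i
    return find_nth_recursive(s, x, n - 1, i + len(x))

def find_nth_iterative(s, x, n):
    i = -len(x)  # so that the first search starts at index 0
    for _ in range(n):
        i = s.find(x, i + len(x))
        if i == -1:
            break
    return i

# Ask for an occurrence deeper than the interpreter's recursion limit.
deep = sys.getrecursionlimit() + 100
text = "a" * (deep + 1)

print(find_nth_iterative(text, "a", deep) == deep - 1)  # True
try:
    find_nth_recursive(text, "a", deep)
except RecursionError:
    print("recursive version hit the recursion limit")
```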

Simplest way?

    text = "This is a test from a test ok"
    firstTest = text.find('test')
    print(text.find('test', firstTest + 1))

Here's another re + itertools version that should work when searching for either a str or a RegexpObject. I will freely admit that this is likely over-engineered, but for some reason it entertained me.

    import itertools
    import re

    def find_nth(haystack, needle, n = 1):
        """
        Find the starting index of the nth occurrence of ``needle`` in
        ``haystack``.

        If ``needle`` is a ``str``, this will perform an exact substring
        match; if it is a ``RegexpObject``, this will perform a regex
        search.

        If ``needle`` doesn't appear in ``haystack``, return ``-1``. If
        ``needle`` doesn't appear in ``haystack`` ``n`` times,
        return ``-1``.

        Arguments
        ---------
        * ``needle`` the substring (or a ``RegexpObject``) to find
        * ``haystack`` is a ``str``
        * an ``int`` indicating which occurrence to find; defaults to ``1``

        >>> find_nth("foo", "o", 1)
        1
        >>> find_nth("foo", "o", 2)
        2
        >>> find_nth("foo", "o", 3)
        -1
        >>> find_nth("foo", "b")
        -1
        >>> import re
        >>> either_o = re.compile("[oO]")
        >>> find_nth("foo", either_o, 1)
        1
        >>> find_nth("FOO", either_o, 1)
        1
        """
        if (hasattr(needle, 'finditer')):
            matches = needle.finditer(haystack)
        else:
            matches = re.finditer(re.escape(needle), haystack)
        start_here = itertools.dropwhile(lambda x: x[0] < n, enumerate(matches, 1))
        try:
            return next(start_here)[1].start()
        except StopIteration:
            return -1

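Here is another approach using re.finditer.
The difference is that this only looks into the haystack as far as necessary.

    from re import finditer
    from itertools import dropwhile
    needle = 'an'
    haystack = 'bananabanana'
    n = 2  # zero-based here, because enumerate() starts counting at 0
    next(dropwhile(lambda x: x[0] < n, enumerate(finditer(needle, haystack))))[1].start()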
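The enumerate/dropwhile combination can also be expressed with itertools.islice, which skips the first matches without a lambda. Below is a sketch along those lines, not a drop-in replacement: it assumes a literal needle (hence re.escape) and a 1-based n, and the name find_nth_re is made up. As with any re.finditer approach, only non-overlapping matches are counted.

```python
import re
from itertools import islice

def find_nth_re(haystack, needle, n):
    """Return the start of the n-th (1-based) match of the literal needle, or -1."""
    # finditer is lazy, so the haystack is scanned only as far as needed.
    matches = re.finditer(re.escape(needle), haystack)
    match = next(islice(matches, n - 1, None), None)
    return match.start() if match else -1

print(find_nth_re('bananabanana', 'an', 2))  # 3
print(find_nth_re('bananabanana', 'an', 5))  # -1
```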

Another way, scanning the string index by index:

    >>> s = "abcdefabcdefababcdef"
    >>> j = 0
    >>> for n, i in enumerate(s):
    ...     if s[n:n+2] == "ab":
    ...         print(n, i)
    ...         j = j + 1
    ...         if j == 2: print("2nd occurrence at index position:", n)
    ...
    0 a
    6 a
    2nd occurrence at index position: 6
    12 a
    14 a

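This will give you an array of the starting indices for matches in yourstring:

    import re
    indices = [s.start() for s in re.finditer(':', yourstring)]

Then your nth entry would be:

    n = 2
    nth_entry = indices[n-1]

Of course you have to be careful with the index bounds. You can get the number of matches found in yourstring like this:

    num_instances = len(indices)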
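The bounds check mentioned here could be folded into a small helper. This is only one way to do it; the helper name nth_match_start, the sample value of yourstring, and the choice of -1 as the "not found" result are assumptions made for this sketch.

```python
import re

def nth_match_start(pattern, text, n):
    """Start index of the n-th (1-based) match of pattern, or -1 if out of range."""
    indices = [m.start() for m in re.finditer(pattern, text)]
    # Guard both ends: n < 1 would silently index from the end of the list.
    if 1 <= n <= len(indices):
        return indices[n - 1]
    return -1

yourstring = "a:b:c"  # hypothetical sample input
print(nth_match_start(':', yourstring, 2))  # 3
print(nth_match_start(':', yourstring, 3))  # -1
```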

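The replace one-liner is great, but it only works because 'XXX' and 'bar' have the same length.

A good and general def would be:

    def findN(s, sub, N, replaceString="XXX"):
        i = s.replace(sub, replaceString, N-1).find(sub)
        if i == -1:
            return -1  # fewer than N occurrences
        # undo the length change caused by the N-1 replacements
        return i - (len(replaceString) - len(sub)) * (N-1)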

Offering another "tricky" solution, which uses split and join.

In your example, we can use

    len("substring".join([s for s in ori.split("substring")[:2]]))
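The expression works because joining the first two pieces with the separator rebuilds exactly the text that precedes the second occurrence, so its length is the index being looked for. A generalized sketch follows (Python 3; the function name find_nth_split_join is made up, n is 1-based, and -1 for "not found" is an assumption, since the one-liner itself has no such check):

```python
def find_nth_split_join(ori, sub, n):
    """Index of the n-th (1-based) occurrence of sub in ori, or -1."""
    if n < 1:
        raise ValueError("n must be 1 or greater")
    # Splitting at most n times yields n+1 pieces if sub occurs n times.
    parts = ori.split(sub, n)
    if len(parts) <= n:
        return -1
    # The first n pieces joined by sub are the text before the n-th occurrence.
    return len(sub.join(parts[:n]))

print(find_nth_split_join("foo bar bar bar", "bar", 2))  # 8
print(find_nth_split_join("foo bar bar bar", "bar", 4))  # -1
```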

How about:

    import os

    c = os.getcwd().split('\\')
    print('\\'.join(c[0:-2]))

This is the answer you really want:

    def Find(String, ToFind, Occurence = 1):
        index = 0
        count = 0
        while index <= len(String):
            try:
                if String[index:index + len(ToFind)] == ToFind:
                    count += 1
                    if count == Occurence:
                        return index
                index += 1
            except IndexError:
                return False
        return False

Building on modle13's answer, but without the re module dependency.

    def iter_find(haystack, needle):
        return [i for i in range(0, len(haystack)) if haystack[i:].startswith(needle)]

I kinda wish this was a builtin string method.

    >>> iter_find("http://stackoverflow.com/questions/1883980/", '/')
    [5, 6, 24, 34, 42]
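One thing that may matter for long strings: haystack[i:] builds a new string for every index. str.startswith accepts a start position, so the slice can be avoided. A possible variant is sketched below (Python 3; the helper find_nth_from_list is made up, with a 1-based n and -1 for "not found" as assumptions). Like the original, it also reports overlapping occurrences.

```python
def iter_find(haystack, needle):
    # startswith(needle, i) tests at position i without copying the tail.
    return [i for i in range(len(haystack)) if haystack.startswith(needle, i)]

def find_nth_from_list(haystack, needle, n):
    """n-th (1-based) entry of iter_find, or -1 if there are fewer matches."""
    hits = iter_find(haystack, needle)
    return hits[n - 1] if 1 <= n <= len(hits) else -1

print(iter_find("http://stackoverflow.com/questions/1883980/", '/'))
# [5, 6, 24, 34, 42]
print(find_nth_from_list("http://stackoverflow.com/questions/1883980/", '/', 3))
# 24
```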
