How to input a regex in string.replace?
I need some help declaring a regular expression. My input looks like this:
this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. and there are many other lines in the txt files with<[3> such tags </[3>
The desired output is:
this is a paragraph with in between and then there are cases ... where the number ranges from 1-100. and there are many other lines in the txt files with such tags
I tried this:

    #!/usr/bin/python
    import os, sys, re, glob
    for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
        for line in reader:
            line2 = line.replace('<[1> ', '')
            line = line2.replace('</[1> ', '')
            line2 = line.replace('<[1>', '')
            line = line2.replace('</[1>', '')

            print line

I also tried this (but it seems I am using the wrong regex syntax):

    line2 = line.replace('<[*> ', '')
    line = line2.replace('</[*> ', '')
    line2 = line.replace('<[*>', '')
    line = line2.replace('</[*>', '')

I don't want to hard-code the replace for every number from 1 to 99.
This tested snippet should do it:

    import re
    line = re.sub(r"</?\[\d+>", "", line)

Edit: here is a commented version explaining how it works:

    line = re.sub(r"""(?x)  # Use free-spacing mode.
        <     # Match a literal '<'
        /?    # Optionally match a '/'
        \[    # Match a literal '['
        \d+   # Match one or more digits
        >     # Match a literal '>'
        """, "", line)
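Regular expressions are fun! But I strongly recommend spending an hour or two learning the basics. For starters, you need to know which characters are special: the "metacharacters" that need to be escaped (that is, preceded by a backslash; the rules differ inside and outside character classes). There is an excellent online tutorial at www.regular-expressions.info. The time you spend there will pay for itself many times over. Happy regexing!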
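As a small illustration of the escaping rules mentioned above, here is a sketch with made-up sample strings. It shows a metacharacter unescaped, escaped, and inside a character class, plus `re.escape()` for matching a fixed string literally.

```python
import re

text = "a.b axb a*b"

# Outside a character class, '.' is a metacharacter: it matches any character.
any_char = re.findall(r"a.b", text)

# Preceded by a backslash, it matches only a literal dot.
escaped = re.findall(r"a\.b", text)

# Inside a character class, '.' and '*' are ordinary characters.
in_class = re.findall(r"a[.*]b", text)

# re.escape() adds whatever backslashes are needed to match a fixed string.
tag = re.escape("<[1>")
stripped = re.sub(tag, "", "x<[1>y")

print(any_char, escaped, in_class, stripped)
```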
str.replace() only does fixed (literal) replacements. Use re.sub() instead.
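To make the difference concrete, here is a minimal sketch (the sample `line` is made up for illustration). `str.replace()` looks for the literal text `<[*>`, which never occurs, while `re.sub()` interprets its first argument as a pattern.

```python
import re

line = "with<[1> in between</[1> and<[99> more</[99>"

# str.replace() searches for the literal characters '<[*>', which never
# occur in the input, so the string comes back unchanged.
unchanged = line.replace("<[*>", "")

# re.sub() treats its first argument as a regular expression.
cleaned = re.sub(r"</?\[\d+>", "", line)

print(unchanged == line)
print(cleaned)
```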
I would go about it like this (the regex is explained in the comments):

    import re

    # If you need to use the regex more than once it is suggested to compile it.
    pattern = re.compile(r"</{0,}\[\d+>")

    # <\/{0,}\[\d+>
    #
    # Match the character “<” literally «<»
    # Match the character “/” literally «\/{0,}»
    #    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «{0,}»
    # Match the character “[” literally «\[»
    # Match a single digit 0..9 «\d+»
    #    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
    # Match the character “>” literally «>»

    subject = """this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. and there are many other lines in the txt files with<[3> such tags </[3>"""

    result = pattern.sub("", subject)

    print(result)

If you want to learn more about regular expressions, I recommend reading Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan.
The easiest way:

    import re

    txt = 'this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. and there are many other lines in the txt files with<[3> such tags </[3>'

    # Remove every '<...>' span, whatever it contains.
    out = re.sub("(<[^>]+>)", '', txt)
    print(out)
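One thing to keep in mind is that `<[^>]+>` matches any angle-bracketed span, not only the numbered tags from the question. The following comparison is a sketch with a made-up sample string; whether the broader match is acceptable depends on what else appears in your files.

```python
import re

txt = "keep <b>bold</b> but drop<[1> this</[1> tag"

# Broad pattern: strips anything between '<' and '>'.
broad = re.sub(r"<[^>]+>", "", txt)

# Narrow pattern: strips only '<[N>' and '</[N>' where N is one or more digits.
narrow = re.sub(r"</?\[\d+>", "", txt)

print(broad)
print(narrow)
```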
The replace method of string objects does not accept regular expressions, only fixed strings (see the documentation: http://docs.python.org/2/library/stdtypes.html#str.replace).

You have to use the re module:

    import re
    newline = re.sub(r"</?\[[0-9]+>", "", line)
You don't have to use a regular expression (for your example string):

    >>> s
    'this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. \nand there are many other lines in the txt files\nwith<[3> such tags </[3>\n'
    >>> for w in s.split(">"):
    ...     if "<" in w:
    ...         print w.split("<")[0]
    ...
    this is a paragraph with
     in between
     and then there are cases ... where the
     number ranges from 1-100
    . 
    and there are many other lines in the txt files
    with
     such tags 
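The same split-based idea can be wrapped in a small function that returns the cleaned text instead of printing the pieces one by one. This is only a sketch: `strip_tags` is a hypothetical helper name, the sample string is made up, and it assumes `<` and `>` appear in the text only as tag delimiters. Unlike the loop above, it also keeps any text that follows the last tag.

```python
def strip_tags(text):
    """Remove every '<...>' span without using a regular expression."""
    pieces = []
    for chunk in text.split(">"):
        # Keep only the part of each chunk that precedes an opening '<'.
        pieces.append(chunk.split("<")[0])
    return "".join(pieces)


sample = "with<[1> in between</[1> and the<[99> number</[99>."
print(strip_tags(sample))
```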