Regex look-behind without an obvious maximum length in Java
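I always thought that a look-behind assertion in Java's regex API (and in many other languages, for that matter) must have an obvious length. So the STAR and PLUS quantifiers are not allowed inside look-behinds.
The excellent online resource regular-expressions.info seems to confirm some of my assumptions:
"Java takes things a step further by allowing finite repetition. You still cannot use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. Java recognizes the fact that finite repetition can be rewritten as an alternation of strings with different, but fixed lengths. Unfortunately, the JDK 1.4 and 1.5 have some bugs when you use alternation inside lookbehind. These were fixed in JDK 1.6. […]"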
– http://www.regular-expressions.info/lookaround.html
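To make the quoted "finite repetition" concrete, here is a minimal sketch with the question mark and an explicit curly-brace maximum inside a look-behind. The class name, the helper `found`, and the sample input `CaaaB` are made up for illustration.

```java
import java.util.regex.Pattern;

class BoundedLookBehindDemo {
    // Hypothetical helper: compiles the regex and reports whether it matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "CaaaB";
        // Curly braces with an explicit maximum: 'C' followed by at most five 'a's.
        System.out.println(found("(?<=Ca{0,5})B", input));
        // Question marks: 'C', one 'a', then up to two optional 'a's.
        System.out.println(found("(?<=Caa?a?)B", input));
        // Maximum too small for this input: three 'a's separate 'C' from 'B'.
        System.out.println(found("(?<=Ca{0,2})B", input));
    }
}
```

The first two patterns find the B, while the third does not, because its look-behind can reach back at most three characters and the C is four characters away.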
Using the curly braces works as long as the total length of the characters inside the look-behind is smaller than or equal to Integer.MAX_VALUE. So these regexes are valid:

    "(?<=a{0," + (Integer.MAX_VALUE) + "})B"
    "(?<=Ca{0," + (Integer.MAX_VALUE-1) + "})B"
    "(?<=CCa{0," + (Integer.MAX_VALUE-2) + "})B"

But these aren't:

    "(?<=Ca{0," + (Integer.MAX_VALUE) + "})B"
    "(?<=CCa{0," + (Integer.MAX_VALUE-1) + "})B"
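However, I don't understand the following:
When I run a test using the * and + quantifiers inside a look-behind, all goes well (see the output of Test 1 and Test 2).
But when I add a single character at the start of the look-behind from Test 1 and Test 2, it breaks (see the output of Test 3).
Making the greedy * from Test 3 reluctant has no effect; it still breaks (see Test 4).
Here's the test harness: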
    public class Main {

        private static String testFind(String regex, String input) {
            try {
                boolean returned = java.util.regex.Pattern.compile(regex).matcher(input).find();
                return "testFind : Valid -> regex = "+regex+", input = "+input+", returned = "+returned;
            } catch(Exception e) {
                return "testFind : Invalid -> "+regex+", "+e.getMessage();
            }
        }

        private static String testReplaceAll(String regex, String input) {
            try {
                String returned = input.replaceAll(regex, "FOO");
                return "testReplaceAll : Valid -> regex = "+regex+", input = "+input+", returned = "+returned;
            } catch(Exception e) {
                return "testReplaceAll : Invalid -> "+regex+", "+e.getMessage();
            }
        }

        private static String testSplit(String regex, String input) {
            try {
                String[] returned = input.split(regex);
                return "testSplit : Valid -> regex = "+regex+", input = "+input+", returned = "+java.util.Arrays.toString(returned);
            } catch(Exception e) {
                return "testSplit : Invalid -> "+regex+", "+e.getMessage();
            }
        }

        public static void main(String[] args) {
            String[] regexes = {"(?<=a*)B", "(?<=a+)B", "(?<=Ca*)B", "(?<=Ca*?)B"};
            String input = "CaaaaaaaaaaaaaaaBaaaa";
            int test = 0;
            for(String regex : regexes) {
                test++;
                System.out.println("********************** Test "+test+" **********************");
                System.out.println(" "+testFind(regex, input));
                System.out.println(" "+testReplaceAll(regex, input));
                System.out.println(" "+testSplit(regex, input));
                System.out.println();
            }
        }
    }
Output:

    ********************** Test 1 **********************
     testFind : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
     testReplaceAll : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
     testSplit : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]

    ********************** Test 2 **********************
     testFind : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
     testReplaceAll : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
     testSplit : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]

    ********************** Test 3 **********************
     testFind : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
    (?<=Ca*)B
          ^
     testReplaceAll : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
    (?<=Ca*)B
          ^
     testSplit : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
    (?<=Ca*)B
          ^

    ********************** Test 4 **********************
     testFind : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
    (?<=Ca*?)B
           ^
     testReplaceAll : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
    (?<=Ca*?)B
           ^
     testSplit : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
    (?<=Ca*?)B
           ^
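My question may be obvious, but I'll ask it anyway: can anyone explain to me why Tests 1 and 2 are accepted while Tests 3 and 4 throw an exception? I expected all four to be rejected, not half of them to work and half of them to fail.
Thanks.
PS. I'm using Java version 1.6.0_14.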
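Glancing at the source code of Pattern.java reveals that '*' and '+' are implemented as instances of Curly (the object created for the curly-brace operators). So,

    a*

is implemented as

    a{0,0x7FFFFFFF}

and

    a+

is implemented as

    a{1,0x7FFFFFFF}

which is why you see exactly the same behaviour for curlies and stars.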
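One way to picture why a single leading character tips the pattern over the limit is 32-bit overflow. The sketch below is a simplified model, not code from Pattern.java. It assumes the maximum look-behind length is accumulated in plain `int` arithmetic and that a wrapped (negative) total is what gets treated as "no obvious maximum length". The class and method names are hypothetical.

```java
class LookBehindLengthSketch {
    // Hypothetical model of the length check (an assumption, not JDK code):
    // prefixMax = maximum length of whatever precedes the quantified atom,
    // atomMax   = maximum length of the atom itself,
    // repeatMax = upper bound of the quantifier (0x7FFFFFFF for * and +).
    static boolean hasObviousMaxLength(int prefixMax, int atomMax, int repeatMax) {
        int total = atomMax * repeatMax + prefixMax; // int arithmetic wraps silently on overflow
        return total >= prefixMax;                   // a wrapped total ends up below the prefix length
    }

    public static void main(String[] args) {
        int max = Integer.MAX_VALUE;
        System.out.println("(?<=a*)B                " + hasObviousMaxLength(0, 1, max));
        System.out.println("(?<=Ca*)B               " + hasObviousMaxLength(1, 1, max));
        System.out.println("(?<=Ca{0,MAX-1})B       " + hasObviousMaxLength(1, 1, max - 1));
        System.out.println("(?<=CCa{0,MAX-1})B      " + hasObviousMaxLength(2, 1, max - 1));
    }
}
```

Under this model, a total of exactly Integer.MAX_VALUE still passes, while one more character wraps around to a negative number. That lines up with the valid and invalid patterns listed in the question.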
It's a bug: http://bugs.sun.com/view_bug.do?bug_id=6695369
Pattern.compile() is always supposed to throw an exception if it can't determine the maximum possible length of a look-behind match.
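Since whether an unbounded look-behind is accepted may depend on the JDK in use, code that has to behave the same everywhere could try its preferred regex and fall back to an explicitly bounded one. This is only a sketch: the helper `compileWithFallback` is hypothetical, and the upper bound of 100 is an arbitrary assumption that has to be large enough for the real input.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class LookBehindFallback {
    // Hypothetical helper: use the preferred regex if the compiler accepts it,
    // otherwise fall back to a variant with an explicit upper bound.
    static Pattern compileWithFallback(String preferred, String bounded) {
        try {
            return Pattern.compile(preferred);
        } catch (PatternSyntaxException e) {
            // Rejected, for example because the look-behind has no obvious maximum length.
            return Pattern.compile(bounded);
        }
    }

    public static void main(String[] args) {
        // The bound 100 is an assumption; choose one that covers the longest expected run of 'a'.
        Pattern p = compileWithFallback("(?<=Ca*)B", "(?<=Ca{0,100})B");
        System.out.println(p.matcher("CaaaaaaaaaaaaaaaBaaaa").replaceAll("FOO"));
    }
}
```

Using the bounded form directly, without the try/catch, is the simpler choice when a sensible maximum is known in advance.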