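Why does Java allow control characters in its identifiers?
The Mystery
While exploring precisely which characters are permitted in Java identifiers, I stumbled upon something so curious that it seems nearly certain to be a bug.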
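I had expected to find that Java identifiers conform to the requirement that they start with a character that has the Unicode property ID_Start, followed by characters that have the property ID_Continue, with an exception granted for leading underscores and for dollar signs. That did not prove to be the case, and what I found is at extreme variance with that or any other idea of a normal identifier that I have heard of.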
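The expectation above can be probed from inside Java itself. The following is a minimal sketch, not the asker's method: it assumes that `Character.isUnicodeIdentifierStart` is an acceptable stand-in for the Unicode ID_Start property (it is Java's own approximation, not the Unicode data files), and the class and method names are hypothetical.

```java
public class IdStartCheck {
    /**
     * True when Java accepts cp as the first character of an identifier
     * although Character.isUnicodeIdentifierStart (used here as an
     * approximation of ID_Start) rejects it.
     */
    static boolean javaOnlyStart(int cp) {
        return Character.isJavaIdentifierStart(cp)
            && !Character.isUnicodeIdentifierStart(cp);
    }

    public static void main(String[] args) {
        // A letter, the two documented exceptions, and a digit.
        int[] samples = { 'a', '_', '$', '1' };
        for (int cp : samples) {
            System.out.printf("U+%04X javaStart=%b unicodeStart=%b%n",
                cp,
                Character.isJavaIdentifierStart(cp),
                Character.isUnicodeIdentifierStart(cp));
        }
    }
}
```

The underscore and the dollar sign are the two cases where the Java check should accept a character that the Unicode-style check does not; the control characters discussed below concern the "part" checks rather than the "start" checks.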
Short Demo
Consider the following demonstration, which proves that an ASCII ESC character (octal 033) is permitted in Java identifiers:
    $ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: \033"; System.out.println(var_\033); }})' > escape.java
    $ javac escape.java
    $ java escape | cat -v
    i am escape: ^[
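The same experiment can be written without Perl by spelling the ESC character as a Unicode escape in the Java source, since the compiler translates such escapes before it tokenizes the file. This is a sketch under that assumption; the class name `EscapeDemo` and the method `demo` are made up for illustration.

```java
public class EscapeDemo {
    /** Declares and reads a local variable whose name ends in ESC. */
    static String demo() {
        // The identifier below is "var_" followed by the escaped ESC character.
        String var_\u001B = "i am escape";
        return var_\u001B;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If the compiler accepts the ESC character the way the shell session above suggests, this file compiles and runs like any other program.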
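It gets worse, though. Almost infinitely worse, in fact. Even NULLs are permitted! So are thousands of other code points that are not even identifier characters. I have tested this on Solaris, Linux, and a Mac running Darwin, and all give the same results.
Long Demo
Here is a test program that shows all of these unexpected code points that Java allows as part of a legal identifier name.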
    #!/usr/bin/env perl
    #
    # test-java-idchars - find which bogus code points Java allows in its identifiers
    #
    #   usage: test-java-idchars [low high]
    #      eg: test-java-idchars 0 255
    #
    # Without arguments, tests Unicode code points
    # from 0 .. 0x1000.  You may go further with a
    # higher explicit argument.
    #
    # Produces a report at the end.
    #
    # You can ^C it prematurely to end the program then
    # and get a report of its progress up to that point.
    #
    # Tom Christiansen
    # tchrist@perl.com
    # Sat Jan 29 10:41:09 MST 2011

    use strict;
    use warnings;

    use encoding "Latin1";
    use open IO => ":utf8";

    use charnames ();

    $| = 1;

    my @legal;

    my ($start, $stop) = (0, 0x1000);

    if (@ARGV != 0) {
        if (@ARGV == 1) {
            for (($stop) = @ARGV) {
                $_ = oct if /^0/;   # support 0OCTAL, 0xHEX, 0bBINARY
            }
        }
        elsif (@ARGV == 2) {
            for (($start, $stop) = @ARGV) {
                $_ = oct if /^0/;
            }
        }
        else {
            die "usage: $0 [ [start] stop ]\n";
        }
    }

    for my $cp ( $start .. $stop ) {
        my $char = chr($cp);

        next if $char =~ /[\s\w]/;

        my $type = "?";
        for ($char) {
            $type = "Letter"      if /\pL/;
            $type = "Mark"        if /\pM/;
            $type = "Number"      if /\pN/;
            $type = "Punctuation" if /\pP/;
            $type = "Symbol"      if /\pS/;
            $type = "Separator"   if /\pZ/;
            $type = "Control"     if /\pC/;
        }
        my $name = $cp ? (charnames::viacode($cp) || "<missing>") : "NULL";
        next if $name eq "<missing>" && $cp > 0xFF;
        my $msg = sprintf("U+%04X %s", $cp, $name);
        print "testing \\p{$type} $msg...";
        open(TESTPROGRAM, ">:utf8", "testchar.java") || die $!;

    print TESTPROGRAM <<"End_of_Java_Program";

    public class testchar {
        public static void main(String argv[]) {
            String var_$char = "variable name ends in $msg";
            System.out.println(var_$char);
        }
    }

    End_of_Java_Program

        close(TESTPROGRAM) || die $!;

        system q{
            ( javac -encoding UTF-8 testchar.java \
                && \
              java -Dfile.encoding=UTF-8 testchar | grep variable \
            ) >/dev/null 2>&1
        };

        push @legal, sprintf("U+%04X", $cp) if $? == 0;

        if ($? && $? < 128) {
            print "<interrupted>\n";
            exit;  # from a ^C
        }

        printf "is %s in Java identifiers.\n",
            ($? == 0) ? uc "legal" : "forbidden";

    }

    END {
        print "Legal but evil code points: @legal\n";
    }
Here is an example of running that program on just the first 33 code points, skipping those that are whitespace or identifier characters:
    $ perl test-java-idchars 0 0x20
    testing \p{Control} U+0000 NULL...is LEGAL in Java identifiers.
    testing \p{Control} U+0001 START OF HEADING...is LEGAL in Java identifiers.
    testing \p{Control} U+0002 START OF TEXT...is LEGAL in Java identifiers.
    testing \p{Control} U+0003 END OF TEXT...is LEGAL in Java identifiers.
    testing \p{Control} U+0004 END OF TRANSMISSION...is LEGAL in Java identifiers.
    testing \p{Control} U+0005 ENQUIRY...is LEGAL in Java identifiers.
    testing \p{Control} U+0006 ACKNOWLEDGE...is LEGAL in Java identifiers.
    testing \p{Control} U+0007 BELL...is LEGAL in Java identifiers.
    testing \p{Control} U+0008 BACKSPACE...is LEGAL in Java identifiers.
    testing \p{Control} U+000B LINE TABULATION...is forbidden in Java identifiers.
    testing \p{Control} U+000E SHIFT OUT...is LEGAL in Java identifiers.
    testing \p{Control} U+000F SHIFT IN...is LEGAL in Java identifiers.
    testing \p{Control} U+0010 DATA LINK ESCAPE...is LEGAL in Java identifiers.
    testing \p{Control} U+0011 DEVICE CONTROL ONE...is LEGAL in Java identifiers.
    testing \p{Control} U+0012 DEVICE CONTROL TWO...is LEGAL in Java identifiers.
    testing \p{Control} U+0013 DEVICE CONTROL THREE...is LEGAL in Java identifiers.
    testing \p{Control} U+0014 DEVICE CONTROL FOUR...is LEGAL in Java identifiers.
    testing \p{Control} U+0015 NEGATIVE ACKNOWLEDGE...is LEGAL in Java identifiers.
    testing \p{Control} U+0016 SYNCHRONOUS IDLE...is LEGAL in Java identifiers.
    testing \p{Control} U+0017 END OF TRANSMISSION BLOCK...is LEGAL in Java identifiers.
    testing \p{Control} U+0018 CANCEL...is LEGAL in Java identifiers.
    testing \p{Control} U+0019 END OF MEDIUM...is LEGAL in Java identifiers.
    testing \p{Control} U+001A SUBSTITUTE...is LEGAL in Java identifiers.
    testing \p{Control} U+001B ESCAPE...is LEGAL in Java identifiers.
    testing \p{Control} U+001C INFORMATION SEPARATOR FOUR...is forbidden in Java identifiers.
    testing \p{Control} U+001D INFORMATION SEPARATOR THREE...is forbidden in Java identifiers.
    testing \p{Control} U+001E INFORMATION SEPARATOR TWO...is forbidden in Java identifiers.
    testing \p{Control} U+001F INFORMATION SEPARATOR ONE...is forbidden in Java identifiers.
    Legal but evil code points: U+0000 U+0001 U+0002 U+0003 U+0004 U+0005 U+0006 U+0007 U+0008 U+000E U+000F U+0010 U+0011 U+0012 U+0013 U+0014 U+0015 U+0016 U+0017 U+0018 U+0019 U+001A U+001B
And here is another demo:
    $ perl test-java-idchars 0x600 0x700 | grep -i legal
    testing \p{Control} U+0600 ARABIC NUMBER SIGN...is LEGAL in Java identifiers.
    testing \p{Control} U+0601 ARABIC SIGN SANAH...is LEGAL in Java identifiers.
    testing \p{Control} U+0602 ARABIC FOOTNOTE MARKER...is LEGAL in Java identifiers.
    testing \p{Control} U+0603 ARABIC SIGN SAFHA...is LEGAL in Java identifiers.
    testing \p{Control} U+06DD ARABIC END OF AYAH...is LEGAL in Java identifiers.
    Legal but evil code points: U+0600 U+0601 U+0602 U+0603 U+06DD
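The Question
Can anyone please explain this seemingly insane behavior? There are many, many other inexplicably allowed code points all over the place, starting with U+0000, which is perhaps the strangest of all. If you run the program on the first 0x1000 code points, you do see certain patterns appear, such as allowing any and all code points with the property Currency_Symbol. But too much else is altogether inexplicable, at least to me.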
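The currency-symbol pattern can be checked directly against the `Character` API rather than by compiling test programs. The sketch below is only illustrative: it relies on whatever Unicode tables the running JDK ships, and `CurrencyScan` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class CurrencyScan {
    /** Code points in [low, high] whose general category is Currency_Symbol (Sc). */
    static List<Integer> currencySymbols(int low, int high) {
        List<Integer> found = new ArrayList<>();
        for (int cp = low; cp <= high; cp++) {
            if (Character.getType(cp) == Character.CURRENCY_SYMBOL) {
                found.add(cp);
            }
        }
        return found;
    }

    /** Currency symbols in [low, high] that Java refuses as an identifier start. */
    static List<Integer> rejectedCurrencySymbols(int low, int high) {
        List<Integer> rejected = new ArrayList<>();
        for (int cp : currencySymbols(low, high)) {
            if (!Character.isJavaIdentifierStart(cp)) {
                rejected.add(cp);
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        for (int cp : currencySymbols(0, 0x1000)) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
        System.out.println("rejected: " + rejectedCurrencySymbols(0, 0x1000));
    }
}
```

The documentation for `Character.isJavaIdentifierStart` lists currency symbols among the accepted characters, so the rejected list is expected to come back empty.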
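Section 3.8 of the Java Language Specification defers to Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart(). The latter, among other conditions, accepts any character for which Character.isIdentifierIgnorable() is true, and that method allows non-whitespace control characters (including the whole C1 range; see the method's Javadoc for the list) as well as characters in the Format general category. That would account for both the control characters and the Arabic signs reported as legal above.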
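The three `Character` methods named in this answer can be called directly to reproduce the shape of the question's results. This is a rough sketch rather than a replacement for the compile-and-run test; `IdentifierProbe` and `ignorableInRange` are hypothetical names, and the results depend on the Unicode data of the JDK in use.

```java
import java.util.ArrayList;
import java.util.List;

public class IdentifierProbe {
    /** Code points in [low, high] for which Character.isIdentifierIgnorable is true. */
    static List<Integer> ignorableInRange(int low, int high) {
        List<Integer> result = new ArrayList<>();
        for (int cp = low; cp <= high; cp++) {
            if (Character.isIdentifierIgnorable(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same range as the first run in the question: U+0000 through U+0020.
        for (int cp = 0; cp <= 0x20; cp++) {
            System.out.printf("U+%04X part=%b ignorable=%b whitespace=%b%n",
                cp,
                Character.isJavaIdentifierPart(cp),
                Character.isIdentifierIgnorable(cp),
                Character.isWhitespace(cp));
        }
        // The Arabic signs from the second run are format characters.
        System.out.println("U+0600 is a format character: "
            + (Character.getType(0x0600) == Character.FORMAT));
    }
}
```

On a JDK that follows the documented ranges, the ignorable code points between U+0000 and U+0020 should be the same 23 that the question's test program reported as LEGAL, while the forbidden ones (U+000B and U+001C through U+001F) are the ones Java treats as whitespace.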
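Another question might be: why shouldn't Java allow control characters in its identifiers?
A good principle when designing a language or any other system is not to forbid anything without a good reason, because you never know how it may be used, and the fewer rules that implementers and users have to contend with, the better.
It is true that you certainly should not take advantage of this by actually embedding escapes in your variable names, and you will not see any popular libraries exposing classes with null characters in their names.
Certainly this can be abused, but it is not the language designer's job to protect programmers from themselves in this way, any more than by forcing proper indentation or well-chosen variable names.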
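I don't see what the big deal is. How does it affect you, anyway?
If a developer wants to obfuscate his code, he can do it with ASCII.
If a developer wants to make his code understandable, he will use the lingua franca of the industry: English. Identifiers will then not only be ASCII-only but also drawn from common English words. Otherwise, nobody is going to use or read his code, and he can use whatever crazy characters he likes.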