从Javastring中去除所有不可打印字符的最快方法

什么是从Java中的String中去除所有不可打印字符的最快方法？

到目前为止，我已经尝试和测量了138字节，131字符的string：

String的replaceAll() – 最慢的方法
- 517009结果/秒
预编译模式，然后使用匹配器的replaceAll()
- 637836结果/秒
使用StringBuffer，使用codepointAt()逐个获取代码codepointAt()并追加到StringBuffer
- 711946结果/秒
使用StringBuffer，使用charAt()逐个获取字符并追加到StringBuffer
- 1052964结果/秒
预先分配一个char[]缓冲区，使用charAt()逐个获取字符并填充此缓冲区，然后转换回string
- 2022653结果/秒
预先分配2个char[]缓冲区 – 旧的和新的，使用getChars()立即获取现有string的所有字符， getChars()迭代旧缓冲区并填充新缓冲区，然后将新缓冲区转换为string – 我自己的最快版本
- 每秒2502502个结果
与2缓冲区相同的东西 – 只使用byte[] ， getBytes()和指定编码为“utf-8”
- 857485结果/秒
与2 byte[]缓冲区相同的东西，但指定编码作为一个常量Charset.forName("utf-8")
- 791076结果/秒
与2 byte[]缓冲区相同的东西，但指定编码为1字节的本地编码（只是一个理智的事情）
- 370164结果/秒

我最好的尝试是以下几点：

  char[] oldChars = new char[s.length()]; s.getChars(0, s.length(), oldChars, 0); char[] newChars = new char[s.length()]; int newLen = 0; for (int j = 0; j < s.length(); j++) { char ch = oldChars[j]; if (ch >= ' ') { newChars[newLen] = ch; newLen++; } } s = new String(newChars, 0, newLen);

任何想法如何使其更快？

回答一个非常奇怪的问题的奖励点：为什么使用“utf-8”字符集名称直接产生比使用预先分配的静态常量Charset.forName("utf-8")更好的性能？

更新

从棘轮怪胎的build议产生令人印象深刻的3105590结果/秒的performance，+ 24％的改善！
Ed Staub的build议产生了另一个改进 – 每秒3471017个结果，比之前的最佳结果增加了12％。

更新2

我尽我所能收集了所有提出的解决scheme及其交叉突变，并将其作为github上的一个小型基准testing框架发布。目前它运行17个algorithm。其中之一是“特殊的” –Voo1algorithm（由SO用户Voo提供）采用复杂的reflection技巧，从而达到恒定的速度，但它扰乱了JVMstring的状态，因此它是分开的基准。

欢迎您查看并运行它，以确定您的盒子上的结果。以下是对我的结果的总结。它的规格：

Debian sid
Linux 2.6.39-2-amd64（x86_64）
从sun-java6-jdk-6.24-1软件包安装的Java，JVM将自己标识为
- Java（TM）SE运行时环境（build 1.6.0_24-b07）
- Java HotSpot（TM）64位服务器虚拟机（构build19.1-b02，混合模式）

给定一组不同的input数据，不同的algorithm会显示最终的不同结果。我已经跑了3种模式的基准：

同一个string

此模式在StringSource类提供的同一个单一string上作为常量运行。摊牌是：

  Ops / s│algorithm
 ──────────┼──────────────────────────────
 6 535 947│Voo1
 ──────────┼──────────────────────────────
 5 350 454│RatchetFreak2EdStaub1GreyCat1
 5 249 343│EdStaub1
 5 002 501│EdStaub1GreyCat1
 4 859 086│ArrayOfCharFromStringCharAt
 4 295 532│RatchetFreak1
 4 045 307│ArrayOfCharFromArrayOfChar
 2 790 178│RatchetFreak2EdStaub1GreyCat2
 2 583 311│RatchetFreak2
 1 274 859│StringBuilderChar
 1 138 174│StringBuilderCodePoint
   994 727│ArrayOfByteUTF8String
   918 611│ArrayOfByteUTF8Const
   756 086│MatcherReplace
   598 945│StringReplaceAll
   460 045│ArrayOfByteWindows1251

以图表forms：相同的单个string图表http://www.greycat.ru/img/os-chart-single.png

多个string，100％的string包含控制字符

源string提供程序使用（0..127）字符集预先生成大量随机string，因此几乎所有string都至less包含一个控制字符。 algorithm以循环方式从这个预先生成的数组中接收string。

  Ops / s│algorithm
 ──────────┼──────────────────────────────
 2 123 142│Voo1
 ──────────┼──────────────────────────────
 1 782 214│EdStaub1
 1 776 199│EdStaub1GreyCat1
 1 694 628│ArrayOfCharFromStringCharAt
 1 481 481│ArrayOfCharFromArrayOfChar
 1 460 067│RatchetFreak2EdStaub1GreyCat1
 1 438 435│RatchetFreak2EdStaub1GreyCat2
 1 366 494│RatchetFreak2
 1 349 710│RatchetFreak1
   893 176│ArrayOfByteUTF8String
   817 127│ArrayOfByteUTF8Const
   778 089│StringBuilderChar
   734 754│StringBuilderCodePoint
   377 829│ArrayOfByteWindows1251
   224 140│MatcherReplace
   211 104│StringReplaceAll

在图表forms：多个string，100％的集中http://www.greycat.ru/img/os-chart-multi100.png

多个string，1％的string包含控制字符

与以前相同，只有1％的string是由控制字符生成的，其他的99％是在使用[32..127]字符集时生成的，所以它们根本不能包含控制字符。这个综合负载在我的地方最接近这个algorithm的实际应用。

  Ops / s│algorithm
 ──────────┼──────────────────────────────
 3 711 952│Voo1
 ──────────┼──────────────────────────────
 2 851 440│EdStaub1GreyCat1
 2 455 796│EdStaub1
 2 426 007│ArrayOfCharFromStringCharAt
 2 347 969│RatchetFreak2EdStaub1GreyCat2
 2 242 152│RatchetFreak1
 2 171 553│ArrayOfCharFromArrayOfChar
 1 922 707│RatchetFreak2EdStaub1GreyCat1
 1 857 010│RatchetFreak2
 1 023 751│ArrayOfByteUTF8String
   939 055│StringBuilderChar
   907 194│ArrayOfByteUTF8Const
   841 963│StringBuilderCodePoint
   606 465│MatcherReplace
   501 555│StringReplaceAll
   381 185│ArrayOfByteWindows1251

以图表的forms：多个string，1％的集中http://www.greycat.ru/img/os-chart-multi1.png

我很难决定谁提供了最好的答案，但考虑到现实世界的应用，最好的解决scheme是由Ed Staub给予/启发的，我想这是公平的标记他的答案。感谢所有参与此事的人，您的意见非常有帮助和宝贵。随意在你的机器上运行testing套件，并提出更好的解决scheme（工作JNI解决scheme，任何人？）。

参考

GitHub仓库与基准testing套件

如果将此方法embedded到不在线程中共享的类中是合理的，则可以重新使用缓冲区：

 char [] oldChars = new char[5]; String stripControlChars(String s) { final int inputLen = s.length(); if ( oldChars.length < inputLen ) { oldChars = new char[inputLen]; } s.getChars(0, inputLen, oldChars, 0);

等等…

这是一个很大的胜利 – 20％左右，因为我了解当前最好的情况。

如果要在潜在的大型string上使用这个内存，并且需要考虑内存“泄漏”，则可以使用弱引用。

使用1个字符数组可能会更好一些

 int length = s.length(); char[] oldChars = new char[length]; s.getChars(0, length, oldChars, 0); int newLen = 0; for (int j = 0; j < length; j++) { char ch = oldChars[j]; if (ch >= ' ') { oldChars[newLen] = ch; newLen++; } } s = new String(oldChars, 0, newLen);

我避免了重复调用s.length();

另一个可能起作用的微型优化是

 int length = s.length(); char[] oldChars = new char[length+1]; s.getChars(0, length, oldChars, 0); oldChars[length]='\0';//avoiding explicit bound check in while int newLen=-1; while(oldChars[++newLen]>=' ');//find first non-printable, // if there are none it ends on the null char I appended for (int j = newLen; j < length; j++) { char ch = oldChars[j]; if (ch >= ' ') { oldChars[newLen] = ch;//the while avoids repeated overwriting here when newLen==j newLen++; } } s = new String(oldChars, 0, newLen);

那么我已经打败了目前最好的方法（怪胎的解决scheme与预分配arrays）约30％根据我的措施。怎么样？通过卖我的灵魂。

我敢肯定，到目前为止，所有跟随讨论的人都知道这违反了几乎任何基本的编程原则，但是噢。不pipe怎样，如果string的使用字符数组不是其他string之间共享的话，那么以下方法才有效 – 如果它确实需要debugging的话，那么所有权利都将决定杀死你（没有调用substring（）并在string上使用这个string这应该工作，因为我不明白为什么JVM会实习独特的string从外部来源读取）。虽然不要忘记确保基准代码不会这样做 – 这是极有可能的，而且会明显地帮助reflection解决scheme。

无论如何，我们去：

  // Has to be done only once - so cache those! Prohibitively expensive otherwise private Field value; private Field offset; private Field count; private Field hash; { try { value = String.class.getDeclaredField("value"); value.setAccessible(true); offset = String.class.getDeclaredField("offset"); offset.setAccessible(true); count = String.class.getDeclaredField("count"); count.setAccessible(true); hash = String.class.getDeclaredField("hash"); hash.setAccessible(true); } catch (NoSuchFieldException e) { throw new RuntimeException(); } } @Override public String strip(final String old) { final int length = old.length(); char[] chars = null; int off = 0; try { chars = (char[]) value.get(old); off = offset.getInt(old); } catch(IllegalArgumentException e) { throw new RuntimeException(e); } catch(IllegalAccessException e) { throw new RuntimeException(e); } int newLen = off; for(int j = off; j < off + length; j++) { final char ch = chars[j]; if (ch >= ' ') { chars[newLen] = ch; newLen++; } } if (newLen - off != length) { // We changed the internal state of the string, so at least // be friendly enough to correct it. try { count.setInt(old, newLen - off); // Have to recompute hash later on hash.setInt(old, 0); } catch(IllegalArgumentException e) { e.printStackTrace(); } catch(IllegalAccessException e) { e.printStackTrace(); } } // Well we have to return something return old; }

对于我的testingstring，得到3477148.18ops/s与旧版本的2616120.89ops/s 。我相当确信唯一的方法就是把它写成C语言（可能不是），或者到目前为止还没有人想到的完全不同的方法。虽然我绝对不确定在不同平台上的时序是否稳定 – 至less在我的机器上（Java7，Win7 x64）产生可靠的结果。

您可以将任务分成几个并行的子任务，具体取决于处理器的数量。

IANA低级别的Java性能瘾君子，但你有没有试过展开你的主循环？看来它可能允许一些CPU并行执行检查。

此外，这有一些有趣的想法优化。

我很自由，为不同的algorithm写了一个小的基准。这并不完美，但是我使用一个随机string（默认情况下大约32/200％的非printable表示）对给定algorithm运行至less1000次10000次。这应该照顾像GC，初始化等东西 – 没有太多的开销，任何algorithm不应该有至less有一个没有多less障碍的运行。

没有特别好logging，但哦。在这里我们去了 – 我包括棘轮怪胎的algorithm和基本版本。目前我随机初始化一个长度为200个字符的string，其字符的范围是[0，200]。

为什么使用“utf-8”字符集名称直接比使用预分配的静态常量Charset.forName（“utf-8”）产生更好的性能？

如果你的意思是String#getBytes("utf-8")等等：这应该不会更快 – 除了一些更好的caching – 因为Charset.forName("utf-8")在内部使用，如果charset没有被caching。

有一件事可能是你正在使用不同的字符集（或者你的代码可能透明），但在StringCoding中caching的字符集不会改变。

从Javastring中去除所有不可打印字符的最快方法

更新

更新2

同一个string

多个string，100％的string包含控制字符

多个string，1％的string包含控制字符

参考

知道C实际上会伤害你在高级语言中编写的代码吗？

加快茱莉亚写得不好的R例子

并行化：pthreads还是OpenMP？

为什么在遍历2D数组时，循环的顺序会影响性能？

使string小写的第一个字符最有效的方法？

有什么情况下最好使用无条件AND（而不是&&）

algorithm获得一个数字的Excel类列名称

expression式“j = ++（i | i）; j = ++（i＆i）;应该是一个左值错误？

在x86上执行水平浮点vector和的最快方法

Minify ASP.NET应用程序的Html输出