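Is there a way to get rid of accents and convert a whole string to regular letters?

Apart from using the String.replaceAll() method and replacing letters one by one, is there a better way to get rid of accents and turn those letters into regular ones? Example:

Input: orčpžsíáýd

Output: orcpzsiayd

It doesn't need to cover every accented letter, such as those of the Russian or Chinese alphabets.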

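Use java.text.Normalizer to handle this for you.

    string = Normalizer.normalize(string, Normalizer.Form.NFD);

This separates all of the accent marks from the characters. Then you just need to check each character and throw out the ones that are not plain letters.

    string = string.replaceAll("[^\\p{ASCII}]", "");

If your text is in Unicode, you should use this instead:

    string = string.replaceAll("\\p{M}", "");

For Unicode, \\P{M} matches the base glyph and \\p{M} (lowercase) matches each accent.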
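Putting the two steps together, here is a minimal sketch. The class and method names are made up for illustration, and the input is written with \u escapes only so that the result does not depend on the source file's encoding.

```java
import java.text.Normalizer;

final class AccentStripper {
    // Step 1: decompose each accented letter into base letter + combining marks (NFD).
    // Step 2: delete every combining mark with the \p{M} character class.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // The example input from the question, written with escapes.
        String input = "or\u010dp\u017es\u00ed\u00e1\u00fdd";
        System.out.println(stripAccents(input)); // prints orcpzsiayd
    }
}
```

Letters that have no decomposition are left untouched by this approach, because nothing is split off for \p{M} to remove.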

Thanks to GarretWilson for the pointer and to regular-expressions.info for the great Unicode guide.

As of 2011 you can use Apache Commons StringUtils.stripAccents(input) (available since 3.0):

    String input = StringUtils.stripAccents("Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ");
    System.out.println(input);
    // Prints "This is a funky String"

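Note:

The accepted answer (Erick Robertson's) doesn't work for Ø or Ł. Apache Commons 3.5 doesn't work for Ø either, but it does work for Ł. After reading the Wikipedia article on Ø, I'm not sure it should be replaced with "O": it is a separate letter in Norwegian and Danish, alphabetized after "z". It's a good example of the limitations of the "strip accents" approach.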
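A small sketch of that limitation: Ø and Ł have no canonical decomposition, so NFD leaves them as they are and there is no mark to strip. The fallback table below is hypothetical and deliberately tiny; whether mapping Ø to "O" is acceptable depends on the language, as noted above.

```java
import java.text.Normalizer;

final class StrokeLetterDemo {
    // NFD + mark removal, as in the accepted answer.
    static String stripMarks(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Hypothetical hand-written fallback for letters that NFD cannot split.
    static String fallback(char c) {
        switch (c) {
            case '\u00d8': return "O"; // Ø
            case '\u00f8': return "o"; // ø
            case '\u0141': return "L"; // Ł
            case '\u0142': return "l"; // ł
            default: return String.valueOf(c);
        }
    }

    // Strip marks first, then apply the fallback table character by character.
    static String stripWithFallback(String input) {
        String stripped = stripMarks(input);
        StringBuilder sb = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            sb.append(fallback(stripped.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripMarks("\u00d8\u0141"));                // unchanged: ØŁ
        System.out.println(stripWithFallback("\u0141\u00f3d\u017a")); // Łódź -> Lodz
    }
}
```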

The solution by @virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered how much of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex:

    import java.text.Normalizer;

    public class Strip {
        public static String flattenToAscii(String string) {
            StringBuilder sb = new StringBuilder(string.length());
            string = Normalizer.normalize(string, Normalizer.Form.NFD);
            for (char c : string.toCharArray()) {
                if (c <= '\u007F') sb.append(c);
            }
            return sb.toString();
        }
    }

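A small additional speed-up can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure the loss of code clarity is worth it:

    public static String flattenToAscii(String string) {
        char[] out = new char[string.length()];
        string = Normalizer.normalize(string, Normalizer.Form.NFD);
        int j = 0;
        for (int i = 0, n = string.length(); i < n; ++i) {
            char c = string.charAt(i);
            if (c <= '\u007F') out[j++] = c;
        }
        // Copy only the j chars actually written; new String(out) would keep the unused NUL tail.
        return new String(out, 0, j);
    }

This variation has the correctness of the version using Normalizer and some of the speed of the version using a table. On my machine, it is about 4x faster than the accepted answer and 6.6x to 7x slower than @virgo47's (the accepted answer is about 26x slower than @virgo47's on my machine).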

EDIT: If you're not stuck with Java <6, and speed is not critical and/or the translation table is too limiting, use David's answer. The point is to use Normalizer (introduced in Java 6) instead of a translation table inside the loop.

While this is not a "perfect" solution, it works well when you know the range (in our case Latin1,2), it worked before Java 6 (not a real issue, though), and it is much faster than the most suggested version (which may or may not be an issue):

    /**
     * Mirror of the unicode table from 00c0 to 017f without diacritics.
     */
    private static final String tab00c0 = "AAAAAAACEEEEIIII" +
        "DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
        "aaaaaaaceeeeiiii" +
        "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
        "AaAaAaCcCcCcCcDd" +
        "DdEeEeEeEeEeGgGg" +
        "GgGgHhHhIiIiIiIi" +
        "IiJjJjKkkLlLlLlL" +
        "lLlNnNnNnnNnOoOo" +
        "OoOoRrRrRrSsSsSs" +
        "SsTtTtTtUuUuUuUu" +
        "UuUuWwYyYZzZzZzF";

    /**
     * Returns string without diacritics - 7 bit approximation.
     *
     * @param source string to convert
     * @return corresponding string without diacritics
     */
    public static String removeDiacritic(String source) {
        char[] vysl = new char[source.length()];
        char one;
        for (int i = 0; i < source.length(); i++) {
            one = source.charAt(i);
            if (one >= '\u00c0' && one <= '\u017f') {
                one = tab00c0.charAt((int) one - '\u00c0');
            }
            vysl[i] = one;
        }
        return new String(vysl);
    }

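Tests on my hardware with a 32-bit JDK show that this performs the test conversions in about 100 ms, while the Normalizer approach takes 3.7 s (37x slower). If performance matters to you and you know the input range, this may be the one for you.

Enjoy :-)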

    System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));

worked for me. The output of the snippet above is "aee", which is what I wanted, but

    System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""));

didn't do any substitution.
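To see where the two patterns differ, here is a small sketch (class and method names are made up for illustration). The input is written with \u escapes so the result does not depend on how the source file is encoded; that encoding could be one reason for the behaviour reported above, but that is only a guess.

```java
import java.text.Normalizer;

final class PatternComparison {
    // Removes only characters from the Combining Diacritical Marks block.
    static String stripMarksBlock(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Removes every non-ASCII character, marks and letters alike.
    static String stripNonAscii(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        String accented = "\u00e0\u00e8\u00e9"; // àèé
        String withAe = "\u00e6on";             // æon
        System.out.println(stripMarksBlock(accented)); // aee
        System.out.println(stripNonAscii(accented));   // aee
        System.out.println(stripMarksBlock(withAe));   // æon (the letter is kept)
        System.out.println(stripNonAscii(withAe));     // on  (the letter is dropped)
    }
}
```

With the escaped input both methods return aee. They differ on a letter such as æ, which has no decomposition: the block-based pattern keeps it, while the ASCII pattern deletes it.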

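Depending on the language, these might not be considered accents (which change the sound of the letter), but diacritical marks.

https://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacritics

"Bosnian and Croatian have the symbols č, ć, đ, š and ž, which are considered separate letters and are listed as such in dictionaries and other contexts in which words are listed according to alphabetical order."

Removing them might inherently change the meaning of the word, or turn the letters into completely different ones.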

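@David Conrad's solution is the fastest I tried that uses Normalizer, but it does have a bug. It strips characters that are not accents: for example, Chinese characters and other letters such as æ are all removed. The characters we want to strip are non-spacing marks, characters that don't take up extra width in the final string. These zero-width characters end up combined with some other character. If you can see one isolated as a character, for example like this `, my guess is that it is combined with the space character.

    public static String flattenToAscii(String string) {
        String norm = Normalizer.normalize(string, Normalizer.Form.NFD);
        // Size the buffer from the normalized string: NFD output can be longer than the input.
        char[] out = new char[norm.length()];
        int j = 0;
        for (int i = 0, n = norm.length(); i < n; ++i) {
            char c = norm.charAt(i);
            int type = Character.getType(c);
            //Log.d(TAG,""+c);
            //by Ricardo, modified the character check for accents, ref: http://stackoverflow.com/a/5697575/689223
            if (type != Character.NON_SPACING_MARK){
                out[j] = c;
                j++;
            }
        }
        //Log.d(TAG,"normalized string:"+norm+"/"+new String(out));
        // Return only the j chars actually written, not the unused tail of the buffer.
        return new String(out, 0, j);
    }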
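As a quick check of which inputs survive, here is a code-point-based sketch of the same idea (the class name is made up, and it is a variant for illustration rather than the code above). Only characters whose type is NON_SPACING_MARK are dropped after NFD; æ and a Chinese character pass through.

```java
import java.text.Normalizer;

final class MarkTypeDemo {
    // Keeps every code point except non-spacing marks (Unicode category Mn).
    static String stripNonSpacingMarks(String input) {
        String norm = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(norm.length());
        norm.codePoints()
            .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // é (U+00E9), æ (U+00E6) and the Chinese character U+4E2D
        String input = "\u00e9\u00e6\u4e2d";
        System.out.println(stripNonSpacingMarks(input)); // e, then æ and U+4E2D unchanged
    }
}
```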

I ran into a problem with string equality checks where one of the compared strings contained characters with codes in the 128-255 range.

That is, a non-breaking space [Hex A0] versus a regular space [Hex 20]. To show non-breaking spaces in HTML I had used the following spacing entities. Their characters and bytes are: &emsp; is a very wide space [ ] {-30, -128, -125}, &ensp; is a somewhat wide space [ ] {-30, -128, -126}, &thinsp; is a narrow space [ ] {32}, non-HTML space {}

    // s2 uses EM SPACE (U+2003) between the words, as the byte output shows.
    String s1 = "My Sample Space Data", s2 = "My\u2003Sample\u2003Space\u2003Data";
    System.out.format("S1: %s\n", java.util.Arrays.toString(s1.getBytes()));
    System.out.format("S2: %s\n", java.util.Arrays.toString(s2.getBytes()));

Byte output:

    S1: [77, 121, 32, 83, 97, 109, 112, 108, 101, 32, 83, 112, 97, 99, 101, 32, 68, 97, 116, 97]
    S2: [77, -30, -128, -125, -30, -128, -125, -30, -128, -125, 97, 116, 97]

Use the code below to see the different spaces and their byte codes (see the wiki page List_of_Unicode_characters):

    String spacing_entities = "very wide space,narrow space,regular space,invisible separator";
    System.out.println("Space String :"+ spacing_entities);
    byte[] byteArray =
        // spacing_entities.getBytes( Charset.forName("UTF-8") );
        // Charset.forName("UTF-8").encode( s2 ).array();
        {-30, -128, -125, 44, -30, -128, -126, 44, 32, 44, -62, -96};
    System.out.println("Bytes:"+ Arrays.toString( byteArray ) );
    try {
        System.out.format("Bytes to String[%S] \n ", new String(byteArray, "UTF-8"));
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
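For reference, a small sketch that prints the UTF-8 bytes of a few space characters, written with \u escapes so they cannot be lost in copy and paste. The class and method names are made up; the charset is passed explicitly so the result does not depend on the platform default.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SpaceBytes {
    // Returns the UTF-8 encoding of s as a printable list of signed bytes.
    static String utf8Bytes(String s) {
        return Arrays.toString(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println("EM SPACE   U+2003: " + utf8Bytes("\u2003")); // [-30, -128, -125]
        System.out.println("EN SPACE   U+2002: " + utf8Bytes("\u2002")); // [-30, -128, -126]
        System.out.println("NO-BREAK   U+00A0: " + utf8Bytes("\u00a0")); // [-62, -96]
        System.out.println("SPACE      U+0020: " + utf8Bytes(" "));      // [32]
    }
}
```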

➩ ASCII transliteration of Unicode strings for Java: unidecode
```
 String initials = Unidecode.decode( s2 ); 
```

➩ Using Guava: Google Core Libraries for Java.

    String replaceFrom = CharMatcher.WHITESPACE.replaceFrom( s2, " " );

To URL-encode the spaces, use the Guava library.

    String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString);

To overcome this problem, use String.replaceAll() with some regular expressions.

    // \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
    s2 = s2.replaceAll("\\p{Zs}", " ");
    s2 = s2.replaceAll("[^\\p{ASCII}]", " ");
    s2 = s2.replaceAll(" ", " ");
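A minimal sketch of the \p{Zs} replacement on its own (names are made up for illustration). It maps every Unicode space separator to a plain space, which is enough to make the equality check from earlier succeed.

```java
final class SpaceNormalizer {
    // Replaces each Unicode space separator (category Zs) with U+0020.
    static String normalizeSpaces(String s) {
        return s.replaceAll("\\p{Zs}", " ");
    }

    public static void main(String[] args) {
        String s1 = "My Sample Space Data";
        // EM SPACE, NO-BREAK SPACE and EN SPACE between the words
        String s2 = "My\u2003Sample\u00a0Space\u2002Data";
        System.out.println(s1.equals(s2));                  // false
        System.out.println(s1.equals(normalizeSpaces(s2))); // true
    }
}
```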

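➩ Using java.text.Normalizer.Form. This enum provides constants for the four Unicode normalization forms described in Unicode Standard Annex #15 (Unicode Normalization Forms), and two methods to access them.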
```
 s2 = Normalizer.normalize(s2, Normalizer.Form.NFKC); 
```
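A short sketch of what NFKC does and does not do here (the class name is made up). It folds compatibility characters such as EM SPACE, NO-BREAK SPACE and the ﬁ ligature into their plain equivalents, but it composes accents rather than removing them, so it is not an accent stripper by itself.

```java
import java.text.Normalizer;

final class NfkcDemo {
    // Compatibility decomposition followed by canonical composition.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // EM SPACE and NO-BREAK SPACE become ordinary spaces.
        System.out.println(nfkc("My\u2003Sample\u00a0Data").equals("My Sample Data")); // true
        // The ligature U+FB01 becomes the two letters "fi".
        System.out.println(nfkc("\ufb01"));                                            // fi
        // e + COMBINING ACUTE is composed into é (U+00E9); the accent stays.
        System.out.println(nfkc("e\u0301").equals("\u00e9"));                          // true
    }
}
```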

Testing a string and the output of the different approaches: ➩ Unidecode, Normalizer, StringUtils.

    String strUni = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß";

    // This is a funky String AE,O,D,ss
    String initials = Unidecode.decode( strUni );

    // The following produce this o/p: This is a funky String Æ,Ø,Ð,ß
    String temp = Normalizer.normalize(strUni, Normalizer.Form.NFD);
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    temp = pattern.matcher(temp).replaceAll("");

    String input = org.apache.commons.lang3.StringUtils.stripAccents( strUni );

Using Unidecode is the best choice; my final code is shown below.

    public static void main(String[] args) {
        // s2 uses EM SPACE (U+2003) between the words.
        String s1 = "My Sample Space Data", s2 = "My\u2003Sample\u2003Space\u2003Data";
        String initials = Unidecode.decode( s2 );
        if( s1.equals(s2)) { //[ , ] %A0 - %2C - %20 « http://www.ascii-code.com/
            System.out.println("Equal Unicode Strings");
        } else if( s1.equals( initials ) ) {
            System.out.println("Equal Non Unicode Strings");
        } else {
            System.out.println("Not Equal");
        }
    }

I suggest Junidecode. It handles not only 'Ł' and 'Ø', but also works well for transcribing other scripts, such as Chinese, into the Latin alphabet.
