UTF-8 to/from wide char conversion in STL

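Is it possible to convert a UTF-8 string held in a std::string to a std::wstring, and vice versa? In a Windows application I would use MultiByteToWideChar and WideCharToMultiByte. However, the code is compiled for multiple operating systems, and I am limited to the standard C++ library.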

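I asked this question 5 years ago. This thread was very helpful to me at the time; I came to a conclusion and moved on with my project. Funnily enough, I recently needed something similar, completely unrelated to that earlier project. While researching possible solutions, I stumbled upon my own question :)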

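The solution I chose this time is based on C++11. The Boost libraries that Constantin mentions in his answer are now part of the standard. If we replace std::wstring with the new string type std::u16string, the conversions look like this: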

UTF-8 to UTF-16

    std::string source;
    ...
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    std::u16string dest = convert.from_bytes(source);

UTF-16 to UTF-8

    std::u16string source;
    ...
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    std::string dest = convert.to_bytes(source);
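
For completeness, the two snippets above can be wrapped into a pair of functions. This is only a sketch: the names utf8_to_utf16 and utf16_to_utf8 are hypothetical and not part of the original post. Note that std::wstring_convert and std::codecvt_utf8_utf16 have been deprecated since C++17, so newer compilers may warn about them.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper names, chosen for illustration only.
// Both functions throw std::range_error if the input is malformed,
// because no fallback error string is passed to wstring_convert.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(utf8);
}

std::string utf16_to_utf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(utf16);
}
```

A code point outside the Basic Multilingual Plane, such as U+1F600, comes back from utf8_to_utf16 as two char16_t units (a surrogate pair).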

As the other answers show, there are multiple ways to approach the problem. That is why I have not picked an accepted answer.

UTF8-CPP: UTF-8 with C++ in a Portable Way

You can extract utf8_codecvt_facet from the Boost serialization library.

Their usage example:

    typedef wchar_t ucs4_t;

    std::locale old_locale;
    std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);

    // Set a new global locale
    std::locale::global(utf8_locale);

    // Send the UCS-4 data out, converting to UTF-8
    {
        std::wofstream ofs("data.ucd");
        ofs.imbue(utf8_locale);
        std::copy(ucs4_data.begin(), ucs4_data.end(),
                  std::ostream_iterator<ucs4_t, ucs4_t>(ofs));
    }

    // Read the UTF-8 data back in, converting to UCS-4 on the way in
    std::vector<ucs4_t> from_file;
    {
        std::wifstream ifs("data.ucd");
        ifs.imbue(utf8_locale);
        ucs4_t item = 0;
        while (ifs >> item) from_file.push_back(item);
    }

Look for the utf8_codecvt_facet.hpp and utf8_codecvt_facet.cpp files in the Boost sources.

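The problem definition explicitly states that the 8-bit character encoding is UTF-8. That makes this a trivial problem; all it requires is converting from one UTF specification to another.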

Just look at the encodings described on the Wikipedia pages for UTF-8, UTF-16, and UTF-32.

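The principle is simple: go through the input and assemble a 32-bit Unicode code point according to one UTF specification, then emit that code point according to the other specification. The individual code points need no translation, as they would with any other character encoding; that is what makes this a simple problem.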
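
To make the two halves of that principle concrete, here is a small sketch that decodes the first code point of a UTF-8 string and then emits it as UTF-16. The function names decode_first_utf8 and encode_utf16 are made up for this illustration, and the decoder assumes the input is well-formed.

```cpp
#include <string>

// Hypothetical helper: assemble the first code point of a UTF-8 string.
// Assumes the input is well-formed; no validation is performed.
char32_t decode_first_utf8(const std::string& s)
{
    unsigned char lead = static_cast<unsigned char>(s[0]);
    // Number of continuation bytes implied by the lead byte.
    int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    // Payload bits of the lead byte: 7, 5, 4 or 3 bits.
    char32_t cp = extra == 0 ? lead : lead & (0x3F >> extra);
    for (int i = 1; i <= extra; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
    return cp;
}

// Hypothetical helper: emit one code point as UTF-16 code units.
std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // leaves a 20-bit value
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, the bytes F0 9F 98 80 decode to U+1F600, which encode_utf16 emits as the surrogate pair D83D DE00.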

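Here is a quick implementation of wchar_t to UTF-8 conversion and vice versa. It assumes that the input is already properly encoded; the old saying "garbage in, garbage out" applies here. I believe that validating the encoding is best done as a separate step.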
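
One possible shape for that separate validation step is sketched below. The function is_valid_utf8 is a hypothetical helper, not something from the original answer; it rejects stray continuation bytes, truncated sequences, overlong forms, surrogates, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned int cp;
        unsigned int min;  // smallest code point allowed for this length
        if (lead <= 0x7F)                      { len = 1; cp = lead;        min = 0; }
        else if (lead >= 0xC0 && lead <= 0xDF) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if (lead >= 0xE0 && lead <= 0xEF) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if (lead >= 0xF0 && lead <= 0xF7) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;                     // stray continuation or invalid lead byte

        if (s.size() - i < len) return false;  // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                     // overlong encoding
        if (cp > 0x10FFFF) return false;                // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate code point
        i += len;
    }
    return true;
}
```

A caller could run this check first and only hand the string to the conversion functions when it returns true.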

    #include <string>

    std::string wchar_to_UTF8(const wchar_t * in)
    {
        std::string out;
        unsigned int codepoint = 0;
        for (; *in != 0; ++in)
        {
            if (*in >= 0xd800 && *in <= 0xdbff)
                // High surrogate: keep the upper bits and wait for the low half.
                codepoint = ((*in - 0xd800) << 10) + 0x10000;
            else
            {
                if (*in >= 0xdc00 && *in <= 0xdfff)
                    codepoint |= *in - 0xdc00;  // low surrogate completes the pair
                else
                    codepoint = *in;

                if (codepoint <= 0x7f)
                    out.append(1, static_cast<char>(codepoint));
                else if (codepoint <= 0x7ff)
                {
                    out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
                    out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
                }
                else if (codepoint <= 0xffff)
                {
                    out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
                    out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                    out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
                }
                else
                {
                    out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
                    out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
                    out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                    out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
                }
                codepoint = 0;
            }
        }
        return out;
    }

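The above code works for both UTF-16 and UTF-32 input, simply because the values in the range d800 through dfff are invalid code points; they indicate that you are decoding UTF-16. If you know that wchar_t is 32 bits, you can remove some code to optimize the function.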

    std::wstring UTF8_to_wchar(const char * in)
    {
        std::wstring out;
        unsigned int codepoint = 0;
        while (*in != 0)
        {
            unsigned char ch = static_cast<unsigned char>(*in);
            if (ch <= 0x7f)
                codepoint = ch;
            else if (ch <= 0xbf)
                codepoint = (codepoint << 6) | (ch & 0x3f);  // continuation byte
            else if (ch <= 0xdf)
                codepoint = ch & 0x1f;  // lead byte of a 2-byte sequence
            else if (ch <= 0xef)
                codepoint = ch & 0x0f;  // lead byte of a 3-byte sequence
            else
                codepoint = ch & 0x07;  // lead byte of a 4-byte sequence
            ++in;
            // Emit once the next byte is no longer a continuation byte.
            if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
            {
                if (sizeof(wchar_t) > 2)
                    out.append(1, static_cast<wchar_t>(codepoint));
                else if (codepoint > 0xffff)
                {
                    // Surrogate pairs encode the offset from 0x10000.
                    codepoint -= 0x10000;
                    out.append(1, static_cast<wchar_t>(0xd800 + (codepoint >> 10)));
                    out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
                }
                else if (codepoint < 0xd800 || codepoint >= 0xe000)
                    out.append(1, static_cast<wchar_t>(codepoint));
            }
        }
        return out;
    }

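Again, if you know that wchar_t is 32 bits you could remove some code from this function, but in this case it shouldn't make any difference. The expression sizeof(wchar_t) > 2 is known at compile time, so any decent compiler will recognize the dead code and remove it.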
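
If you would rather make the compile-time nature of that test explicit than rely on the optimizer, C++17 offers if constexpr. The following is only a sketch of the idea, with append_code_point being a hypothetical helper that is not part of the code above.

```cpp
#include <string>

// Hypothetical helper: append one code point to a std::wstring,
// choosing UTF-32 or UTF-16 output at compile time (requires C++17).
void append_code_point(std::wstring& out, unsigned int codepoint)
{
    if constexpr (sizeof(wchar_t) > 2) {
        // 32-bit wchar_t: one unit per code point.
        out.push_back(static_cast<wchar_t>(codepoint));
    } else {
        // 16-bit wchar_t: split values above 0xffff into a surrogate pair.
        if (codepoint > 0xffff) {
            codepoint -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xd800 + (codepoint >> 10)));
            out.push_back(static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
        } else {
            out.push_back(static_cast<wchar_t>(codepoint));
        }
    }
}
```

The generated code should be the same as with a plain if; the difference is that the choice is visible in the source.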

There are several ways to do this, but the results depend on what the character encodings are in the string and wstring variables.

If you know the string is ASCII, you can simply use wstring's iterator constructor:

    string s = "This is surely ASCII.";
    wstring w(s.begin(), s.end());
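
Since the iterator constructor silently produces garbage for anything that is not ASCII, it may be worth checking that assumption first. A minimal sketch, with is_ascii and widen_ascii as hypothetical helper names:

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Hypothetical helper: true if every byte is in the 7-bit ASCII range.
bool is_ascii(const std::string& s)
{
    return std::all_of(s.begin(), s.end(), [](char c) {
        return static_cast<unsigned char>(c) < 0x80;
    });
}

// Hypothetical helper: widen only when the input is known to be ASCII.
std::wstring widen_ascii(const std::string& s)
{
    if (!is_ascii(s))
        throw std::invalid_argument("input is not pure ASCII");
    return std::wstring(s.begin(), s.end());
}
```

Throwing is just one design choice here; returning an empty result or an error flag would work equally well.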

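If your string has some other encoding, however, you will get very bad results. If the encoding is Unicode, you could take a look at the ICU project, which provides a cross-platform set of libraries that convert to and from all sorts of Unicode encodings.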

If your string contains characters in a code page, then may $DEITY have mercy on your soul.

ConvertUTF.h ConvertUTF.c

Credit to bames53 for providing updated versions

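You can use the codecvt locale facet. There is a specific specialization defined, codecvt<wchar_t, char, mbstate_t>, that may be of use to you; however, its behavior is system-specific and it is not guaranteed to convert to UTF-8 in any way.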
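
As a rough illustration of how that facet is reached, the sketch below obtains it from a std::locale with std::use_facet and calls in() to widen a narrow string. The function name widen_with_locale is hypothetical, and which narrow encoding is actually assumed depends entirely on the locale you pass in and on your platform.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: widen `in` using the codecvt facet of `loc`.
// The narrow encoding that is assumed is whatever `loc` defines.
std::wstring widen_with_locale(const std::string& in, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& facet = std::use_facet<facet_type>(loc);

    // Each wide character consumes at least one input byte,
    // so in.size() elements are always enough.
    std::wstring out(in.size(), L'\0');
    if (in.empty())
        return out;

    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    std::codecvt_base::result r = facet.in(
        state,
        in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("conversion failed or input was incomplete");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

With std::locale::classic() this only handles plain ASCII reliably; a UTF-8 conversion would need a locale whose narrow encoding is UTF-8, and whether such a locale is installed is again system-specific.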

UTFConverter: check out this library. It does this kind of conversion, but you also need the ConvertUTF class; I found it here

I don't think there is a portable way of doing this. C++ doesn't know the encoding of its multibyte characters.

As Chris suggested, your best bet is to use codecvt.
