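How do I get a consistent byte representation of strings in C# without manually specifying an encoding?
How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding?
I'm going to encrypt the string. I can encrypt it without converting, but I'd still like to know why encoding comes into play here. Just give me the bytes, is what I say.
Also, why should encoding be taken into consideration? Can't I simply get what bytes the string has been stored in? Why is there a dependency on character encodings?
Contrary to the answers here, you do not need to worry about encoding if the bytes don't need to be interpreted.
Like you mentioned, your goal is simply to "get what bytes the string has been stored in".
(And, of course, to be able to reconstruct the string from the bytes.)
For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do not need to worry about encodings for this.
Just do this instead: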
    static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }
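As long as your program (or other programs) don't try to interpret the bytes somehow, which you obviously didn't mention you intend to do, there is nothing wrong with this approach! Worrying about encodings just makes your life more complicated for no real reason.
Additional benefits of this approach:
It doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!
It will be encoded and decoded just the same, because you are only looking at the bytes.
If you used a specific encoding, though, it would have given you trouble with encoding/decoding invalid characters.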
It depends on the encoding you want for your string (ASCII, UTF-8, ...).
For example:

    byte[] b1 = System.Text.Encoding.UTF8.GetBytes(myString);
    byte[] b2 = System.Text.Encoding.ASCII.GetBytes(myString);

A small sample of why encoding matters:

    string pi = "\u03a0";
    byte[] ascii = System.Text.Encoding.ASCII.GetBytes(pi);
    byte[] utf8 = System.Text.Encoding.UTF8.GetBytes(pi);
    Console.WriteLine(ascii.Length); // Will print 1
    Console.WriteLine(utf8.Length);  // Will print 2
    Console.WriteLine(System.Text.Encoding.ASCII.GetString(ascii)); // Will print '?'

ASCII simply isn't equipped to deal with special characters.
Internally, the .NET Framework uses UTF-16 to represent strings, so if you simply want the exact bytes that .NET uses, use System.Text.Encoding.Unicode.GetBytes(...).
See Character Encoding in the .NET Framework (MSDN) for more information.
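To make the claim about UTF-16 concrete, here is a minimal sketch that compares a raw copy of the string's code units with the output of Encoding.Unicode. The class and method names are invented for illustration, and the comparison is only expected to hold on a little-endian machine and for strings without unpaired surrogates.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper class, not part of any answer in this thread.
public static class Utf16BytesDemo
{
    // Copies the string's UTF-16 code units into a byte array,
    // using the machine's native byte order.
    public static byte[] RawBytes(string s)
    {
        var bytes = new byte[s.Length * sizeof(char)];
        Buffer.BlockCopy(s.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    // Reports whether the raw copy matches Encoding.Unicode (UTF-16LE).
    public static bool SameAsEncodingUnicode(string s)
    {
        return RawBytes(s).SequenceEqual(Encoding.Unicode.GetBytes(s));
    }
}
```

On typical x86/x64 and ARM hardware, Utf16BytesDemo.SameAsEncodingUnicode("\u03a0") should return true, which is why Encoding.Unicode is usually suggested as the "no conversion" choice.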
The accepted answer is very, very complicated. Use the included .NET classes for this:

    const string data = "A string with international characters: Norwegian: ÆØÅæøå, Chinese: 喂 谢谢";
    var bytes = System.Text.Encoding.UTF8.GetBytes(data);
    var decoded = System.Text.Encoding.UTF8.GetString(bytes);

Don't reinvent the wheel if you don't have to...
    // using System.IO;
    // using System.Runtime.Serialization.Formatters.Binary;
    BinaryFormatter bf = new BinaryFormatter();
    byte[] bytes;
    MemoryStream ms = new MemoryStream();

    string orig = "喂 Hello 谢谢 Thank You";
    bf.Serialize(ms, orig);
    ms.Seek(0, 0);
    bytes = ms.ToArray();

    MessageBox.Show("Original bytes Length: " + bytes.Length.ToString());
    MessageBox.Show("Original string Length: " + orig.Length.ToString());

    for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo encrypt
    for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo decrypt

    BinaryFormatter bfx = new BinaryFormatter();
    MemoryStream msx = new MemoryStream();
    msx.Write(bytes, 0, bytes.Length);
    msx.Seek(0, 0);
    string sx = (string)bfx.Deserialize(msx);

    MessageBox.Show("Still intact :" + sx);
    MessageBox.Show("Deserialize string Length(still intact): " + sx.Length.ToString());

    BinaryFormatter bfy = new BinaryFormatter();
    MemoryStream msy = new MemoryStream();
    bfy.Serialize(msy, sx);
    msy.Seek(0, 0);
    byte[] bytesy = msy.ToArray();

    MessageBox.Show("Deserialize bytes Length(still intact): " + bytesy.Length.ToString());
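You need to take the encoding into account, because one character can be represented by one or more bytes (up to 4 in UTF-8, UTF-16 and UTF-32), and different encodings treat these bytes differently.
Joel has a post on this:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)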
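As a rough illustration of "one or more bytes per character", the sketch below asks three built-in encodings how many bytes they need for the same text. The wrapper class is hypothetical; the GetByteCount calls are the standard System.Text.Encoding API.

```csharp
using System.Text;

// Hypothetical wrapper used only for this illustration.
public static class ByteCountDemo
{
    public static int Utf8Bytes(string s)  => Encoding.UTF8.GetByteCount(s);
    public static int Utf16Bytes(string s) => Encoding.Unicode.GetByteCount(s);
    public static int Utf32Bytes(string s) => Encoding.UTF32.GetByteCount(s);
}
```

For "A" the three methods return 1, 2 and 4; for "\u00e9" (é) they return 2, 2 and 4; for "\u5582" (喂) they return 3, 2 and 4; and for "\U0001F600" (an emoji outside the Basic Multilingual Plane) all three return 4.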
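This is a popular question. It is important to understand what the question author is asking, and that it differs from what is likely the most common need. To discourage misuse of the code where it is not needed, I've answered the common need first.
Common Need
Every string has a character set and an encoding. When you convert a System.String object to an array of System.Byte, you still have a character set and an encoding. For most usages, you know which character set and encoding you need, and .NET makes it simple to "copy with conversion". Just choose the appropriate Encoding class.

    // using System.Text;
    Encoding.UTF8.GetBytes(".NET String to byte array")

The conversion may need to handle cases where the target character set or encoding doesn't support a character that is in the source. You have some choices: exception, substitution, or skipping. The default policy is to substitute a '?'.

    // using System.Text;
    var text = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes("You win €100"));
    // -> "You win ?100"

Clearly, conversions are not necessarily lossless!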
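The three policies mentioned above (exception, substitution, skipping) can be sketched with the fallback classes in System.Text. The class and method names below are made up for illustration; the calls to Encoding.GetEncoding, EncoderReplacementFallback and EncoderFallback.ExceptionFallback are the standard API, and "us-ascii" is assumed to be an available encoding name on the target runtime.

```csharp
using System;
using System.Text;

// Hypothetical demo class; only the System.Text calls are real API.
public static class FallbackDemo
{
    // Substitution (default for Encoding.ASCII): unsupported characters become '?'.
    public static string Substitute(string s)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
    }

    // Skipping: an empty replacement string silently drops unsupported characters.
    public static string Skip(string s)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback(string.Empty));
        return ascii.GetString(ascii.GetBytes(s));
    }

    // Exception: returns false when a character cannot be encoded.
    public static bool TryEncodeStrict(string s, out byte[] bytes)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            bytes = ascii.GetBytes(s);
            return true;
        }
        catch (EncoderFallbackException)
        {
            bytes = Array.Empty<byte>();
            return false;
        }
    }
}
```

With the input "You win €100", Substitute returns "You win ?100", Skip returns "You win 100", and TryEncodeStrict returns false. Which policy is appropriate depends on whether silent data loss is acceptable for the receiver.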
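Note: for System.String, the source character set is Unicode.
The only confusing thing is that .NET uses the name of a character set for the name of one particular encoding of that character set. Encoding.Unicode should have been called Encoding.UTF16.
That's it for most usages. If that's what you need, stop reading here. See the fun Joel Spolsky article if you don't understand what an encoding is.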
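Specific Need
Now, the question author asks, "Every string is stored as an array of bytes, right? Why can't I simply have those bytes?"
He doesn't want any conversion.
From the C# spec:
Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.
So, we know that if we ask for the null conversion (i.e., from UTF-16 to UTF-16), we'll get the desired result:

    Encoding.Unicode.GetBytes(".NET String to byte array")

But to avoid mentioning encodings, we must do it another way. If an intermediate data type is acceptable, there is a conceptual shortcut:

    ".NET String to byte array".ToCharArray()

That doesn't give us the desired data type, but Mehrdad's answer shows how to convert this Char array to a Byte array using BlockCopy. However, this copies the string twice! It also explicitly uses encoding-specific code: the data type System.Char.
The only way to get at the actual bytes the String is stored in is to use a pointer. The fixed statement allows taking the address of values. From the C# spec:
[For] an expression of type string, ... the initializer computes the address of the first character in the string.
To do so, the compiler writes code that skips over the other parts of the string object using RuntimeHelpers.OffsetToStringData. So, to get the raw bytes, just create a pointer to the string and copy the number of bytes needed.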
    // using System.Runtime.InteropServices;
    unsafe byte[] GetRawBytes(String s)
    {
        if (s == null) return null;
        var codeunitCount = s.Length;
        /* We know that String is a sequence of UTF-16 code units
           and such code units are 2 bytes */
        var byteCount = codeunitCount * 2;
        var bytes = new byte[byteCount];
        fixed (void* pRaw = s)
        {
            Marshal.Copy((IntPtr)pRaw, bytes, 0, byteCount);
        }
        return bytes;
    }
As @CodesInChaos pointed out, the result depends on the endianness of the machine. But the question author is not concerned with that.
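The endianness point can be sketched without unsafe code. The following assumes a runtime that provides System.Span (for example .NET Core 2.1 or later); MemoryMarshal.AsBytes is a swapped-in, pointer-free way of viewing the string's memory, not the method used in the answer above, and the class name is hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical demo class; assumes Span support in the target runtime.
public static class EndiannessDemo
{
    // Views the string's UTF-16 code units as bytes in native byte order and copies them.
    public static byte[] Native(string s)
    {
        return MemoryMarshal.AsBytes(s.AsSpan()).ToArray();
    }

    // Explicit byte orders, independent of the machine.
    public static byte[] LittleEndian(string s) { return Encoding.Unicode.GetBytes(s); }
    public static byte[] BigEndian(string s)    { return Encoding.BigEndianUnicode.GetBytes(s); }
}
```

For "A", LittleEndian yields 41 00 and BigEndian yields 00 41. Native matches whichever of the two corresponds to BitConverter.IsLittleEndian, which is the practical reason to name the byte order explicitly when bytes leave the process.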
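To prove that Mehrdad's answer works: his approach can even persist unpaired surrogate characters. Many people raised this objection against my answer, but everyone is equally guilty of it, e.g. System.Text.Encoding.UTF8.GetBytes and System.Text.Encoding.Unicode.GetBytes. Those encoding methods can't persist the high surrogate character d800, for example; they merely replace high surrogate characters with the value fffd: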
    using System;

    class Program
    {
        static void Main(string[] args)
        {
            string t = "爱虫";
            string s = "Test\ud800Test";

            byte[] dumpToBytes = GetBytes(s);
            string getItBack = GetString(dumpToBytes);

            foreach (char item in getItBack)
            {
                Console.WriteLine("{0} {1}", item, ((ushort)item).ToString("x"));
            }
        }

        static byte[] GetBytes(string str)
        {
            byte[] bytes = new byte[str.Length * sizeof(char)];
            System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
            return bytes;
        }

        static string GetString(byte[] bytes)
        {
            char[] chars = new char[bytes.Length / sizeof(char)];
            System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
            return new string(chars);
        }
    }
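Output:

    T 54
    e 65
    s 73
    t 74
    ? d800
    T 54
    e 65
    s 73
    t 74

Try that with System.Text.Encoding.UTF8.GetBytes or System.Text.Encoding.Unicode.GetBytes; they will merely replace high surrogate characters with the value fffd.
Every time there's activity on this question, I still think about a serializer (be it from Microsoft or from a third-party component) that can persist strings even when they contain unpaired surrogate characters. I google this every now and then: serialization unpaired surrogate character .NET. This doesn't make me lose any sleep, but it's kind of annoying when somebody comments that my answer is flawed, yet their answers are equally flawed when it comes to unpaired surrogate characters.
Darn, Microsoft should have just used System.Buffer.BlockCopy in its BinaryFormatter ツ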
谢谢!
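The behaviour described in this answer can be checked with a small sketch that encodes a string containing a lone high surrogate three ways. The class and method names are hypothetical; the strict variant relies on the UnicodeEncoding constructor whose third argument enables error detection, and the raw-copy byte positions quoted below assume a little-endian machine.

```csharp
using System;
using System.Text;

// Hypothetical demo class for the unpaired-surrogate discussion.
public static class LoneSurrogateDemo
{
    // Raw copy of the UTF-16 code units: nothing is validated or replaced.
    public static byte[] RawCopy(string s)
    {
        var bytes = new byte[s.Length * sizeof(char)];
        Buffer.BlockCopy(s.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    // Default UTF-16 encoder: an unpaired surrogate is replaced with U+FFFD.
    public static byte[] Lenient(string s)
    {
        return Encoding.Unicode.GetBytes(s);
    }

    // UTF-16 encoder with error detection enabled: invalid input throws.
    public static bool StrictAccepts(string s)
    {
        var strict = new UnicodeEncoding(false, false, true); // bigEndian, byteOrderMark, throwOnInvalidBytes
        try
        {
            strict.GetBytes(s);
            return true;
        }
        catch (ArgumentException) // EncoderFallbackException derives from ArgumentException
        {
            return false;
        }
    }
}
```

For "Test\ud800Test", the bytes at offsets 8 and 9 are 00 D8 in RawCopy (on a little-endian machine) but FD FF in Lenient, and StrictAccepts returns false. Choosing the strict encoder at least turns silent replacement into a visible error.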
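Try this, a lot less code:

    System.Text.Encoding.UTF8.GetBytes("TEST String");

The first part of your question (how to get the bytes) was already answered by others: look at the System.Text.Encoding class.
I will address your follow-up question: why do you need to pick an encoding? Why can't you get that from the string class itself?
The answer is in two parts.
First of all, the bytes used internally by the string class don't matter, and whenever you assume they do, you are likely introducing a bug.
If your program is entirely within the .NET world, then you don't need to worry about getting byte arrays for strings at all, even if you're sending data across a network. Instead, use .NET serialization to take care of transmitting the data. You don't worry about the actual bytes any more: the serialization formatter does it for you.
On the other hand, what if you are sending these bytes somewhere that you can't guarantee will pull in data from a .NET serialized stream? In this case you definitely do need to worry about encoding, because obviously this external system cares. So again, the internal bytes used by the string don't matter: you need to pick an encoding so you can be explicit about it on the receiving end, even if it's the same encoding that .NET uses internally.
I understand that in this case you might prefer to use the actual bytes stored by the string variable in memory where possible, with the idea that it might save some work creating your byte stream. However, I put it to you that this is unimportant compared to making sure your output is understood at the other end, and to guarantee that, you must be explicit about your encoding. Additionally, if you really want to match your internal bytes, you can already just choose the Unicode encoding and get that performance saving.
Which brings me to the second part: picking the Unicode encoding is telling .NET to use the underlying bytes. You do need to pick this encoding, because when some new-fangled Unicode-Plus comes out, the .NET runtime needs to be free to use this newer, better encoding model without breaking your program. But for the moment (and the foreseeable future), just choosing the Unicode encoding gives you what you want.
It's also important to understand that your string has to be rewritten to the wire, and that involves at least some translation of the bit pattern even when you use a matching encoding. The computer needs to account for things like big vs. little endian, network byte order, packetization, session information, etc.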
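Well, I've read all the answers, and they are either about using an encoding or about serialization that drops unpaired surrogates.
That is bad when the string comes, for example, from SQL Server, where it was built from a byte array storing, for example, a password hash. If we drop anything from it, we store an invalid hash, and if we want to store it in XML, we want to leave it intact (because the XML writer throws an exception on any unpaired surrogate it finds).
So I use Base64 encoding of byte arrays in such cases, but hey, on the Internet there is only one solution to this in C#, and it has a bug in it and only goes one way, so I've fixed the bug and written the reverse procedure. Here you are, future googlers: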
    public static byte[] StringToBytes(string str)
    {
        byte[] data = new byte[str.Length * 2];
        for (int i = 0; i < str.Length; ++i)
        {
            char ch = str[i];
            data[i * 2] = (byte)(ch & 0xFF);                // low byte first
            data[i * 2 + 1] = (byte)((ch & 0xFF00) >> 8);   // then high byte
        }
        return data;
    }

    public static string StringFromBytes(byte[] arr)
    {
        char[] ch = new char[arr.Length / 2];
        for (int i = 0; i < ch.Length; ++i)
        {
            ch[i] = (char)((int)arr[i * 2] + (((int)arr[i * 2 + 1]) << 8));
        }
        return new String(ch);
    }
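The answer mentions Base64 but does not show that step. A minimal sketch of how it could look follows, assuming the goal is an XML-safe text form that survives unpaired surrogates. The class and method names are invented, and Buffer.BlockCopy is used here in place of the manual loop, so the byte order is the machine's native one.

```csharp
using System;

// Hypothetical helper; not the answer author's actual code.
public static class Base64StringDemo
{
    // Copies the raw UTF-16 code units and wraps them in Base64 text.
    public static string ToBase64(string s)
    {
        var bytes = new byte[s.Length * sizeof(char)];
        Buffer.BlockCopy(s.ToCharArray(), 0, bytes, 0, bytes.Length);
        return Convert.ToBase64String(bytes);
    }

    // Reverses ToBase64 on a machine with the same byte order.
    public static string FromBase64(string base64)
    {
        byte[] bytes = Convert.FromBase64String(base64);
        var chars = new char[bytes.Length / sizeof(char)];
        Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }
}
```

Base64StringDemo.FromBase64(Base64StringDemo.ToBase64("a\ud800b")) returns the original string, lone surrogate included, while the Base64 text in between contains only ASCII characters.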
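Also please explain why encoding should be taken into consideration. Can't I simply get what bytes the string has been stored in? Why this dependency on encoding?
Because there is no such thing as "the bytes of the string".
A string (or more generally, a text) is composed of characters: letters, digits, and other symbols. That's all. Computers, however, know nothing about characters; they can only handle bytes. Therefore, if you want to store or transmit text using a computer, you need to transform the characters into bytes. How do you do that? This is where encodings come onto the scene.
An encoding is nothing but a convention for translating logical characters into physical bytes. The simplest and best-known encoding is ASCII, and it is all you need if you write in English. For other languages you will need more complete encodings, and any of the Unicode flavours is the safest choice nowadays.
So, in short, trying to "get the bytes of a string without using encodings" is as impossible as "writing a text without using any language".
By the way, I strongly recommend that you (and anyone, for that matter) read this small piece of wisdom: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)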
C# to convert a string to a byte array:

    public static byte[] StrToByteArray(string str)
    {
        System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();
        return encoding.GetBytes(str);
    }

    byte[] strToByteArray(string str)
    {
        System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
        return enc.GetBytes(str);
    }

You can use the following code to convert between a string and a byte array.

    string s = "Hello World";

    // String to byte[]
    byte[] byte1 = System.Text.Encoding.UTF8.GetBytes(s);

    // byte[] to string: decode with the same encoding that produced the bytes.
    // (Encoding.Default, which ASCIIEncoding.Default also resolves to, is
    // platform-dependent and must not be paired with UTF8.GetString.)
    string str = System.Text.Encoding.UTF8.GetString(byte1);
I'm not sure, but I think the string stores its info as an array of Chars, which is inefficient with bytes. Specifically, the definition of a Char is "Represents a Unicode character".
Take this sample:

    // using System; using System.Text;
    String str = "asdf éß";
    String str2 = "asdf gh";
    EncodingInfo[] info = Encoding.GetEncodings();
    foreach (EncodingInfo enc in info)
    {
        // Print both byte counts, separated so the two numbers do not run together.
        System.Console.WriteLine(enc.Name + " - "
            + enc.GetEncoding().GetByteCount(str) + ", "
            + enc.GetEncoding().GetByteCount(str2));
    }
Take note that the Unicode answer is 14 bytes in both instances, whereas the UTF-8 answer is only 9 bytes for the first, and only 7 for the second.
So if you just want the bytes used by the string, simply use Encoding.Unicode
, but it will be inefficient with storage space.
The key issue is that a glyph in a string can take up to 32 bits (each char code unit is 16 bits), but a byte only has 8 bits to spare. A one-to-one mapping doesn't exist unless you restrict yourself to strings that contain only ASCII characters. System.Text.Encoding has lots of ways to map a string to byte[]; you need to pick one that avoids loss of information and that is easy for your client to use when she needs to map the byte[] back to a string.
UTF-8 is a popular encoding; it is compact and not lossy.
Fastest way:

    public static byte[] GetBytes(string text)
    {
        return System.Text.ASCIIEncoding.UTF8.GetBytes(text);
    }
EDIT as Makotosan commented this is now the best way:
Encoding.UTF8.GetBytes(text)
Use:

    string text = "string";
    byte[] array = System.Text.Encoding.UTF8.GetBytes(text);

The result is:

    [0] = 115
    [1] = 116
    [2] = 114
    [3] = 105
    [4] = 110
    [5] = 103

You can use the following code to convert a string to a byte array in .NET:

    string s_unicode = "abcéabc";
    byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes(s_unicode);
The closest approach to the OP's question is Tom Blodget's, which actually goes into the object and extracts the bytes. I say closest because it depends on implementation of the String Object.
"Can't I simply get what bytes the string has been stored in?"
Sure, but that's where the fundamental error in the question arises. The String is an object which could have an interesting data structure. We already know it does, because it allows unpaired surrogates to be stored. It might store the length. It might keep a pointer to each of the 'paired' surrogates allowing quick counting. Etc. All of these extra bytes are not part of the character data.
What you want is each character's bytes in an array. And that is where 'encoding' comes in. By default you will get UTF-16LE. If you don't care about the bytes themselves except for the round trip, then you can choose any encoding, including the 'default', and convert it back later (assuming the same parameters, such as what the default encoding was, code points, bug fixes, things allowed such as unpaired surrogates, etc.).
But why leave the 'encoding' up to magic? Why not specify the encoding so that you know what bytes you are gonna get?
"Why is there a dependency on character encodings?"
Encoding (in this context) simply means the bytes that represent your string. Not the bytes of the string object. You wanted the bytes the string has been stored in — this is where the question was asked naively. You wanted the bytes of string in a contiguous array that represent the string, and not all of the other binary data that a string object may contain.
Which means how a string is stored is irrelevant. You want a string "Encoded" into bytes in a byte array.
I like Tom Blodget's answer because he took you towards the 'bytes of the string object' direction. It's implementation dependent though, and because he's peeking at internals it might be difficult to reconstitute a copy of the string.
Mehrdad's response is wrong because it is misleading at the conceptual level. You still have a list of bytes, encoded. His particular solution allows for unpaired surrogates to be preserved — this is implementation dependent. His particular solution would not produce the string's bytes accurately if GetBytes
returned the string in UTF-8 by default.
I've changed my mind about this (Mehrdad's solution) — this isn't getting the bytes of the string; rather it is getting the bytes of the character array that was created from the string. Regardless of encoding, the char datatype in c# is a fixed size. This allows a consistent length byte array to be produced, and it allows the character array to be reproduced based on the size of the byte array. So if the encoding were UTF-8, but each char was 6 bytes to accommodate the largest utf8 value, it would still work. So indeed — encoding of the character does not matter.
But a conversion was used — each character was placed into a fixed size box (c#'s character type). However what that representation is does not matter, which is technically the answer to the OP. So — if you are going to convert anyway… Why not 'encode'?
Here is my unsafe implementation of String
to Byte[]
conversion:
    public static unsafe Byte[] GetBytes(String s)
    {
        Int32 length = s.Length * sizeof(Char);
        Byte[] bytes = new Byte[length];

        fixed (Char* pInput = s)
        fixed (Byte* pBytes = bytes)
        {
            Byte* source = (Byte*)pInput;
            Byte* destination = pBytes;

            // Copy 16 bytes per iteration while at least 16 remain.
            if (length >= 16)
            {
                do
                {
                    *((Int64*)destination) = *((Int64*)source);
                    *((Int64*)(destination + 8)) = *((Int64*)(source + 8));
                    source += 16;
                    destination += 16;
                }
                while ((length -= 16) >= 16);
            }

            // Copy the remaining 0..15 bytes in chunks of 8, 4, 2 and 1.
            if (length > 0)
            {
                if ((length & 8) != 0)
                {
                    *((Int64*)destination) = *((Int64*)source);
                    source += 8;
                    destination += 8;
                }
                if ((length & 4) != 0)
                {
                    *((Int32*)destination) = *((Int32*)source);
                    source += 4;
                    destination += 4;
                }
                if ((length & 2) != 0)
                {
                    *((Int16*)destination) = *((Int16*)source);
                    source += 2;
                    destination += 2;
                }
                if ((length & 1) != 0)
                {
                    // Never reached for strings (the byte length is always even);
                    // copy the last byte without advancing the pointers first.
                    destination[0] = source[0];
                }
            }
        }

        return bytes;
    }

It's way faster than the accepted answer's approach, even if it is not as elegant. Here are my Stopwatch benchmarks over 10000000 iterations:

    [Second String: Length 20]
    Buffer.BlockCopy: 746ms
    Unsafe: 557ms

    [Second String: Length 50]
    Buffer.BlockCopy: 861ms
    Unsafe: 753ms

    [Third String: Length 100]
    Buffer.BlockCopy: 1250ms
    Unsafe: 1063ms

In order to use it, you have to tick "Allow Unsafe Code" in your project build properties. As of .NET Framework 3.5, this method can also be used as a String extension:

    public static unsafe class StringExtensions
    {
        public static Byte[] ToByteArray(this String s)
        {
            // Method Code
        }
    }
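Two ways:

    // using System; using System.Collections.Generic;
    // One byte per char; Convert.ToByte throws OverflowException for chars above U+00FF.
    public static byte[] StrToByteArray(this string s)
    {
        List<byte> value = new List<byte>();
        foreach (char c in s.ToCharArray())
            value.Add(Convert.ToByte(c));
        return value.ToArray();
    }

And,

    // Parses a string of hexadecimal digit pairs (spaces ignored) into bytes.
    public static byte[] StrToByteArray(this string s)
    {
        s = s.Replace(" ", string.Empty);
        byte[] buffer = new byte[s.Length / 2];
        for (int i = 0; i < s.Length; i += 2)
            buffer[i / 2] = Convert.ToByte(s.Substring(i, 2), 16);
        return buffer;
    }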
I tend to use the bottom one more often than the top, haven't benchmarked them for speed.
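Note that the second method above does something different from encoding text: it parses hexadecimal digits. A small sketch of the distinction follows; the class name and the odd-length check are additions for illustration, not part of the answer.

```csharp
using System;
using System.Text;

// Hypothetical demo class contrasting hex parsing with text encoding.
public static class HexVersusTextDemo
{
    // Interprets the string as pairs of hex digits, e.g. "48 69" -> { 0x48, 0x69 }.
    public static byte[] ParseHex(string hex)
    {
        hex = hex.Replace(" ", string.Empty);
        if (hex.Length % 2 != 0)
            throw new FormatException("Expected an even number of hex digits.");

        var buffer = new byte[hex.Length / 2];
        for (int i = 0; i < hex.Length; i += 2)
            buffer[i / 2] = Convert.ToByte(hex.Substring(i, 2), 16);
        return buffer;
    }

    // Encodes the characters themselves, e.g. "48" -> { 0x34, 0x38 }.
    public static byte[] EncodeText(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }
}
```

ParseHex("48 69") gives the two bytes 0x48 and 0x69 (the UTF-8 bytes of "Hi"), whereas EncodeText("48") gives 0x34 and 0x38, the codes of the digit characters. Only the second operation answers the question asked in this thread.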
    // using System.Text;
    byte[] utf8Buffer = Encoding.UTF8.GetBytes(something);   // convert to UTF-8, then get its bytes
    byte[] asciiBuffer = Encoding.ASCII.GetBytes(something); // convert to ASCII, then get its bytes
Simple code with LINQ:

    string s = "abc";
    byte[] b = s.Select(e => (byte)e).ToArray();

EDIT: as commented below, this is not a good way, because the cast keeps only the low 8 bits of each char.
Do not replace the lambda with s.Cast<byte>(): that compiles, but it throws InvalidCastException at run time, because a boxed char cannot be unboxed to byte.
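Why the per-character cast is "not a good way" can be shown with one non-ASCII character. The sketch below uses an explicit unchecked cast so the truncation is visible regardless of compiler settings; the class name is hypothetical.

```csharp
using System.Linq;
using System.Text;

// Hypothetical demo class for the LINQ cast discussion.
public static class NarrowingDemo
{
    // Keeps only the low 8 bits of each UTF-16 code unit.
    public static byte[] CastEachChar(string s)
    {
        return s.Select(c => unchecked((byte)c)).ToArray();
    }

    // Real encoding of the same text, for comparison.
    public static byte[] Utf8(string s)
    {
        return Encoding.UTF8.GetBytes(s);
    }
}
```

For "abc" both methods return 61 62 63. For "\u20ac" (the euro sign, U+20AC), CastEachChar returns the single byte AC, from which the original character cannot be recovered, while Utf8 returns E2 82 AC.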
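Simply use this:

    // Note: ASCIIEncoding.Default is the inherited static Encoding.Default,
    // so this uses the platform's default encoding, not ASCII.
    byte[] myByte = System.Text.Encoding.Default.GetBytes(myString);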
If you really want a copy of the underlying bytes of a string, you can use a function like the one that follows. However, you shouldn't; please read on to find out why.

    // using System.Runtime.InteropServices;
    [DllImport(
        "msvcrt.dll",
        EntryPoint = "memcpy",
        CallingConvention = CallingConvention.Cdecl,
        SetLastError = false)]
    private static extern unsafe void* UnsafeMemoryCopy(
        void* destination,
        void* source,
        uint count);

    public static byte[] GetUnderlyingBytes(string source)
    {
        var length = source.Length * sizeof(char);
        var result = new byte[length];
        unsafe
        {
            fixed (char* firstSourceChar = source)
            fixed (byte* firstDestination = result)
            {
                var firstSource = (byte*)firstSourceChar;
                UnsafeMemoryCopy(
                    firstDestination,
                    firstSource,
                    (uint)length);
            }
        }
        return result;
    }

This function will get you a copy of the bytes underlying your string, pretty quickly. You'll get those bytes in whatever way they are encoded on your system. This encoding is almost certainly UTF-16LE, but that is an implementation detail you shouldn't have to care about.
It would be safer, simpler and more reliable to just call,
System.Text.Encoding.Unicode.GetBytes()
In all likelihood this will give the same result, is easier to type, and the bytes will always round-trip with a call to
System.Text.Encoding.Unicode.GetString()
A string can be converted to a byte array in a few different ways, due to the following fact: .NET supports Unicode, and Unicode standardizes several different encodings called UTFs. They have different lengths of byte representation but are equivalent in the sense that when a string is encoded, it can be decoded back to the same string. If the string is encoded with one UTF and decoded on the assumption of a different UTF, however, it can be garbled.
Also, .NET supports non-Unicode encodings, but they are not valid in the general case (they are valid only if the actual string uses a limited subset of Unicode code points, such as ASCII). Internally, .NET uses UTF-16, but for stream representation UTF-8 is usually used. It is also the de facto standard for the Internet.
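What "garbled" looks like when encoder and decoder disagree can be sketched as follows. The class name is hypothetical, and the example assumes the "iso-8859-1" encoding name is available on the runtime in use.

```csharp
using System.Text;

// Hypothetical demo class showing matched and mismatched decoding.
public static class MismatchDemo
{
    // Encode and decode with the same UTF: the text survives.
    public static string RoundTripUtf8(string s)
    {
        return Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(s));
    }

    // Encode as UTF-8 but decode as ISO-8859-1: each byte becomes its own character.
    public static string Utf8ReadAsLatin1(string s)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(s);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8);
    }
}
```

RoundTripUtf8("\u00e9") returns "é" unchanged, while Utf8ReadAsLatin1("\u00e9") returns the two characters "\u00c3\u00a9" ("Ã©"), because the UTF-8 bytes C3 A9 are read as two separate Latin-1 characters.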
Not surprisingly, serialization of string into an array of byte and deserialization is supported by the class System.Text.Encoding
, which is an abstract class; its derived classes support concrete encodings: ASCIIEncoding
and four UTFs ( System.Text.UnicodeEncoding
supports UTF-16)
Ref this link.
For serialization to an array of bytes, use System.Text.Encoding.GetBytes. For the inverse operation, use System.Text.Encoding.GetChars. This function returns an array of characters, so to get a string, use the string constructor System.String(char[]).
Ref this page.
Example:

    string myString = "..."; // some string
    System.Text.Encoding encoding = System.Text.Encoding.UTF8; // or some other, but prefer a UTF if Unicode is used
    byte[] bytes = encoding.GetBytes(myString);

    // The next lines were written in response to follow-up questions:
    myString = new string(encoding.GetChars(bytes));
    bytes = encoding.GetBytes(myString);
    myString = new string(encoding.GetChars(bytes));
    bytes = encoding.GetBytes(myString);
    // How many times shall I repeat it to show there is a round trip? :-)

    // C# to convert a string to a byte array.
    public static byte[] StrToByteArray(string str)
    {
        System.Text.ASCIIEncoding encoding = new System.Text.ASCIIEncoding();
        return encoding.GetBytes(str);
    }

    // C# to convert a byte array to a string.
    byte[] dBytes = ...
    string str;
    System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
    str = enc.GetString(dBytes);
string s = "abcdefghijklmnopqrstuvwxyz"; byte[] b = new System.Text.UTF32Encoding().GetBytes(s);
From byte[] to string (note that this yields a hyphen-separated hexadecimal dump of the bytes, not the decoded text):
return BitConverter.ToString(bytes);
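To show what BitConverter.ToString actually returns, the sketch below puts it next to a real decode. The class and method names are hypothetical, and UTF-8 is assumed as the encoding that produced the bytes.

```csharp
using System;
using System.Text;

// Hypothetical demo class comparing a hex dump with a text decode.
public static class BitConverterDemo
{
    // Hex dump: each byte as two uppercase hex digits, separated by hyphens.
    public static string HexDump(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    // Text decode, assuming the bytes were produced by UTF-8 encoding.
    public static string DecodeUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }
}
```

For the bytes 0x48 0x69, HexDump returns "48-69" and DecodeUtf8 returns "Hi". The hex form is handy for logging or debugging, but it is not the inverse of Encoding.GetBytes.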