将任意GUID编码为可读的ASCII(33-127)的最有效方法是什么?
GUID的标准string表示大约需要36个字符。 这是非常好的,但也很浪费。 我想知道如何使用33-127范围内的所有ASCII字符以尽可能最短的方式对其进行编码。 天真的实现产生22个字符,只是因为128位 / 6位产生22。
霍夫曼编码是我的第二好,唯一的问题是如何select代码….
当然,编码必须是无损的。
使用基地85.请参阅4.1节。 为什么85? 紧凑的IPv6地址表示方法
一个IPv6地址,就像一个GUID由八个16位块组成。
这是一个古老的问题,但我必须解决这个问题,以便我正在努力实现向后兼容的系统。
确切的要求是客户端生成的标识符将写入数据库并存储在20个字符的唯一列中。 它从来没有显示给用户,也没有以任何方式索引。
由于我无法消除这个要求,所以我真的希望使用一个Guid(这在统计上是独一无二的 ),如果我可以将它无损地编码成20个字符,那么考虑到约束条件,这将是一个很好的解决scheme。
Ascii-85允许您将4个字节的二进制数据编码为5个字节的Ascii数据。 所以一个16字节的GUID只能用这种编码方式放入20个ASCII字符中。 Guid可以有3.1962657931507848761677563491821e + 38个离散值,而20个字符的Ascii-85可以有3.8759531084514355873123178482056e + 38个离散值。
写入数据库时,我对截断有一些担心,所以在编码中不包含空格字符。 我也遇到了sorting问题,我通过从编码中排除小写字符来解决这个问题。 此外,它只会通过参数化命令 ,所以任何特殊的SQL字符都会自动转义。
我已经包含了C#代码来执行Ascii-85编码和解码以防止任何人在那里。 显然,取决于你的用法,你可能需要select不同的字符集,因为我的约束使我select了一些不寻常的字符,如“ß”和“Ø” – 但这很容易:
/// <summary> /// This code implements an encoding scheme that uses 85 printable ascii characters /// to encode the same volume of information as contained in a Guid. /// /// Ascii-85 can represent 4 binary bytes as 5 Ascii bytes. So a 16 byte Guid can be /// represented in 20 Ascii bytes. A Guid can have /// 3.1962657931507848761677563491821e+38 discrete values whereas 20 characters of /// Ascii-85 can have 3.8759531084514355873123178482056e+38 discrete values. /// /// Lower-case characters are not included in this encoding to avoid collation /// issues. /// This is a departure from standard Ascii-85 which does include lower case /// characters. /// In addition, no whitespace characters are included as these may be truncated in /// the database depending on the storage mechanism - ie VARCHAR vs CHAR. /// </summary> internal static class Ascii85 { /// <summary> /// 85 printable ascii characters with no lower case ones, so database /// collation can't bite us. No ' ' character either so database can't /// truncate it! /// Unfortunately, these limitation mean resorting to some strange /// characters like 'Æ' but we won't ever have to type these, so it's ok. /// </summary> private static readonly char[] kEncodeMap = new[] { '0','1','2','3','4','5','6','7','8','9', // 10 'A','B','C','D','E','F','G','H','I','J', // 20 'K','L','M','N','O','P','Q','R','S','T', // 30 'U','V','W','X','Y','Z','|','}','~','{', // 40 '!','"','#','$','%','&','\'','(',')','`', // 50 '*','+',',','-','.','/','[','\\',']','^', // 60 ':',';','<','=','>','?','@','_','¼','½', // 70 '¾','ß','Ç','Ð','€','«','»','¿','•','Ø', // 80 '£','†','‡','§','¥' // 85 }; /// <summary> /// A reverse mapping of the <see cref="kEncodeMap"/> array for decoding /// purposes. /// </summary> private static readonly IDictionary<char, byte> kDecodeMap; /// <summary> /// Initialises the <see cref="kDecodeMap"/>. /// </summary> static Ascii85() { kDecodeMap = new Dictionary<char, byte>(); for (byte i = 0; i < kEncodeMap.Length; i++) { kDecodeMap.Add(kEncodeMap[i], i); } } /// <summary> /// Decodes an Ascii-85 encoded Guid. /// </summary> /// <param name="ascii85Encoding">The Guid encoded using Ascii-85.</param> /// <returns>A Guid decoded from the parameter.</returns> public static Guid Decode(string ascii85Encoding) { // Ascii-85 can encode 4 bytes of binary data into 5 bytes of Ascii. // Since a Guid is 16 bytes long, the Ascii-85 encoding should be 20 // characters long. if(ascii85Encoding.Length != 20) { throw new ArgumentException( "An encoded Guid should be 20 characters long.", "ascii85Encoding"); } // We only support upper case characters. ascii85Encoding = ascii85Encoding.ToUpper(); // Split the string in half and decode each substring separately. var higher = ascii85Encoding.Substring(0, 10).AsciiDecode(); var lower = ascii85Encoding.Substring(10, 10).AsciiDecode(); // Convert the decoded substrings into an array of 16-bytes. var byteArray = new[] { (byte)((higher & 0xFF00000000000000) >> 56), (byte)((higher & 0x00FF000000000000) >> 48), (byte)((higher & 0x0000FF0000000000) >> 40), (byte)((higher & 0x000000FF00000000) >> 32), (byte)((higher & 0x00000000FF000000) >> 24), (byte)((higher & 0x0000000000FF0000) >> 16), (byte)((higher & 0x000000000000FF00) >> 8), (byte)((higher & 0x00000000000000FF)), (byte)((lower & 0xFF00000000000000) >> 56), (byte)((lower & 0x00FF000000000000) >> 48), (byte)((lower & 0x0000FF0000000000) >> 40), (byte)((lower & 0x000000FF00000000) >> 32), (byte)((lower & 0x00000000FF000000) >> 24), (byte)((lower & 0x0000000000FF0000) >> 16), (byte)((lower & 0x000000000000FF00) >> 8), (byte)((lower & 0x00000000000000FF)), }; return new Guid(byteArray); } /// <summary> /// Encodes binary data into a plaintext Ascii-85 format string. /// </summary> /// <param name="guid">The Guid to encode.</param> /// <returns>Ascii-85 encoded string</returns> public static string Encode(Guid guid) { // Convert the 128-bit Guid into two 64-bit parts. var byteArray = guid.ToByteArray(); var higher = ((UInt64)byteArray[0] << 56) | ((UInt64)byteArray[1] << 48) | ((UInt64)byteArray[2] << 40) | ((UInt64)byteArray[3] << 32) | ((UInt64)byteArray[4] << 24) | ((UInt64)byteArray[5] << 16) | ((UInt64)byteArray[6] << 8) | byteArray[7]; var lower = ((UInt64)byteArray[ 8] << 56) | ((UInt64)byteArray[ 9] << 48) | ((UInt64)byteArray[10] << 40) | ((UInt64)byteArray[11] << 32) | ((UInt64)byteArray[12] << 24) | ((UInt64)byteArray[13] << 16) | ((UInt64)byteArray[14] << 8) | byteArray[15]; var encodedStringBuilder = new StringBuilder(); // Encode each part into an ascii-85 encoded string. encodedStringBuilder.AsciiEncode(higher); encodedStringBuilder.AsciiEncode(lower); return encodedStringBuilder.ToString(); } /// <summary> /// Encodes the given integer using Ascii-85. /// </summary> /// <param name="encodedStringBuilder">The <see cref="StringBuilder"/> to /// append the results to.</param> /// <param name="part">The integer to encode.</param> private static void AsciiEncode( this StringBuilder encodedStringBuilder, UInt64 part) { // Nb, the most significant digits in our encoded character will // be the right-most characters. var charCount = (UInt32)kEncodeMap.Length; // Ascii-85 can encode 4 bytes of binary data into 5 bytes of Ascii. // Since a UInt64 is 8 bytes long, the Ascii-85 encoding should be // 10 characters long. for (var i = 0; i < 10; i++) { // Get the remainder when dividing by the base. var remainder = part % charCount; // Divide by the base. part /= charCount; // Add the appropriate character for the current value (0-84). encodedStringBuilder.Append(kEncodeMap[remainder]); } } /// <summary> /// Decodes the given string from Ascii-85 to an integer. /// </summary> /// <param name="ascii85EncodedString">Decodes a 10 character Ascii-85 /// encoded string.</param> /// <returns>The integer representation of the parameter.</returns> private static UInt64 AsciiDecode(this string ascii85EncodedString) { if (ascii85EncodedString.Length != 10) { throw new ArgumentException( "An Ascii-85 encoded Uint64 should be 10 characters long.", "ascii85EncodedString"); } // Nb, the most significant digits in our encoded character // will be the right-most characters. var charCount = (UInt32)kEncodeMap.Length; UInt64 result = 0; // Starting with the right-most (most-significant) character, // iterate through the encoded string and decode. for (var i = ascii85EncodedString.Length - 1; i >= 0; i--) { // Multiply the current decoded value by the base. result *= charCount; // Add the integer value for that encoded character. result += kDecodeMap[ascii85EncodedString[i]]; } return result; } }
另外,这里是unit testing。 他们没有我想要的那么彻底,我不喜欢Guid.NewGuid()
被使用的非确定性,但他们应该让你开始:
/// <summary> /// Tests to verify that the Ascii-85 encoding is functioning as expected. /// </summary> [TestClass] [UsedImplicitly] public class Ascii85Tests { [TestMethod] [Description("Ensure that the Ascii-85 encoding is correct.")] [UsedImplicitly] public void CanEncodeAndDecodeAGuidUsingAscii85() { var guidStrings = new[] { "00000000-0000-0000-0000-000000000000", "00000000-0000-0000-0000-0000000000FF", "00000000-0000-0000-0000-00000000FF00", "00000000-0000-0000-0000-000000FF0000", "00000000-0000-0000-0000-0000FF000000", "00000000-0000-0000-0000-00FF00000000", "00000000-0000-0000-0000-FF0000000000", "00000000-0000-0000-00FF-000000000000", "00000000-0000-0000-FF00-000000000000", "00000000-0000-00FF-0000-000000000000", "00000000-0000-FF00-0000-000000000000", "00000000-00FF-0000-0000-000000000000", "00000000-FF00-0000-0000-000000000000", "000000FF-0000-0000-0000-000000000000", "0000FF00-0000-0000-0000-000000000000", "00FF0000-0000-0000-0000-000000000000", "FF000000-0000-0000-0000-000000000000", "FF000000-0000-0000-0000-00000000FFFF", "00000000-0000-0000-0000-0000FFFF0000", "00000000-0000-0000-0000-FFFF00000000", "00000000-0000-0000-FFFF-000000000000", "00000000-0000-FFFF-0000-000000000000", "00000000-FFFF-0000-0000-000000000000", "0000FFFF-0000-0000-0000-000000000000", "FFFF0000-0000-0000-0000-000000000000", "00000000-0000-0000-0000-0000FFFFFFFF", "00000000-0000-0000-FFFF-FFFF00000000", "00000000-FFFF-FFFF-0000-000000000000", "FFFFFFFF-0000-0000-0000-000000000000", "00000000-0000-0000-FFFF-FFFFFFFFFFFF", "FFFFFFFF-FFFF-FFFF-0000-000000000000", "FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF", "1000000F-100F-100F-100F-10000000000F" }; foreach (var guidString in guidStrings) { var guid = new Guid(guidString); var encoded = Ascii85.Encode(guid); Assert.AreEqual( 20, encoded.Length, "A guid encoding should not exceed 20 characters."); var decoded = Ascii85.Decode(encoded); Assert.AreEqual( guid, decoded, "The guids are different after being encoded and decoded."); } } [TestMethod] [Description( "The Ascii-85 encoding is not susceptible to changes in character case.")] [UsedImplicitly] public void Ascii85IsCaseInsensitive() { const int kCount = 50; for (var i = 0; i < kCount; i++) { var guid = Guid.NewGuid(); // The encoding should be all upper case. A reliance // on mixed case will make the generated string // vulnerable to sql collation. var encoded = Ascii85.Encode(guid); Assert.AreEqual( encoded, encoded.ToUpper(), "The Ascii-85 encoding should produce only uppercase characters."); } } }
我希望这可以节省一些麻烦。
另外,如果你发现任何错误,那么让我知道;-)
你有95个字符可用 – 所以,超过6位,但不是多达7(实际上约6.57)。 您可以使用128 / log2(95)=大约19.48个字符,编码为20个字符。 如果以编码forms保存2个字符值得丢失可读性,就像(伪代码):
char encoded[21]; long long guid; // 128 bits number for(int i=0; i<20; ++i) { encoded[i] = chr(guid % 95 + 33); guid /= 95; } encoded[20] = chr(0);
它基本上是通用的“在一些基础上编码”的代码,除了不需要反转“数字”,因为顺序的随意(小端更直接和自然)。 为了从编码string中得到guid,以非常相似的方式,以95为底的多项式计算(从每个数字中减去33后):
guid = 0; for(int i=0; i<20; ++i) { guid *= 95; guid += ord(encoded[i]) - 33; }
基本上使用Horner的多项式评估方法。
只需去Base64 。
使用33的全部范围(顺便说一下,错误的wirh空间是怎么回事?)127给你95个可能的字符。 表示基地95中的2^128
可能的guid值将使用20个字符。 这(模仿像下降nybbles将是不变的)是最好的,你可以做的。 节省自己的麻烦 – 使用基地64。
假设所有GUID都是由相同的algorithm生成的,那么在应用任何其他编码之前,您可以通过不对algorithm半字节进行编码来节省4位。
一个任意的 GUID? “天真”algorithm将产生最佳结果。 进一步压缩GUID的唯一方法是利用“任意”约束排除的数据中的模式。