How do I get a character from a Unicode code point in PHP?
For example, how do I get the character corresponding to U+010F?
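// The charset belongs in Content-Type; Content-Encoding is for compression such as gzip.
header('Content-Type: text/html; charset=UTF-8');

function mb_html_entity_decode($string)
{
    if (extension_loaded('mbstring') === true) {
        mb_language('Neutral');
        mb_internal_encoding('UTF-8');
        mb_detect_order(array('UTF-8', 'ISO-8859-15', 'ISO-8859-1', 'ASCII'));

        // Note: the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2.
        return mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES');
    }

    return html_entity_decode($string, ENT_COMPAT, 'UTF-8');
}

// mb_ord() and mb_chr() are built in since PHP 7.2, so define them only when missing.
if (!function_exists('mb_ord')) {
    function mb_ord($string)
    {
        if (extension_loaded('mbstring') === true) {
            mb_language('Neutral');
            mb_internal_encoding('UTF-8');
            mb_detect_order(array('UTF-8', 'ISO-8859-15', 'ISO-8859-1', 'ASCII'));

            // Convert to UCS-4BE and read the result as a 32-bit big-endian integer.
            $result = unpack('N', mb_convert_encoding($string, 'UCS-4BE', 'UTF-8'));

            if (is_array($result) === true) {
                return $result[1];
            }
        }

        return ord($string);
    }
}

if (!function_exists('mb_chr')) {
    function mb_chr($string)
    {
        // Build a decimal numeric character reference and decode it.
        return mb_html_entity_decode('&#' . intval($string) . ';');
    }
}

var_dump(hexdec('010F')); // 271
var_dump(mb_ord('ó')); // 243
var_dump(mb_chr(243)); // ó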
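I just wrote a polyfill for the missing multibyte versions of ord and chr, with the following in mind:
- It defines the functions mb_ord and mb_chr only if they do not already exist. If they do exist in your framework or in a later version of PHP, the polyfill is ignored.
- It uses the widely available mbstring extension for the conversion. If the mbstring extension is not loaded, it falls back to the iconv extension.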
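The core of that approach is to pack the code point as a 32-bit big-endian integer, which is exactly its UCS-4BE representation, and then convert that to UTF-8. A minimal sketch, assuming either mbstring or iconv is available; the function name `codepoint_to_utf8` is hypothetical and is not part of the polyfill:

```php
<?php
// Hypothetical helper (not part of the polyfill): convert a code point to UTF-8.
function codepoint_to_utf8($codepoint)
{
    // A code point packed as a 32-bit big-endian integer is valid UCS-4BE.
    $ucs4 = pack('N', $codepoint);

    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($ucs4, 'UTF-8', 'UCS-4BE');
    }

    // Fallback when mbstring is not loaded; assumes the iconv extension is present.
    return iconv('UCS-4BE', 'UTF-8', $ucs4);
}

// U+010F is encoded as the two bytes C4 8F in UTF-8.
var_dump(codepoint_to_utf8(0x010F) === "\xC4\x8F"); // bool(true)
```

The same two branches explain why the polyfill can run with only one of the two extensions installed.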
Edit:
I also added functions for HTML entity encoding / decoding and for encoding / decoding to JSON format, as well as some demo code showing how to use these functions.
Code:
if (!function_exists('codepoint_encode')) {
    function codepoint_encode($str) {
        // json_encode() escapes non-ASCII characters as \uXXXX; strip the surrounding quotes.
        return substr(json_encode($str), 1, -1);
    }
}

if (!function_exists('codepoint_decode')) {
    function codepoint_decode($str) {
        return json_decode(sprintf('"%s"', $str));
    }
}

// iconv-based fallbacks, used only when mbstring is not loaded.
if (!function_exists('mb_internal_encoding')) {
    function mb_internal_encoding($encoding = NULL) {
        // Fixed: the original tested the undefined $from_encoding and called the
        // iconv functions without their required type argument.
        return ($encoding === NULL)
            ? iconv_get_encoding('internal_encoding')
            : iconv_set_encoding('internal_encoding', $encoding);
    }
}

if (!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
        return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
    }
}

if (!function_exists('mb_chr')) {
    function mb_chr($ord, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            // A 32-bit big-endian integer is the UCS-4BE form of the code point.
            return pack("N", $ord);
        } else {
            return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_ord')) {
    function mb_ord($char, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            // Read 4 bytes as a 32-bit integer, otherwise the first 2 bytes as a 16-bit integer.
            list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
            return $ord;
        } else {
            return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_htmlentities')) {
    function mb_htmlentities($string, $hex = true, $encoding = 'UTF-8') {
        // Replace every non-ASCII character with a numeric character reference.
        return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) use ($hex) {
            return sprintf($hex ? '&#x%X;' : '&#%d;', mb_ord($match[0]));
        }, $string);
    }
}

if (!function_exists('mb_html_entity_decode')) {
    function mb_html_entity_decode($string, $flags = null, $encoding = 'UTF-8') {
        return html_entity_decode($string, ($flags === NULL) ? ENT_COMPAT | ENT_HTML401 : $flags, $encoding);
    }
}
Usage:
echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));

echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));

echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tch&#252;&#223;'));

echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));

echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));
Output:
Get string from numeric DEC value
string(4) "ď"
string(2) "ď"

Get string from numeric HEX value
string(4) "ď"
string(2) "ď"

Get numeric value of character as DEC int
int(50319)
int(271)

Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"

Encode / decode to DEC based HTML entities
string(15) "tch&#252;&#223;"
string(7) "tchüß"

Encode / decode to HEX based HTML entities
string(15) "tch&#xFC;&#xDF;"
string(7) "tchüß"

Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"
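IntlChar is a new ICU-based built-in class (part of the intl extension), released with PHP 7, that solves this problem directly:
IntlChar provides a number of utility methods that can be used to access information about Unicode characters.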
// PHP 7.0 and later (requires the intl extension)
var_dump(
    "\u{010F}" === IntlChar::chr(0x010F),
    0x010F === IntlChar::ord("\u{010F}")
);

// PHP 7.2 and later (requires the mbstring extension)
var_dump(
    "\u{010F}" === mb_chr(0x010F, "UTF-8"),
    0x010F === mb_ord("\u{010F}", "UTF-8")
);
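If you control the string and it is UTF-8 encoded (as is recommended for Latin and other European scripts), all you need is
html_entity_decode($string, ENT_COMPAT, 'UTF-8');
See example #1 in the PHP manual. You can change the second argument to ENT_NOQUOTES and so on, and, if your string is markup (!), take care to use ENT_XHTML or a similar flag.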
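To apply this to a bare code point such as U+010F, one option is to build the numeric character reference first and then decode it. A minimal sketch; the variable names are illustrative only:

```php
<?php
// Build a decimal numeric character reference for the code point, then decode it.
$codepoint = 0x010F;
$char = html_entity_decode('&#' . $codepoint . ';', ENT_COMPAT, 'UTF-8');

// U+010F is encoded as the two bytes C4 8F in UTF-8.
var_dump($char === "\xC4\x8F"); // bool(true)
```

This needs neither mbstring nor intl, since html_entity_decode() is part of the PHP core.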
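<?php
// Encode a code point as UTF-8 by hand, without any extension.
function chr_utf8($n, $f = 'C*')
{
    if ($n < 0x80) { // 1 byte: plain ASCII
        return chr($n);
    }
    if ($n < 0x800) { // 2 bytes: 110xxxxx 10xxxxxx
        return pack($f, 0xC0 | ($n >> 6), 0x80 | ($n & 0x3F));
    }
    if ($n < 0x10000) { // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return pack($f, 0xE0 | ($n >> 12), 0x80 | (($n >> 6) & 0x3F), 0x80 | ($n & 0x3F));
    }
    if ($n < 0x110000) { // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return pack($f, 0xF0 | ($n >> 18), 0x80 | (($n >> 12) & 0x3F), 0x80 | (($n >> 6) & 0x3F), 0x80 | ($n & 0x3F));
    }
    return ''; // beyond U+10FFFF: not a valid code point
}

echo chr_utf8(hexdec('010F')); // Output the UTF-8 character corresponding to U+010F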
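For U+010F specifically, the bit arithmetic in chr_utf8 takes the two-byte branch. The worked example below spells out that single case step by step; the variable names are illustrative and are not taken from the function above:

```php
<?php
$n = 0x010F;                // 271, binary 1 0000 1111

// Lead byte: the marker 110xxxxx carries the bits above the low six.
$lead = 0xC0 | ($n >> 6);   // 0xC0 | 0x04 = 0xC4

// Continuation byte: the marker 10xxxxxx carries the low six bits.
$cont = 0x80 | ($n & 0x3F); // 0x80 | 0x0F = 0x8F

$char = pack('C*', $lead, $cont);
var_dump(bin2hex($char));   // string(4) "c48f"
```

The resulting bytes C4 8F match the 0xC48F value that appears in the demo output earlier in this thread.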