How to replace Microsoft-encoded quotes in PHP

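I need to replace Microsoft Word's versions of single and double quotation marks (“ ” ‘ ’) with regular quotes (' and ") because of an encoding problem in my application. I don't need them to be HTML entities, and I can't change my database schema.

I have two options: use a regular expression or an associative array.

Is there a better way to do this?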

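Considering that you only want to replace a few specific, well-identified characters, I would go for str_replace with an array: you obviously don't need the heavy artillery that regular expressions would bring you ;-)

And if you run into other special characters (damn copy-paste from Word...), you can just add them to that array whenever necessary, as they are identified.

Edit: the best answer I can give to your comment is probably this link: Convert Smart Quotes with PHP

And the relevant code (quoted from that page):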

    function convert_smart_quotes($string)
    {
        $search = array(chr(145), chr(146), chr(147), chr(148), chr(151));
        $replace = array("'", "'", '"', '"', '-');

        return str_replace($search, $replace, $string);
    }

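(I don't have MS Word on this computer, so I can't test it myself.)

I don't remember exactly what we use at work (I'm not the one who has to deal with that kind of input), but it was the same kind of thing.

I found an answer to this question. You only need one line of code using PHP's iconv() function: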

    // Replace Microsoft Word versions of single and double quotation marks (“ ” ‘ ’) with regular quotes (' and ")
    $output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);

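Your Microsoft-encoded quotes are probably typographic quotation marks. If you know the encoding of the string in which you want to replace them, you can simply replace them with str_replace.

Here is an example for UTF-8, this time using a mapping array and strtr: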

    $quotes = array(
        "\xC2\xAB"     => '"', // « (U+00AB) in UTF-8
        "\xC2\xBB"     => '"', // » (U+00BB) in UTF-8
        "\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8
        "\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8
        "\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8
        "\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8
        "\xE2\x80\x9C" => '"', // “ (U+201C) in UTF-8
        "\xE2\x80\x9D" => '"', // ” (U+201D) in UTF-8
        "\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8
        "\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8
        "\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8
        "\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8
    );
    $str = strtr($str, $quotes);

If you need another encoding, you can use mb_convert_encoding to convert the keys.
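The key conversion could look roughly like the sketch below. It assumes the mbstring extension is loaded, and `convert_quote_map` is a hypothetical helper name, not something from PHP itself or from the answer above. Characters that don't exist in the target encoding come back as mbstring's substitute character, so the sketch drops every key that doesn't survive a round trip instead of mapping `?` to a quote by accident.

```php
<?php
// Hypothetical helper (an assumption, not part of the original answer):
// re-encode the keys of a UTF-8 quote map into another encoding.
function convert_quote_map(array $utf8Map, string $targetEncoding): array
{
    $converted = [];
    foreach ($utf8Map as $utf8Key => $replacement) {
        $key = mb_convert_encoding($utf8Key, $targetEncoding, 'UTF-8');
        // Skip characters the target encoding cannot represent: they do not
        // convert back to the original UTF-8 byte sequence.
        if (mb_convert_encoding($key, 'UTF-8', $targetEncoding) !== $utf8Key) {
            continue;
        }
        $converted[$key] = $replacement;
    }
    return $converted;
}

$quotes = [
    "\xE2\x80\x98" => "'", // U+2018
    "\xE2\x80\x99" => "'", // U+2019
    "\xE2\x80\x9B" => "'", // U+201B, has no Windows-1252 equivalent
    "\xE2\x80\x9C" => '"', // U+201C
    "\xE2\x80\x9D" => '"', // U+201D
];

$cp1252Quotes = convert_quote_map($quotes, 'Windows-1252');

// The input here is assumed to be Windows-1252 encoded.
$str = strtr("\x93quoted\x94", $cp1252Quotes);
```

With the map converted this way, the same strtr call works on Windows-1252 input, and the U+201B entry is simply left out.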

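If, like me, you arrived here with a CMS or RTE full of broken ASCII / MS Word characters and iconv isn't working for you, then this crazy function might be for you.

When you save this function to a file, make sure the file's encoding is UTF-8.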

    <?php
    /**
     * fixMSWord
     *
     * Replace ascii chars with utf8. Note there are ascii characters that don't
     * correctly map and will be replaced by spaces.
     *
     * @author Robin Cafolla
     * @date 2013-03-22
     * @Copyright (c) 2013 Robin Cafolla
     * @licence MIT (x11) http://opensource.org/licenses/MIT
     */
    function fixMSWord($string)
    {
        $map = Array(
            '33' => '!', '34' => '"', '35' => '#', '36' => '$', '37' => '%', '38' => '&', '39' => "'", '40' => '(',
            '41' => ')', '42' => '*', '43' => '+', '44' => ',', '45' => '-', '46' => '.', '47' => '/', '48' => '0',
            '49' => '1', '50' => '2', '51' => '3', '52' => '4', '53' => '5', '54' => '6', '55' => '7', '56' => '8',
            '57' => '9', '58' => ':', '59' => ';', '60' => '<', '61' => '=', '62' => '>', '63' => '?', '64' => '@',
            '65' => 'A', '66' => 'B', '67' => 'C', '68' => 'D', '69' => 'E', '70' => 'F', '71' => 'G', '72' => 'H',
            '73' => 'I', '74' => 'J', '75' => 'K', '76' => 'L', '77' => 'M', '78' => 'N', '79' => 'O', '80' => 'P',
            '81' => 'Q', '82' => 'R', '83' => 'S', '84' => 'T', '85' => 'U', '86' => 'V', '87' => 'W', '88' => 'X',
            '89' => 'Y', '90' => 'Z', '91' => '[', '92' => '\\', '93' => ']', '94' => '^', '95' => '_', '96' => '`',
            '97' => 'a', '98' => 'b', '99' => 'c', '100'=> 'd', '101'=> 'e', '102'=> 'f', '103'=> 'g', '104'=> 'h',
            '105'=> 'i', '106'=> 'j', '107'=> 'k', '108'=> 'l', '109'=> 'm', '110'=> 'n', '111'=> 'o', '112'=> 'p',
            '113'=> 'q', '114'=> 'r', '115'=> 's', '116'=> 't', '117'=> 'u', '118'=> 'v', '119'=> 'w', '120'=> 'x',
            '121'=> 'y', '122'=> 'z', '123'=> '{', '124'=> '|', '125'=> '}', '126'=> '~', '127'=> ' ', '128'=> '&#8364;',
            '129'=> ' ', '130'=> ',', '131'=> ' ', '132'=> '"', '133'=> '.', '134'=> ' ', '135'=> ' ', '136'=> '^',
            '137'=> ' ', '138'=> ' ', '139'=> '<', '140'=> ' ', '141'=> ' ', '142'=> ' ', '143'=> ' ', '144'=> ' ',
            '145'=> "'", '146'=> "'", '147'=> '"', '148'=> '"', '149'=> '.', '150'=> '-', '151'=> '-', '152'=> '~',
            '153'=> ' ', '154'=> ' ', '155'=> '>', '156'=> ' ', '157'=> ' ', '158'=> ' ', '159'=> ' ', '160'=> ' ',
            '161'=> '¡', '162'=> '¢', '163'=> '£', '164'=> '¤', '165'=> '¥', '166'=> '¦', '167'=> '§', '168'=> '¨',
            '169'=> '©', '170'=> 'ª', '171'=> '«', '172'=> '¬', '173'=> '', '174'=> '®', '175'=> '¯', '176'=> '°',
            '177'=> '±', '178'=> '²', '179'=> '³', '180'=> '´', '181'=> 'µ', '182'=> '¶', '183'=> '·', '184'=> '¸',
            '185'=> '¹', '186'=> 'º', '187'=> '»', '188'=> '¼', '189'=> '½', '190'=> '¾', '191'=> '¿', '192'=> 'À',
            '193'=> 'Á', '194'=> 'Â', '195'=> 'Ã', '196'=> 'Ä', '197'=> 'Å', '198'=> 'Æ', '199'=> 'Ç', '200'=> 'È',
            '201'=> 'É', '202'=> 'Ê', '203'=> 'Ë', '204'=> 'Ì', '205'=> 'Í', '206'=> 'Î', '207'=> 'Ï', '208'=> 'Ð',
            '209'=> 'Ñ', '210'=> 'Ò', '211'=> 'Ó', '212'=> 'Ô', '213'=> 'Õ', '214'=> 'Ö', '215'=> '×', '216'=> 'Ø',
            '217'=> 'Ù', '218'=> 'Ú', '219'=> 'Û', '220'=> 'Ü', '221'=> 'Ý', '222'=> 'Þ', '223'=> 'ß', '224'=> 'à',
            '225'=> 'á', '226'=> 'â', '227'=> 'ã', '228'=> 'ä', '229'=> 'å', '230'=> 'æ', '231'=> 'ç', '232'=> 'è',
            '233'=> 'é', '234'=> 'ê', '235'=> 'ë', '236'=> 'ì', '237'=> 'í', '238'=> 'î', '239'=> 'ï', '240'=> 'ð',
            '241'=> 'ñ', '242'=> 'ò', '243'=> 'ó', '244'=> 'ô', '245'=> 'õ', '246'=> 'ö', '247'=> '÷', '248'=> 'ø',
            '249'=> 'ù', '250'=> 'ú', '251'=> 'û', '252'=> 'ü', '253'=> 'ý', '254'=> 'þ', '255'=> 'ÿ'
        );

        // Key the translation table by the raw single byte for each code.
        $translation = Array();
        foreach ($map as $s => $r) {
            $translation[chr((int)$s)] = $r;
        }

        // strtr() makes a single pass over the input. str_replace() with arrays
        // applies the pairs one after another, so the 0xC2/0xC3 lead bytes written
        // for characters such as '¡' would be replaced again by later pairs.
        return strtr($string, $translation);
    }

We used the following. It handles a few special characters.

    $text = str_replace(chr(130), ',', $text);   // baseline single quote
    $text = str_replace(chr(132), '"', $text);   // baseline double quote
    $text = str_replace(chr(133), '...', $text); // ellipsis
    $text = str_replace(chr(145), "'", $text);   // left single quote
    $text = str_replace(chr(146), "'", $text);   // right single quote
    $text = str_replace(chr(147), '"', $text);   // left double quote
    $text = str_replace(chr(148), '"', $text);   // right double quote
    $text = mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8');
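One caveat about that last line: as far as I know, passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated as of PHP 8.2. A rough sketch of an alternative using mb_encode_numericentity follows. The helper name `encode_non_ascii_as_entities` is made up for illustration, and unlike the old call it emits numeric entities (such as `&#233;`) rather than named ones.

```php
<?php
// Hypothetical helper (an assumption, not from the answer above): encode every
// non-ASCII character of a UTF-8 string as a numeric HTML entity.
function encode_non_ascii_as_entities(string $utf8Text): string
{
    // Convmap layout: [first code point, last code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($utf8Text, $convmap, 'UTF-8');
}

echo encode_non_ascii_as_entities("caf\u{E9}"), "\n"; // caf&#233;
```

Plain ASCII passes through untouched, so this can run after the str_replace calls without affecting the quotes they produced.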

Every one of the previous answers except @Gumbo's will break Unicode strings:

    echo convert_smart_quotes("This is Yi: ꑑ. Point ⒒ this breaks Yi. Yi broke–why? I need a longer––point. This makes Han 嗗 mad.");

The result is:

    This is Yi: ?''. Point ?'' this breaks Yi. Yi broke?"why? I need a longer?"?"point. This makes Han ?-- mad.
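To see why this happens: the Yi syllable in that string (U+A451) is the bytes EA 91 91 in UTF-8, and 0x91 is exactly chr(145), so a byte-level replacement eats part of the character. Here is a minimal sketch of one way to guard against that, assuming mbstring is available. `replace_smart_quotes` is a hypothetical name, and treating anything that isn't valid UTF-8 as Windows-1252 is only a heuristic.

```php
<?php
// U+A451 encodes to EA 91 91 in UTF-8; 0x91 is the same byte as chr(145).
$yi = "\u{A451}";
echo bin2hex($yi), "\n"; // ea9191

// Hypothetical helper (an assumption, not from any answer above).
function replace_smart_quotes(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        // Valid UTF-8: replace whole multi-byte sequences only.
        return strtr($text, [
            "\u{2018}" => "'", "\u{2019}" => "'",
            "\u{201C}" => '"', "\u{201D}" => '"',
        ]);
    }

    // Not valid UTF-8: assume Windows-1252 and replace single bytes.
    return str_replace(
        [chr(145), chr(146), chr(147), chr(148)],
        ["'", "'", '"', '"'],
        $text
    );
}
```

The check is not bulletproof, since some Windows-1252 byte sequences happen to be valid UTF-8, but it keeps the byte-level replacement away from strings that are clearly Unicode.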

With iconv:

    $output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);

The result is:

    PHP Notice: iconv(): Detected an illegal character in input string in php shell code on line 1

You can change it to //IGNORE, which drops the characters instead of transliterating them.

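If your Microsoft quotes are already in Unicode and you need to replace them, use Gumbo's answer. For quotes encoded in CP1252, this is the best way to replace them: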

    function convert_cp1252_to_ascii($input, $default = '')
    {
        if ($input === null || $input == '') {
            return $default;
        }

        // https://en.wikipedia.org/wiki/UTF-8
        // https://en.wikipedia.org/wiki/ISO/IEC_8859-1
        // https://en.wikipedia.org/wiki/Windows-1252
        // http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
        $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
        if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
            /*
             * Use the search/replace arrays if a character needs to be replaced with
             * something other than its Unicode equivalent.
             */
            $replace = array(
                128 => "E",    // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN
                129 => "",     // UNDEFINED
                130 => ",",    // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK
                131 => "f",    // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK
                132 => ",,",   // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK
                133 => "...",  // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS
                134 => "t",    // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER
                135 => "T",    // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER
                136 => "^",    // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT
                137 => "%",    // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN
                138 => "S",    // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON
                139 => "<",    // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK
                140 => "OE",   // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE
                141 => "",     // UNDEFINED
                142 => "Z",    // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON
                143 => "",     // UNDEFINED
                144 => "",     // UNDEFINED
                145 => "'",    // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK
                146 => "'",    // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK
                147 => "\"",   // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK
                148 => "\"",   // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK
                149 => "*",    // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET
                150 => "-",    // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH
                151 => "--",   // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH
                152 => "~",    // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE
                153 => "TM",   // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN
                154 => "s",    // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON
                155 => ">",    // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
                156 => "oe",   // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE
                157 => "",     // UNDEFINED
                158 => "z",    // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON
                159 => "Y",    // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS
            );

            $find = array();
            foreach (array_keys($replace) as $key) {
                $find[] = chr($key);
            }
            $input = str_replace($find, array_values($replace), $input);

            /*
             * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
             * and control characters, always convert from Windows-1252 to UTF-8.
             */
            $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
        }

        return $input;
    }

Adapted, with some modifications, from this answer. If you want control over what gets found and replaced, use this function.