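How do I check whether a string is Unicode or ASCII?

What do I have to do in Python to figure out which encoding a string has?

In Python 3, all strings are sequences of Unicode characters. There is a separate bytes type that holds raw bytes.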
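As a minimal sketch of that Python 3 distinction (the helper name `describe` is made up for illustration and is not part of any library):

```python
def describe(value):
    """Return a short label for the kind of string-like object passed in."""
    if isinstance(value, str):
        return "text (str, Unicode code points)"
    if isinstance(value, (bytes, bytearray)):
        return "raw bytes"
    return "not a string"


print(describe("héllo"))           # a str literal
print(describe("héllo".encode()))  # str.encode() defaults to UTF-8 and returns bytes
print(describe(42))
```

The point of the sketch is only that the type tells you whether you hold text or bytes; it says nothing about which encoding the bytes use.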

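In Python 2, a string may be of type str or of type unicode. You can tell which with code like this:

    # Python 2 only: `unicode` and the print statement do not exist in Python 3
    def whatisthis(s):
        if isinstance(s, str):
            print "ordinary string"
        elif isinstance(s, unicode):
            print "unicode string"
        else:
            print "not a string"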

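Just do:

    type(s)

For one kind of string it will say unicode, for the other it will say str.

You can handle them separately using isinstance, e.g.:

    # Python 2 only
    if isinstance(s, str):
        print 's is a string object'
    elif isinstance(s, unicode):
        print 's is a unicode object'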

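Or do you mean that you have a str and you are trying to figure out whether it is encoded as ASCII, UTF-8, or something else?

In that case, try this:

    s.decode('ascii')

If it raises an exception, the string is not 100% ASCII.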
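Wrapped in a helper, that check might look like the following sketch. It assumes Python 3, where the undecoded data is a `bytes` object; the function name `is_ascii_bytes` is hypothetical.

```python
def is_ascii_bytes(data):
    """Return True if every byte in `data` is in the ASCII range (0-127)."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


print(is_ascii_bytes(b"plain text"))            # True
print(is_ascii_bytes("naïve".encode("utf-8")))  # False: ï encodes to non-ASCII bytes
```

Note that a positive result only shows the bytes are valid ASCII; since ASCII is a subset of UTF-8 and many other encodings, it does not tell you what the producer intended.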

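In Python 3.x, all strings are sequences of Unicode characters, so an isinstance check against str (which is a Unicode string by default) is enough:

    isinstance(x, str)

In Python 2.x, most people seem to use an if statement with two checks: one for str and one for unicode.

If you want to check whether you have a "string-like" object in a single statement, you can do the following (Python 2 only, since basestring does not exist in Python 3):

    isinstance(x, basestring)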

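Unicode is not an encoding. To quote Kumar McMillan:

If ASCII, UTF-8, and other byte strings are "text" ...

... then Unicode is "text-ness";

it is the abstract form of text.

Have a read of McMillan's talk "Unicode In Python, Completely Demystified" from PyCon 2008; it explains things a lot better than most of the related answers on Stack Overflow.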

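If your code needs to be compatible with both Python 2 and Python 3, you can't directly use things like isinstance(s, bytes) or isinstance(s, unicode) without wrapping them in try/except or a Python version test, because unicode is undefined in Python 3 and bytes is undefined in Python 2 before 2.6 (in 2.6 and 2.7 it is only an alias for str).

There are some ugly workarounds. An extremely ugly one is to compare the name of the type instead of comparing the type itself. Here's an example:

    # convert bytes (Python 3) or unicode (Python 2) to str
    if str(type(s)) == "<class 'bytes'>":
        # only possible in Python 3
        s = s.decode('ascii')  # or  s = str(s)[2:-1]
    elif str(type(s)) == "<type 'unicode'>":
        # only possible in Python 2
        s = str(s)

An arguably slightly less ugly workaround is to check the Python version number, e.g.:

    import sys

    if sys.version_info >= (3, 0, 0):
        # for Python 3
        if isinstance(s, bytes):
            s = s.decode('ascii')  # or  s = str(s)[2:-1]
    else:
        # for Python 2
        if isinstance(s, unicode):
            s = str(s)

Both of these are unpythonic, and most of the time there is probably a better way.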
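One alternative, sketched here under the assumption that a single try/except around the Python 2-only name is acceptable, is to resolve the type names once at import time. The names `text_type`, `binary_type` and `to_native_str` are chosen for illustration only.

```python
try:
    text_type = unicode      # Python 2: the separate Unicode text type
    binary_type = str        # Python 2: str holds bytes
except NameError:
    text_type = str          # Python 3: str is already Unicode text
    binary_type = bytes


def to_native_str(s, encoding="ascii"):
    """Convert text or bytes to the interpreter's native str type."""
    if isinstance(s, str):
        return s                      # already the native type
    if isinstance(s, binary_type):
        return s.decode(encoding)     # only reached on Python 3 (bytes -> str)
    if isinstance(s, text_type):
        return s.encode(encoding)     # only reached on Python 2 (unicode -> str)
    raise TypeError("expected a string, got %r" % type(s))


print(to_native_str(b"abc"))
print(to_native_str("abc"))
```

The rest of the code can then test against `text_type` and `binary_type` without mentioning the interpreter version again.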

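Use:

    import six  # third-party compatibility library

    if isinstance(obj, six.text_type):
        print('obj is a text (unicode) string')

Inside the six library, the related name string_types is defined as follows (text_type is defined analogously: str on Python 3, unicode on Python 2):

    if PY3:
        string_types = str,
    else:
        string_types = basestring,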

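Note that in Python 3, it is not really fair to say any of:

str is UTFx for any x (e.g. UTF8)
str is Unicode
str is an ordered collection of Unicode characters

Python's str type is (normally) a sequence of Unicode code points, some of which map to characters.

Even in Python 3, answering this question is not as simple as you might imagine.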

An obvious way to test for ASCII-compatible strings is to attempt an encode:

    "Hello there!".encode("ascii")
    #>>> b'Hello there!'

    "Hello there... ☃!".encode("ascii")
    #>>> Traceback (most recent call last):
    #>>>   File "", line 4, in <module>
    #>>> UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 15: ordinal not in range(128)

The error distinguishes the two cases.
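Turned into a predicate, the attempted encode could look like this sketch (Python 3; the function name `is_ascii` is an assumption for illustration):

```python
def is_ascii(text):
    """Return True if `text` (a str) contains only ASCII characters."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii("Hello there!"))        # True
print(is_ascii("Hello there... ☃!"))   # False
```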

In Python 3, there are even some strings that contain invalid Unicode code points:

    "Hello there!".encode("utf8")
    #>>> b'Hello there!'

    "\udcc3".encode("utf8")
    #>>> Traceback (most recent call last):
    #>>>   File "", line 19, in <module>
    #>>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

The same method is used to distinguish them.
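Both checks can be combined into one rough classifier. This is only a sketch for Python 3; the function name `classify` and its three labels are invented here.

```python
def classify(text):
    """Label a str as 'ascii', 'unicode', or 'invalid' (not encodable as UTF-8)."""
    for encoding, label in (("ascii", "ascii"), ("utf-8", "unicode")):
        try:
            text.encode(encoding)
        except UnicodeEncodeError:
            continue  # try the next, more permissive encoding
        return label
    return "invalid"


print(classify("Hello there!"))        # ascii
print(classify("Hello there... ☃!"))   # unicode
print(classify("\udcc3"))              # invalid (lone surrogate)
```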

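You could use the Universal Encoding Detector, but be aware that it only gives you a best guess, not the actual encoding, because it is impossible to know the encoding of a string such as "abc". You need to get the encoding information from somewhere else; the HTTP protocol, for example, uses the Content-Type header for that.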
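The ambiguity is easy to demonstrate with the standard library alone. The sketch below (Python 3; `candidate_encodings` is a hypothetical helper, and the three encodings tried are an arbitrary choice) lists every encoding that decodes the given bytes without error:

```python
def candidate_encodings(data, encodings=("ascii", "utf-8", "latin-1")):
    """Return the encodings from `encodings` that decode `data` without error."""
    matches = []
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


print(candidate_encodings(b"abc"))        # ['ascii', 'utf-8', 'latin-1']
print(candidate_encodings(b"\xc3\xa9"))   # ['utf-8', 'latin-1']
```

More than one encoding succeeds in both cases, which is why a detector can only rank guesses and why out-of-band information is needed to be sure.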

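This may help someone else. I started out testing the string type of the variable s, but for my application it made more sense to simply return s as UTF-8. The process calling return_utf then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend it to be Python-version agnostic, without a version test or importing six. Please comment with improvements to the sample code below to help other people.

    def return_utf(s):
        # strings and plain numbers are converted directly
        if isinstance(s, str):
            return s.encode('utf-8')
        if isinstance(s, int):
            return str(s).encode('utf-8')
        if isinstance(s, float):
            return str(s).encode('utf-8')
        if isinstance(s, complex):
            return str(s).encode('utf-8')
        # anything else: try to encode it, falling back to its str() form
        try:
            return s.encode('utf-8')
        except TypeError:
            try:
                return str(s).encode('utf-8')
            except AttributeError:
                return s
        except AttributeError:
            return s
        return s  # assume it was already utf-8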
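If only Python 3 has to be supported, the same idea can be written more compactly. This is a sketch under that assumption, not a drop-in replacement for the version-agnostic function above; the name `to_utf8` is made up.

```python
def to_utf8(value):
    """Return `value` as UTF-8 encoded bytes (Python 3 only)."""
    if isinstance(value, bytes):
        return value                      # assumed to be UTF-8 already
    if isinstance(value, str):
        return value.encode("utf-8")
    return str(value).encode("utf-8")     # numbers and other objects


print(to_utf8("é"))
print(to_utf8(1.5))
```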
