Python - Encoding and Unicode


Default

import sys
print sys.getdefaultencoding()
Cp1252

In PyDev, you can change it in the Run Configuration. The same statement then prints:

UTF-8

How to

get the console encoding

stdout:

import sys
print sys.stdout.encoding
Cp1252

get the system file encoding

print sys.getfilesystemencoding()
mbcs

mbcs stands for multi-byte character set. On Windows, it refers to the current ANSI code page.
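The three lookups above can be gathered in one place. This is a minimal sketch, written so that it runs on both Python 2.7 and Python 3; the helper name `describe_encodings` is hypothetical and only used for illustration. The values depend on the platform and on how the interpreter was started, so no output is shown.

```python
from __future__ import print_function
import sys


def describe_encodings():
    """Return the default, console and file system encodings as a dict."""
    return {
        "default": sys.getdefaultencoding(),
        # sys.stdout may have been replaced by an object without an
        # encoding attribute, hence the getattr with a fallback.
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in sorted(describe_encodings().items()):
    print(name, "=", value)
```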

get rid of the BOM

s = u"This is a unicode string".encode('utf-8-sig')
print s # You will see the BOM
print s.decode('utf-8-sig')
This is a unicode string
This is a unicode string
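Because the BOM is often invisible on screen, it can help to look at the bytes instead. The sketch below, which runs on Python 2.7 and 3, compares the `utf-8-sig` and `utf-8` codecs on the same data; the variable names are illustrative only.

```python
import codecs

data = u"This is a unicode string".encode("utf-8-sig")

# The encoded bytes start with the three-byte UTF-8 BOM.
has_bom = data.startswith(codecs.BOM_UTF8)

# 'utf-8-sig' removes a leading BOM while decoding.
text_without_bom = data.decode("utf-8-sig")

# Plain 'utf-8' keeps it as the character U+FEFF at position 0.
text_with_bom = data.decode("utf-8")
```

Decoding with `utf-8-sig` is also safe when the data has no BOM: the codec then behaves like `utf-8`.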

Environment variable

set PYTHONIOENCODING=UTF-8
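The `set` command above is the Windows command prompt syntax. To check the effect of the variable without changing the current shell, one option is to start a child interpreter with a modified environment. This is a sketch that assumes `sys.executable` points to a usable interpreter; the variable names are illustrative.

```python
import os
import subprocess
import sys

# Copy the current environment and add the variable for the child only.
child_env = dict(os.environ, PYTHONIOENCODING="UTF-8")

# The child writes the encoding of its own standard output.
child_code = "import sys; sys.stdout.write(sys.stdout.encoding)"
raw = subprocess.check_output([sys.executable, "-c", child_code], env=child_env)

child_encoding = raw.decode("ascii").strip()
```

The exact spelling of the reported name (for example `UTF-8` or `utf-8`) may vary between Python versions.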

Support

'charmap' codec can't encode character u'\ufeff'

UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

The character \ufeff is the byte order mark (BOM).
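The error can be reproduced without a console by encoding to Cp1252 directly. One possible fix, sketched below for Python 2.7 and 3, is to strip the leading BOM before encoding; the sample text is made up for the example.

```python
text = u"\ufeffhello"

# Cp1252 has no mapping for U+FEFF, so encoding the raw text fails.
try:
    text.encode("cp1252")
    failed = False
except UnicodeEncodeError:
    failed = True

# Remove the BOM at the start of the text, then encode.
cleaned = text.lstrip(u"\ufeff")
encoded = cleaned.encode("cp1252")
```

When the text comes from a file, decoding it with `utf-8-sig` avoids the problem at the source.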

UnicodeEncodeError: 'charmap' codec can't encode character

A UnicodeEncodeError is raised when a unicode string is encoded into an encoding that cannot represent one of its characters.

When printing a unicode string, Python encodes it with the output encoding (sys.stdout.encoding). On a Windows console that uses Cp1252,

print u"\u20AC"

is therefore equivalent to:

print u"\u20AC".encode('Cp1252')

U+20AC is the Euro sign, which is present in code page (cp) 1252.
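The mapping can be checked by looking at the encoded bytes. In this small sketch (Python 2.7 and 3), the Euro sign becomes a single byte in Cp1252 and three bytes in UTF-8.

```python
euro = u"\u20AC"

# In Cp1252 the Euro sign is the single byte 0x80.
euro_cp1252 = euro.encode("cp1252")

# In UTF-8 the same character takes three bytes.
euro_utf8 = euro.encode("utf-8")

# Decoding the Cp1252 byte gives the original character back.
round_trip = euro_cp1252.decode("cp1252")
```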

An encoding maps only a limited number of Unicode characters to byte strings (str). A character that is not part of the character set causes the encode() function of that encoding to fail, because the character set does not support every character.

For instance, the White heart suit (U+2661) is not present in the Cp1252 character set.

If you then try to print it, you will get a UnicodeEncodeError.

print u"\u2661"
Traceback (most recent call last):
  File "D:\workspace\PythonWorkpsace\mypackage\Test.py", line 1, in <module>
    print u"\u2661"
  File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2661' in position 0: character maps to <undefined>

To resolve this problem, you can:

  • encode it with a character set that supports it:
print u"\u2661".encode('utf-8')
  • use the 'replace' error handler of the encode function. It replaces each character that cannot be encoded with a ?
print u"\u2661".encode(sys.getdefaultencoding(), 'replace')
?
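Besides 'replace', the encode function accepts other standard error handlers. The sketch below (Python 2.7 and 3) applies several of them to the white heart suit with Cp1252 as the target, so that the result does not depend on the local default encoding.

```python
heart = u"\u2661"

# UTF-8 can represent every Unicode character.
as_utf8 = heart.encode("utf-8")

# 'replace' substitutes a question mark.
replaced = heart.encode("cp1252", "replace")

# 'ignore' drops the character silently.
ignored = heart.encode("cp1252", "ignore")

# 'xmlcharrefreplace' writes a numeric character reference.
as_xml = heart.encode("cp1252", "xmlcharrefreplace")

# 'backslashreplace' writes a Python-style escape sequence.
as_escape = heart.encode("cp1252", "backslashreplace")
```

'replace' and 'ignore' lose information, while the last two handlers keep the code point in a form that can be recovered later.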
