Python error: invalid continuation byte

Why is the code below failing, and why does it succeed with the "latin-1" codec?

o = "a test of xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")

Which results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte

asked Apr 5, 2011 at 13:23 by RuiDC
I had the same error when I tried to open a CSV file with the pandas.read_csv
method.

The solution was to change the encoding to latin-1:

pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, encoding='latin-1')

answered Jul 18, 2015 at 15:33 by Mazen Aly

In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:

>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'

But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in latin 1. You can see how UTF-8 and latin 1 look different:

>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'

(Note, I’m using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)

answered Apr 5, 2011 at 13:29 by Josh Lee

It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.

If you don’t know the codeset you’re receiving strings in, you’re in a bit of trouble. It would be best if a single codeset (hopefully UTF-8) would be chosen for your protocol/application and then you’d just reject ones that didn’t decode.

If you can’t do that, you’ll need heuristics.
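
One simple heuristic, sketched below, is to try a short list of candidate encodings in order and keep the first one that decodes (the helper name guess_decode is made up for illustration):

def guess_decode(raw_bytes, candidates=('utf-8', 'latin-1')):
    # Try each candidate encoding and return the decoded text plus the encoding used.
    for enc in candidates:
        try:
            return raw_bytes.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every possible byte, so with these defaults this line is never reached
    raise ValueError('no candidate encoding matched')

text, used = guess_decode(b'a test of \xe9 char')
print(used)  # latin-1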

answered Apr 5, 2011 at 13:26 by Sami J. Lehtinen

Because UTF-8 is multibyte and there is no character corresponding to your combination of \xe9 plus the following space.

Why should it succeed in both utf-8 and latin-1?

Here is how the same sentence should look in utf-8:

>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'

answered Apr 5, 2011 at 13:28 by neurino

If this error arises when manipulating a file that was just opened, check to see if you opened it in 'rb' mode
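
A rough sketch of the difference (the file name is just a placeholder):

# Text mode decodes on the fly and can raise UnicodeDecodeError:
with open('data.bin') as f:          # default mode 'r', default text encoding
    payload = f.read()

# Binary mode returns raw bytes and never tries to decode:
with open('data.bin', 'rb') as f:
    payload = f.read()               # bytes object; decode later, or not at all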

answered Jul 4, 2018 at 23:09 by Patrick Mutuku

Use this if it shows the UTF-8 error:

pd.read_csv('File_name.csv', encoding='latin-1')

answered Apr 14, 2020 at 7:21 by Anshul Singh Suryan

A UTF-8 codec error usually appears when character values fall outside the range 0 to 127.

The reason this exception is raised:

1) If the code point is < 128, each byte is the same as the value of the code point.
2) If the code point is 128 or greater, the Unicode string can't be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

To overcome this we have a set of other encodings; the most widely used is Latin-1, also known as ISO-8859-1.

In ISO-8859-1, Unicode code points 0–255 are identical to the Latin-1 byte values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can't be encoded into Latin-1.
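
For illustration, a quick interactive check (assuming a Python 3 shell; the exact traceback text may vary slightly):

>>> 'é'.encode('latin-1')   # code point 0xE9 fits in a single Latin-1 byte
b'\xe9'
>>> 'é'.encode('utf-8')     # the same character needs two bytes in UTF-8
b'\xc3\xa9'
>>> '€'.encode('latin-1')   # code point U+20AC > 255, so Latin-1 cannot encode it
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)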

When this exception occurs while you are trying to load a data set, try this format:

df = pd.read_csv("top50.csv", encoding='ISO-8859-1')

Adding the encoding argument to the call lets the data set load.

answered Jan 18, 2020 at 14:37 by surya

This type of error appears when you are reading a particular file or data set with pandas, such as:

data = pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv')

Then the error is displayed like this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 1: invalid continuation byte

This type of error can be avoided by adding an encoding argument:

data = pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv', encoding='ISO-8859-1')

answered Jun 26, 2020 at 17:59 by Aditya Aggarwal

This happened to me too, while I was reading text containing Hebrew from a .txt file.

I clicked File -> Save As and saved the file with UTF-8 encoding.

answered Feb 21, 2019 at 7:53 by Alon Gouldman

TLDR: I would recommend investigating the source of the problem in depth before switching encoders to silence the error.

I got this error as I was processing a large number of zip files with additional zip files in them.

My workflow was the following:

  1. Read zip
  2. Read child zip
  3. Read text from child zip

At some point I was hitting the encoding error above. Upon closer inspection, it turned out that some child zips erroneously contained further zips. Reading these zips as text led to some funky character representation that I could silence with encoding="latin-1", but which in turn caused issues further down the line. Since I was working with international data it was not completely foolish to assume it was an encoding problem (I had problems with 0xc2: Â), but in the end it was not the actual issue.

answered Apr 17, 2022 at 10:32 by malvoisen

In this case, I was trying to execute a .py script which ran a path/file.sql.

My solution was to change the encoding of file.sql to "UTF-8 without BOM", and it worked!

You can do it with Notepad++.

I will leave a part of my code:

con = psycopg2.connect(host=sys.argv[1], port=sys.argv[2], dbname=sys.argv[3],
                       user=sys.argv[4], password=sys.argv[5])

cursor = con.cursor()
sqlfile = open(path, 'r')
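
If converting the file by hand is not convenient, Python's built-in utf-8-sig codec should also strip a leading UTF-8 BOM transparently. A minimal sketch, assuming path points to the same .sql file as above:

import io

sqlfile = io.open(path, 'r', encoding='utf-8-sig')  # utf-8-sig skips a UTF-8 BOM if present
sql = sqlfile.read()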

answered Jun 19, 2019 at 21:26 by Martin Taco

I encountered this problem, and it turned out that I had saved my CSV directly from a Google Sheets file. In other words, while in the Google Sheets file I chose Save a copy, then when my browser downloaded it I chose Open, and then I directly saved the CSV. This was the wrong move.

What fixed it for me was first saving the sheet as an .xlsx file on my local computer, and from there exporting the single sheet as .csv. Then the error went away for pd.read_csv('myfile.csv').

answered Sep 26, 2022 at 19:21 by Nesha25

The solution was to change the encoding to "UTF-8 without BOM".

answered Jun 2, 2021 at 21:06 by masilva70

One error that you might encounter when working with Python is:

UnicodeDecodeError: invalid continuation byte

This error occurs when you try to decode a bytes object with an encoding that doesn’t support that character.

This tutorial shows an example that causes this error and how to fix it.

How to reproduce this error

Suppose you have a bytes object in your Python code as follows:

bytes_obj = b"\xe1 b c"

Next, you want to decode the bytes object using the utf-8 encoding like this:

str_obj = bytes_obj.decode('utf-8')

Output:

Traceback (most recent call last):
  File "main.py", line 3, in <module>
    str_obj = bytes_obj.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 
in position 0: invalid continuation byte

You get an error because the byte \xe1 in the bytes object is the á character encoded using the latin-1 encoding.

How to fix this error

To resolve this error, you need to change the encoding used in the decode() method to latin-1 as follows:

bytes_obj = b"xe1 b c"

str_obj = bytes_obj.decode('latin-1')

print(str_obj)  # á b c

Note that this time the decode() method runs without any error.

You can also get this error when running other methods, such as the pandas read_csv() method.

You need to specify the encoding used by the method as follows:

pd.read_csv('example.csv', encoding='latin-1')

The same also works when you use the open() function to work with files:

csv_file = open('example.csv', encoding='latin-1')

# or:
with open('example.csv', encoding='latin-1') as file:
    content = file.read()

If you only want to read the files without modifying the content, you can use the open() function in 'rb' (read binary) mode.

Here’s an example when you parse an HTML file using Beautiful Soup:

soup = BeautifulSoup(open('index.html', 'rb'), 'html.parser') 

print(soup.get_text())

When you decode the bytes object, you need to use the encoding that supports the object.

If you don't want to decode or encode the content when opening a file, you need to specify the open mode as rb or wb to read or write in binary mode.

I hope this tutorial helps. See you in other tutorials! 👍

The "UnicodeDecodeError: invalid continuation byte" error in Python is usually raised when a sequence of bytes being processed is not valid for the encoding used to decode it. This error can occur when reading data from a file or from a database, or when processing data from an external source. To resolve this error, it's important to understand how the data is encoded and to make sure that it's properly decoded before being processed in Python.

Method 1: Use the correct encoding

When you encounter the UnicodeDecodeError with the message «invalid continuation byte», it means that Python is trying to decode a byte sequence that is not valid for the specified encoding. This error can be fixed by using the correct encoding.

Here are the steps to fix this error using the correct encoding:

Step 1: Determine the Encoding

The first step is to determine the encoding of the byte sequence. You can use the chardet library to automatically detect the encoding:

import chardet

with open('file.txt', 'rb') as f:
    data = f.read()

encoding = chardet.detect(data)['encoding']

Step 2: Decode the Byte Sequence

Once you have determined the encoding, you can decode the byte sequence using the correct encoding:

with open('file.txt', 'r', encoding=encoding) as f:
    data = f.read()

Step 3: Handle Errors

If the byte sequence contains invalid characters that cannot be decoded using the specified encoding, you can handle the errors using the errors parameter:

with open('file.txt', 'r', encoding=encoding, errors='replace') as f:
    data = f.read()

The errors parameter can take the following values:

  • 'strict': raise a UnicodeDecodeError if the byte sequence contains invalid characters
  • 'ignore': ignore the invalid characters and continue decoding
  • 'replace': replace the invalid characters with the Unicode replacement character U+FFFD
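
A quick sketch of how these handlers behave on a Latin-1 byte that is invalid UTF-8 (byte values chosen for illustration):

raw = b"caf\xe9"

# raw.decode('utf-8', errors='strict')        # would raise UnicodeDecodeError
print(raw.decode('utf-8', errors='ignore'))   # 'caf'  -- the bad byte is dropped
print(raw.decode('utf-8', errors='replace'))  # 'caf\ufffd' -- the bad byte becomes U+FFFD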

Step 4: Encode the Unicode String

If you need to encode the Unicode string back to bytes, you can use the encode() method:

data = 'Hello, world!'
encoded_data = data.encode(encoding)

Here, encoding is the encoding used to decode the byte sequence.

That’s it! By following these steps, you should be able to fix the UnicodeDecodeError with the message «invalid continuation byte» in Python by using the correct encoding.

Method 2: Check the data for invalid characters

If you are working with text data in Python, you may encounter the UnicodeDecodeError: invalid continuation byte error. This error occurs when you try to decode a string that contains invalid characters or bytes. In this tutorial, we will show you how to fix this error by checking the data for invalid characters.

Step 1: Read the File in Binary Mode

The first step is to read the file in binary mode using the rb mode instead of the r mode. This will ensure that the file is read as bytes and not as text.

with open('file.txt', 'rb') as file:
    data = file.read()

Step 2: Decode the Data

The next step is to decode the data using the appropriate encoding. In this example, we will use the utf-8 encoding.

try:
    text = data.decode('utf-8')
except UnicodeDecodeError:
    # fall back so that text is always defined for the next steps
    text = data.decode('utf-8', errors='replace')

Step 3: Check for Invalid Characters

Now that we have decoded the data, we can check for invalid characters using the isprintable() method. This method returns True if all the characters in the string are printable, otherwise it returns False.

invalid_chars = []
for char in text:
    if not char.isprintable():
        invalid_chars.append(char)

Step 4: Replace Invalid Characters

Finally, we can remove the invalid characters by replacing them with an empty string using the replace() method.

for char in invalid_chars:
    text = text.replace(char, '')

Full Example

Here is the full example:

with open('file.txt', 'rb') as file:
    data = file.read()

try:
    text = data.decode('utf-8')
except UnicodeDecodeError:
    # fall back so that text is always defined for the next steps
    text = data.decode('utf-8', errors='replace')

invalid_chars = []
for char in text:
    if not char.isprintable():
        invalid_chars.append(char)

for char in invalid_chars:
    text = text.replace(char, '')

This code will read the file in binary mode, decode the data using the utf-8 encoding, check for invalid characters, and remove them. This should fix the UnicodeDecodeError: invalid continuation byte error.

Method 3: Use a try-except block to handle the error

To fix the UnicodeDecodeError: 'utf-8' codec can't decode byte... error in Python, you can use a try-except block to handle the error. Here’s an example code snippet:

try:
    with open('file.txt', 'r', encoding='utf-8') as f:
        text = f.read()
except UnicodeDecodeError:
    with open('file.txt', 'r', encoding='ISO-8859-1') as f:
        text = f.read()

In this code, we try to open the file with UTF-8 encoding. If there’s a UnicodeDecodeError, we catch it with the except block and try to open the file again with ISO-8859-1 encoding.

You can also wrap the file reading code in a function to make it more reusable:

def read_file(filename):
    try:
        with open(filename, 'r', encoding='utf-8') as f:
            text = f.read()
    except UnicodeDecodeError:
        with open(filename, 'r', encoding='ISO-8859-1') as f:
            text = f.read()
    return text

This function takes a filename as an argument and returns the file’s contents. If there’s a UnicodeDecodeError, it tries to open the file again with ISO-8859-1 encoding.

In summary, using a try-except block to handle the UnicodeDecodeError in Python involves trying to open the file with UTF-8 encoding, catching the error if it occurs, and trying to open the file again with another encoding (such as ISO-8859-1). This approach allows you to handle the error gracefully and continue with your program’s execution.

Method 4: Force decode using the «ignore» option

To fix the UnicodeDecodeError with the invalid continuation byte error in Python, you can force decode the string using the «ignore» option. Here’s how you can do it in Python:

with open('filename.txt', 'rb') as f:
    data = f.read()

try:
    decoded_data = data.decode('utf-8', 'ignore')
except UnicodeDecodeError as e:
    print(f"Error: {e}")

with open('new_filename.txt', 'w') as f:
    f.write(decoded_data)

In this example, we first read the file in binary mode using rb. This is necessary because the file contains invalid bytes that can't be decoded directly. Then, we use the decode() method to decode the data with the "ignore" option, which tells Python to drop any invalid bytes and continue decoding the rest of the string. (If you would rather keep a marker where the bad bytes were, use the "replace" option instead, which substitutes the replacement character U+FFFD.) Finally, we write the decoded data to a new file in text mode using w.

Note that this method may result in some data loss, as any invalid bytes are simply dropped (or, with "replace", turned into the replacement character). If you want to preserve all the data in the file, you may need to use a different method, such as manually fixing the invalid bytes or using a different encoding.
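
If you want to keep a visible trace of the bad bytes instead of losing them, the 'backslashreplace' handler (supported for decoding since Python 3.5) keeps their values as escape sequences:

raw = b"caf\xe9"
print(raw.decode('utf-8', errors='backslashreplace'))  # caf\xe9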

Usually, there is no problem working with Latin characters. But when interacting with special characters, we can see the "UnicodeDecodeError: 'utf-8' codec can't decode byte in position: invalid continuation byte".

Why does the “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte in position: invalid continuation byte” appear? And how to solve it?

Encode and decode 2 different character sets

The error appears when we encode with one character set and try to use a different character set when we want to decode an object. See the example for a better understanding.

encoding = 'LearnShäreIT'.encode('latin-1')
decoding = encoding.decode('utf-8')

print(decoding) # UnicodeDecodeError

Error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 7

To solve this error, you must use the character set that was previously used for encoding when you decode the string you want, like the code sample below.

encoding = 'LearnShäreIT'.encode('utf-8')

# Using the same character set
decoding = encoding.decode('utf-8')

print(decoding)

Output:

LearnShäreIT

The charset is inconsistent when saving files and reading files

When we create and save a CSV file, suppose we choose the UTF-16 BE charset in the editor's save dialog.

But when reading the file with pandas.read_csv(), we use the default character set of read_csv() which is utf-8. See the code below for a better understanding.

import pandas as pd

# Using encoding = 'utf-8' but charset of data.csv = 'utf-16'
data = pd.read_csv('data.csv')

print(data)

Error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0

We have to set the encoding='utf-16' for consistency between encoding and decoding. Like this:

import pandas as pd

# Using encoding='utf-16'
data = pd.read_csv('data.csv', encoding='utf-16')

print(data)

Output:

          Name           Website
0  LearnShareIT  learnshareit.com
1      Facebook      facebook.com
2        Google        google.com
3         Udemy         udemy.com

Using detect() function in the chardet package

You can use chardet to detect the character encoding of a file. This library is handy when working with a large pile of text, but it can also be used with downloaded data whose charset you don't know.

Syntax:

chardet.detect(data)

Parameter:

  • data: the data (bytes) from the file whose character set you want to detect.

The detect() function detects what charset a non-Unicode string is using. It returns a dictionary containing the automatically detected charset and confidence level.
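
For example, the returned dictionary looks roughly like this (the exact fields and confidence depend on the data and the chardet version):

>>> import chardet
>>> chardet.detect('LearnShäreIT'.encode('utf-16'))
{'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}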

Before using the detect() function, we need to install the chardet with the following command line:

pip install chardet

Then we import chardet at the top of the Python file. Next, we pass the data into the detect() function to detect its charset. After getting the charset, we pass it to read_csv(). Like this:

import chardet
import pandas as pd

# Detect character encoding of data.csv
enc = chardet.detect(open('data.csv', 'rb').read())

print(enc['encoding'])  # UTF-16

# Use pandas to read data.csv
data = pd.read_csv('data.csv', encoding=enc['encoding'])

print(data)

Output:

UTF-16
          Name           Website
0  LearnShareIT  learnshareit.com
1      Facebook      facebook.com
2        Google        google.com
3         Udemy         udemy.com

Change character encoding manually

This way is very simple. Just open the file you need to read with Notepad++ and, on the menu bar, select Encoding -> Convert to UTF-8. After converting, the file reads with the default encoding.

Code:

import pandas as pd

# Using pandas to read data.csv with charset = UTF-8
data = pd.read_csv('data.csv')

print(data)

Output:

          Name           Website
0  LearnShareIT  learnshareit.com
1      Facebook      facebook.com
2        Google        google.com
3         Udemy         udemy.com

Summary

Basically, the error "UnicodeDecodeError: 'utf-8' codec can't decode byte in position: invalid continuation byte" comes from an inconsistency between the encoding and decoding processes. As long as you make sure to use the same character set for encoding and decoding (such as UTF-8), you won't get this error again.

Have a lucky day!



