Invalid start byte error using replace() function in python
I am running a simple script to replace one word with another in my files, like so:
import random
import os

path = '/path/of/file/'
files = os.listdir (path)

for file in files:
    with open (path + file) as f:
        newText = f.read().replace('Plastic Ba','PlasticBag')
    with open (path + file, "w") as f:
        f.write(newText)
In doing so, I get an error that I have never encountered before:
Traceback (most recent call last):
  File "replaceText.py", line 9, in <module>
    newText = f.read().replace('Plastic Ba', 'PlasticBag')
  File "/Users/vivek/anaconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
I am not sure what this means or what the mistake here is? I have run this script multiple times in the past without any issues. Any help on resolving this would be great!
The replace is completely irrelevant here; the exception is coming from the read(), before you even get there. And what the exception means is that the file is not UTF-8 (e.g., it's Latin-1 or cp1252), but you've tried to open it as UTF-8. (Or, possibly, that it's UTF-8 but corrupted, but that's less likely.)
– abarnert
Aug 21 at 23:29
You could potentially resolve the problem by opening the file in binary mode and doing replacements only using byte strings. But probably the better solution is to open the file with the correct encoding (and yes, CP 1252 is probably a decent guess if it isn't UTF-8 but it is a superset of ASCII).
– Daniel Pryden
Aug 21 at 23:32
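A minimal sketch of the binary-mode approach suggested in the comment above. The helper name replace_in_file_bytes and the demo file are hypothetical, and the sketch assumes the search string is plain ASCII stored in an ASCII-compatible encoding, so its bytes are the same whichever of UTF-8, Latin-1, or cp1252 the file uses:

```python
import os
import tempfile


def replace_in_file_bytes(file_path, old, new):
    """Replace old with new (both bytes) without ever decoding the file."""
    with open(file_path, "rb") as f:
        data = f.read()
    with open(file_path, "wb") as f:
        f.write(data.replace(old, new))


# Demo on a throwaway file that contains a byte (0x80) which is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")  # hypothetical file name
    with open(demo_path, "wb") as f:
        f.write(b"Plastic Ba costs \x80 5")
    replace_in_file_bytes(demo_path, b"Plastic Ba", b"PlasticBag")
    with open(demo_path, "rb") as f:
        result = f.read()

print(result)
```

Because nothing is decoded, the 0x80 byte passes through untouched and no UnicodeDecodeError can be raised.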
Specifically: UnicodeDecodeError means you're trying to read/decode/etc. text with the wrong encoding. 'utf-8' is the encoding you're trying to use (it's the default for most things nowadays). byte 0x80 in position 3131 is helpfully telling you where the problem happens, so you can, e.g., with open(path+file, 'rb') as f: print(f.read()[3100:3200]) to debug the problem. (Or to post it on Stack Overflow so someone else can debug it.)
– abarnert
Aug 21 at 23:32
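The same debugging step can be done without copying the position by hand, since UnicodeDecodeError carries the failing offset in its start attribute. This is only a sketch: the helper name show_decode_failure and the sample bytes are made up for illustration.

```python
def show_decode_failure(data, encoding="utf-8", context=20):
    """Return (position, nearby bytes) for the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte the codec could not decode.
        begin = max(exc.start - context, 0)
        return exc.start, data[begin:exc.start + context]
    return None


# Hypothetical sample: 0x80 sits at index 7 and cannot start a UTF-8 sequence.
sample = b"price: \x80 5 per Plastic Ba"
print(show_decode_failure(sample))
```

For a real file, read it with open(path + file, 'rb') and pass the resulting bytes to the helper.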
@DanielPryden Also, unlike Latin-1, cp1252 is a superset of ASCII where \x80 is '€', instead of a nonprinting control character, so… I probably should have suggested that one first.
– abarnert
Aug 21 at 23:34
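The difference between the two encodings for this particular byte can be checked directly. A small sketch, using only the standard codecs that ship with Python:

```python
raw = b"\x80"

as_cp1252 = raw.decode("cp1252")    # the euro sign
as_latin1 = raw.decode("latin-1")   # U+0080, a nonprinting control character

try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    # 0x80 is a continuation byte, so it cannot begin a UTF-8 sequence.
    utf8_failed = True

print(repr(as_cp1252), repr(as_latin1), utf8_failed)
```

Both single-byte codecs accept the byte, so a successful decode alone does not prove which encoding the file really uses; whether the result reads sensibly is the better hint.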
1 Answer
Did you try to encode the file as 'UTF-8'?
Please check the parameters of the open function:
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
In your script, try using:
with open (path + file, 'r', encoding='windows-1252') as f:
You can also check out the open function available in the codecs library.
Please check out these questions:
Unicode (UTF-8) reading and writing to files in Python
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte
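Putting the encoding suggestion into the original loop might look like the sketch below. It assumes the files really are cp1252 (windows-1252), which is a guess based on the 0x80 byte and not something the question confirms; the function name replace_in_dir and the demo file are hypothetical.

```python
import os
import tempfile


def replace_in_dir(path, old, new, encoding="cp1252"):
    """Replace old with new in every regular file directly inside path."""
    for name in os.listdir(path):
        file_path = os.path.join(path, name)
        if not os.path.isfile(file_path):
            continue  # skip subdirectories
        with open(file_path, encoding=encoding) as f:
            new_text = f.read().replace(old, new)
        # Write back with the same encoding so non-ASCII characters round-trip.
        with open(file_path, "w", encoding=encoding) as f:
            f.write(new_text)


# Demo on a throwaway directory with one cp1252-style file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")  # hypothetical file name
    with open(demo_path, "wb") as f:
        f.write(b"Plastic Ba costs \x80 5")
    replace_in_dir(tmp, "Plastic Ba", "PlasticBag")
    with open(demo_path, "rb") as f:
        fixed = f.read()

print(fixed)
```

Passing the same encoding when writing matters: without it the file would be rewritten in the platform default, which could silently change its bytes.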
The error message in the question indicates that the utf-8 codec is already being used (and failing to decode), so that can't be the answer.
– Daniel Pryden
Aug 22 at 0:04
What, you found a link to another question which fails to read a file and it fails at the same position 3131, with the same error? That is weird.
– zvone
Aug 22 at 0:19
What is the encoding of the text file? Can you provide a sample of what the file looks like around the 3131st byte?
– Daniel Pryden
Aug 21 at 23:28