If you have to deal with uploads from real-life users, you run into encoding issues from time to time. This is especially pernicious with CSV files, which Excel loves to export in strange, Windows-specific Latin codecs.
What’s with the garbled characters? What the heck darn kind of outfit are you running here?
^ Feedback from a client, probably
Enter chardet
chardet is a character encoding detector from the good folks at Mozilla. It was originally written in C++, but we’ll be using the Python port.
pip install chardet
chardet looks at the data you feed it and makes an educated guess about the codec. Here’s a very basic codec guessing function:
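from chardet.universaldetector import UniversalDetector

DEFAULT_ENCODING = "utf-8"


def guess_codec(file_name: str) -> str:
    """Guess a file's encoding, falling back to DEFAULT_ENCODING."""
    codec_detector = UniversalDetector()
    # Read raw bytes; decoding is exactly what we can't do yet.
    with open(file_name, mode="rb") as file:
        # Feed line by line and stop as soon as the detector is confident.
        for line in file:
            codec_detector.feed(line)
            if codec_detector.done:
                break
    # close() finalizes the guess, which is then available on .result.
    codec_detector.close()
    encoding = codec_detector.result.get("encoding")
    # "encoding" is None when chardet can't decide (e.g. an empty file).
    return encoding or DEFAULT_ENCODING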
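Once you have a codec name, you still have to use it. Here’s a minimal sketch of one way to do that with the standard library’s csv module. The helper name read_csv_rows is hypothetical, and errors="replace" is my own assumption about how you’d want to handle a wrong guess: chardet only guesses, so this trades a crash for the odd replacement character.

```python
import csv
from typing import Iterator, List


def read_csv_rows(file_name: str, encoding: str) -> Iterator[List[str]]:
    """Yield rows from a CSV file decoded with the given codec.

    Hypothetical helper: `encoding` is meant to come from guess_codec().
    """
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising,
    # which matters because a guessed codec can still be wrong.
    # newline="" is what the csv module docs recommend for file objects.
    with open(
        file_name, mode="r", encoding=encoding, errors="replace", newline=""
    ) as file:
        yield from csv.reader(file)
```

With guess_codec from above in scope, usage would look something like rows = list(read_csv_rows(path, guess_codec(path))). If you’d rather fail loudly on a bad guess, drop the errors argument and catch UnicodeDecodeError instead.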
I understand that if you’re dealing with user uploads, there’s a pretty tight ceiling on your happiness, but I hope this helps in a marginal way.