Matt White

developer


Chardet: Easy Codec Detection in Python

If you deal with uploads from real-life users, you will run into encoding issues from time to time. The problem is especially pernicious with CSV files, which Excel loves to export in strange, Windows-specific Latin codecs such as Windows-1252.

What’s with the garbled characters? What the heck darn kind of outfit are you running here?

^ Feedback from a client, probably
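For the curious, here is a minimal sketch of where the garbled characters come from. The sample word and the choice of cp1252 are assumptions for illustration, not taken from a real upload. It shows both ways a codec mismatch can go wrong: a hard failure and a silent one.

```python
# Illustrative sample text; any non-ASCII character would do.
text = "café"

# Bytes as a Windows-1252 (cp1252) export would write them: "é" becomes 0xE9.
excel_bytes = text.encode("cp1252")

# Reading those bytes as UTF-8 fails outright, because 0xE9 starts a
# multi-byte sequence that never completes.
try:
    excel_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The opposite mistake does not raise at all. UTF-8 bytes read as cp1252
# decode "successfully" into mojibake.
garbled = text.encode("utf-8").decode("cp1252")

print(utf8_ok)
print(garbled)
```

The second case is the nastier one, since nothing crashes and the bad text quietly lands in your database.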

Enter chardet

chardet is a character encoding detector from the good folks at Mozilla. It was originally written in C++, but we’ll be using the Python port.

pip install chardet

chardet looks at the data you feed it and makes an educated guess about the codec. Here’s a very basic codec guessing function:

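from chardet.universaldetector import UniversalDetector

# Fallback used when chardet cannot make a guess at all.
DEFAULT_ENCODING = "utf-8"

def guess_codec(file_name: str) -> str:
    """Return chardet's best guess for the file's encoding."""
    codec_detector = UniversalDetector()
    # Open in binary mode: the detector needs raw bytes, not decoded text.
    with open(file_name, mode="rb") as file:
        while True:
            line = file.readline()
            if not line:
                break

            # Feed the file incrementally and stop as soon as the
            # detector is confident, so large files are not read in full.
            codec_detector.feed(line)
            if codec_detector.done:
                break

    # close() finalizes detection and returns the result dict.
    result = codec_detector.close()
    # "encoding" is None when detection fails, hence the fallback.
    encoding = result.get("encoding")
    return encoding or DEFAULT_ENCODING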
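Once you have a codec name, you still have to read the file with it. Here is one way that could look, as a sketch rather than a recommendation: read_csv_rows is a hypothetical helper, not part of chardet, and the choice of errors="replace" is an assumption about how forgiving you want to be.

```python
import csv
from typing import List

def read_csv_rows(file_name: str, encoding: str) -> List[List[str]]:
    """Read a CSV file as text using the given codec name (hypothetical helper)."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising,
    # which is useful because a detected codec is only an educated guess.
    # newline="" is what the csv module documentation recommends for files.
    with open(
        file_name, mode="r", encoding=encoding, errors="replace", newline=""
    ) as file:
        return list(csv.reader(file))
```

Paired with the function above, the call would be read_csv_rows(path, guess_codec(path)). If you would rather fail loudly on a bad guess, drop the errors argument and catch UnicodeDecodeError yourself.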

I understand that if you’re dealing with user uploads, there’s a pretty tight ceiling on your happiness, but I hope this helps in a marginal way.
