Have you ever seen a stream of data coming in from a network with some accented European characters in an encoding you don’t recognize? Sometimes bad coding practices, or assumptions about encoding made when pasting text into documents, leave all or part of a document in a different encoding from the one the file claims to use. This is a quick way to find out which encoding(s) match.
It’s not fully automated; it still requires your eyes. But it can make a difference when you’re writing parsing code and don’t know what to do with some edge cases. Coupling code like this with a spell checker inside the loop might give you some degree of automation.
- First, make sure iconv is installed (macOS normally ships with it; with MacPorts the port is called libiconv, so use sudo port install libiconv).
- Next, use curl (sudo port install curl if you don’t have it) to get the stream and save it to a file, or copy and paste the section that looks strange into a file.
- Now write this program and run it:
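#!/usr/bin/ruby
# Decode the same file under every charset iconv knows and print each attempt.

# `iconv -l` lists charset names (and aliases) separated by whitespace.
CHARSETS = `iconv -l | xargs`.split(' ')
#puts CHARSETS.join(',')

RESULTS = {}
CHARSETS.each { |charset|
  #puts "Trying: #{charset}"
  # 2>&1 keeps iconv's error message in the result when the charset does not fit.
  RESULTS[charset] = `cat untitled\\ thefile.txt | iconv --from-code=#{charset} 2>&1`
}

RESULTS.each { |charset, result|
  # gsub (not sub) flattens every newline, so each charset stays on one line.
  puts "#{charset} - #{result.gsub("\n", ' ')}"
}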
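Now look at the results. Sometimes more than one encoding will match, because many single-byte encodings accept almost any byte sequence and differ only in which characters they display.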
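
If shelling out to iconv is not convenient, the same trial and error can probably be done with Ruby's built-in transcoding. This is only a minimal sketch: the CANDIDATES list and the decode_candidates helper are names made up here for illustration, and the sample bytes are an assumed example rather than real data. It keeps every candidate under which the bytes are valid, which also shows how several encodings can match the same input.

```ruby
# Candidate encodings to try. This short list is an assumption; extend it as needed.
CANDIDATES = %w[UTF-8 ISO-8859-1 ISO-8859-15 Windows-1252 IBM437]

# Hypothetical helper: returns { encoding name => UTF-8 text } for every
# candidate that can decode the raw bytes without an error.
def decode_candidates(bytes, candidates = CANDIDATES)
  candidates.each_with_object({}) do |name, decoded|
    text = bytes.dup.force_encoding(name)
    # Skip candidates for which the bytes are not a valid sequence.
    next unless text.valid_encoding?
    begin
      decoded[name] = text.encode('UTF-8')
    rescue EncodingError
      # Undefined conversion or missing converter: treat as "no match".
    end
  end
end

# The word "café" as ISO-8859-1 bytes (0xE9 is "é" in that encoding).
sample = "caf\xE9".b
decode_candidates(sample).each do |name, text|
  puts "#{name} - #{text}"
end
```

A lone 0xE9 byte is not valid UTF-8, so that candidate drops out, while the single-byte encodings all produce some text and you still have to pick the one that reads correctly.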
Python has a fairly nice library for this, chardet: http://chardet.feedparser.org/
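
The spell-checker idea mentioned earlier can be roughed out in Ruby as well. The sketch below is an assumption-laden illustration, not a real detector: KNOWN_WORDS, word_score and rank_encodings are hypothetical names, and the tiny word list stands in for whatever dictionary a real spell checker would provide. It decodes the bytes under each candidate encoding and ranks the candidates by how many of the resulting words appear in the dictionary.

```ruby
require 'set'

# Tiny stand-in for a spell checker's dictionary (hypothetical word list).
KNOWN_WORDS = Set.new(%w[café crème brûlée señor mañana über])

# Fraction of the words in the text that appear in the dictionary.
def word_score(text, dictionary = KNOWN_WORDS)
  words = text.downcase.scan(/\p{L}+/)
  return 0.0 if words.empty?
  words.count { |word| dictionary.include?(word) }.to_f / words.size
end

# Decode the bytes under each candidate and rank by word score, best first.
# Candidates that cannot decode the bytes at all are dropped.
def rank_encodings(bytes, candidates)
  scored = candidates.map do |name|
    text = bytes.dup.force_encoding(name)
    next unless text.valid_encoding?
    begin
      [name, word_score(text.encode('UTF-8'))]
    rescue EncodingError
      nil
    end
  end
  scored.compact.sort_by { |_name, score| -score }
end

# Assumed sample: two French words stored as ISO-8859-1 bytes.
sample = "café crème".encode('ISO-8859-1').b
rank_encodings(sample, %w[UTF-8 ISO-8859-1 IBM437]).each do |name, score|
  puts format('%-12s %.2f', name, score)
end
```

With a real word list for the language you expect, the top-ranked encoding is a reasonable first guess, though short or mixed-language samples will still need a human look.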