Coding Hell

Programming and stuff.

File Handling: Never Trust Default Encoding Settings

Whenever you are doing any kind of file shenanigans, you should always explicitly set the encoding of the file you are reading from. The standard file handling libraries of most programming languages fall back to a default encoding that is derived from the environment, so the same code can behave differently from one session to the next.

Example code

Let me demonstrate the potential problems of the default encoding settings with the following example code:

example.rb

data = File.read('test.txt')
puts data.gsub(/Example/, 'Coding')

test.txt

Example with ❤

When I connect via SSH from my Mac to a server that has this code installed, everything works perfectly fine:

$ ruby example.rb
Coding with ❤
$ echo $?
0

But when I connect from a Windows machine to the same server and do the same thing, I end up with a different result:

$ ruby example.rb
example.rb:2:in `gsub': invalid byte sequence in US-ASCII (ArgumentError)
        from example.rb:2:in `<main>'
$ echo $?
1
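The failure can be reproduced without a second machine. The following is a minimal sketch, assuming that Ruby tagged the file contents as US-ASCII because of the session's locale; the helper name replace_example is made up for illustration. It relabels the same bytes as US-ASCII and runs the same gsub call:

```ruby
# Hypothetical helper wrapping the substitution from example.rb.
def replace_example(text)
  text.gsub(/Example/, 'Coding')
end

utf8  = "Example with \u2764"                       # same content as test.txt
ascii = utf8.dup.force_encoding(Encoding::US_ASCII) # same bytes, different label

puts utf8.valid_encoding?   # true
puts ascii.valid_encoding?  # false: the heart is not valid US-ASCII

puts replace_example(utf8)

begin
  replace_example(ascii)
rescue ArgumentError => e
  # Matching a regexp against a string with invalid bytes raises ArgumentError.
  puts "#{e.class}: #{e.message}"
end
```

The bytes are identical in both strings. Only the encoding label differs, and that label is what the default settings pick for you.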

Behind the scenes

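When you connect to a server via SSH, many clients are configured to send environment variables that pass on your local locale settings (in OpenSSH this is controlled by the SendEnv option, and the server has to accept them via AcceptEnv). You can see this once you enable the debug output of your SSH client: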

debug1: Sending environment.
debug1: Sending env LC_ALL = en_US.UTF-8
debug1: Sending env LANG = en_US.UTF-8
debug1: Sending env LC_CTYPE = en_US.UTF-8

If these variables are omitted, you end up with the server's default locale, which in my case was a non-Unicode version of en_US.

The same problem will emerge if your script is executed as a cron job: the cron daemon normally also ends up with the system's default locale settings.
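To check which settings a given session or cron job actually ends up with, a short diagnostic script helps. This is only a sketch; the method name encoding_report is an assumption for illustration, while Encoding.default_external and Encoding.find('locale') are part of Ruby's core library:

```ruby
# Hypothetical helper: collects the settings that decide how File.read
# labels data when no encoding is given explicitly.
def encoding_report
  {
    lang: ENV['LANG'],
    lc_all: ENV['LC_ALL'],
    default_external: Encoding.default_external, # used by File.read by default
    locale: Encoding.find('locale')              # encoding derived from the locale
  }
end

encoding_report.each { |name, value| puts "#{name}: #{value.inspect}" }
```

Running this once over SSH from each client and once from cron should show where the sessions differ. On many Linux systems, starting the original script with LC_ALL=C set is likely to reproduce the US-ASCII behaviour as well.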

Explicit encoding to the rescue

When you are reading data from a file, you should always explicitly define the encoding you expect the file to be in:

example.rb

data = File.read('test.txt', encoding: 'UTF-8')
puts data.gsub(/Example/, 'Coding')

This way you end up with portable code that executes in a predictable way independent of environment settings. Just make sure to specify the encoding your application uses for file handling in your documentation.
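Setting the encoding only tells Ruby how to label the bytes; it does not check that the file really is UTF-8. A possible extension, sketched here with a made-up helper name read_utf8, is to validate the data right after reading so that a wrongly encoded file fails at the point of reading with a clear message:

```ruby
require 'tempfile'

# Hypothetical helper: reads a file as UTF-8 and rejects invalid byte sequences.
def read_utf8(path)
  data = File.read(path, encoding: 'UTF-8')
  raise ArgumentError, "#{path} is not valid UTF-8" unless data.valid_encoding?
  data
end

# Usage: write a small UTF-8 file and process it independent of the locale.
Tempfile.create('encoding-demo') do |file|
  file.binmode
  file.write("Example with \u2764\n")
  file.flush
  puts read_utf8(file.path).gsub(/Example/, 'Coding')
end
```

Whether to raise, or to replace the broken bytes with String#scrub, depends on how much you trust the input. Raising is the stricter choice.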

The same concepts apply to other programming languages as well. Java's java.io.FileReader, for example, offered no way to state the encoding manually before Java 11 added constructors that take a Charset, which made it practically useless for portable code. Its documentation said:

The constructors of this class assume that the default character encoding and
the default byte-buffer size are appropriate. To specify these values yourself,
construct an InputStreamReader on a FileInputStream.
