Replicating ASCII Database to UTF-8 in PostgreSQL

Well here’s a curly issue I blundered into today. I was trying to replicate a Postgres database from an old server to a new one using pg_dump/pg_restore and found some errors like this:

invalid byte sequence for encoding "UTF8": 0x92

After some investigation it turns out the older version of Postgres defaults to ASCII encoding of the data wheras the new version of Postgres defaults to UTF-8. Of course I would much prefer to use UTF-8 but shouldn’t those two encodings be compatible anyway? UTF-8 is supposed to be backward-compatible with ASCII.

What happened is a few non-ASCII characters got into the database. 0×92 is not a valid ASCII code. It is a code from Latin-1 encoding. This was in the database thanks to my business partner who was entering some crap into the database using a Macintosh which wasn’t following the ASCII standard so much.

So how to solve it?

First I did a plain-text export of the database on the old server

pg_dump -Fp my_database > mydatabase.sql

Then I converted the encoding to UTF-8

iconv -f 8859_1 -t UTF-8 mydatabase.sql > mydatabase.utf8.sql

Finally I imported the converted data into the database on the new server
createdb my_database
psql my_database -f mydatabase.utf8.sql

And then I wrote a blog post so I’d remember all this next time the issue comes up.