Well here’s a curly issue I blundered into today. I was trying to replicate a Postgres database from an old server to a new one using pg_dump/pg_restore and found some errors like this:
invalid byte sequence for encoding "UTF8": 0x92
After some investigation it turns out the older version of Postgres defaults to ASCII encoding of the data wheras the new version of Postgres defaults to UTF-8. Of course I would much prefer to use UTF-8 but shouldn’t those two encodings be compatible anyway? UTF-8 is supposed to be backward-compatible with ASCII.
What happened is a few non-ASCII characters got into the database. 0×92 is not a valid ASCII code. It is a code from Latin-1 encoding. This was in the database thanks to my business partner who was entering some crap into the database using a Macintosh which wasn’t following the ASCII standard so much.
So how to solve it?
First I did a plain-text export of the database on the old server
pg_dump -Fp my_database > mydatabase.sql
Then I converted the encoding to UTF-8
iconv -f 8859_1 -t UTF-8 mydatabase.sql> mydatabase.utf8.sql
Finally I imported the converted data into the database on the new server
createdb my_database psql my_database -f mydatabase.utf8.sql
And then I wrote a blog post so I’d remember all this next time the issue comes up.