I would like to echo the sentiment on collation and expand it to character sets in general. When issues with them come up, they do take an incredible amount of time and effort to resolve, and are one of my own biggest pain points when dealing with databases and datasets from other sources. Case in point: I have one dataset that I pull into my local (Windows laptop) for analysis on a weekly basis. Invariably it is sprinkled with special characters that trip up the copy command; these appear to come from people fat-fingering in the data input and the system not checking it. It only takes one of these special characters to trip up and negate the entire result of the copy command. After wrestling with this for a couple days about a year ago, the workaround I found that works is to first import it into a MySQL table, strip out the characters in MySQL, dump the data out to a CSV and finally bring the sanitized data into Postgres using the copy command. So yes, character set and collations sets are the bane of DBAs, and yes, they are a bitch to troubleshoot and yes, they often take very creative solutions to fix. Michael
|