Re: Question, re: Running enormous batch file from the command line

Milarsky, Jeremy F. wrote:

I have, for the record, attempted to get all 10 million rows in with a
single COPY command. This method has not been successful, apparently due
to client encoding (despite executing several different "set
client_encoding" commands in each attempt).

This is probably something you should investigate, rather than work around, as it may indicate that you're loading incorrectly encoded data into your DB.

What is your database encoding? What encoding does the CSV dump from the DBF files use? What is your system's default text encoding? What encoding were you telling psql to use?


Personally, to fix this I'd start by using the wonderful `iconv' tool to convert the CSV data from its original encoding (probably one of the WIN- codepages, but you need to find out) into UTF-8. If you encounter any encoding errors in this process, you need to look at what's going on there and determine if your source data is in a different encoding to what you thought it was (and use the right one). If it turns out that different parts of the data are in different encodings, that's something you will need to clean up.
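
If you'd rather do that conversion step in Python than with iconv, a minimal sketch might look like the following - the encoding and file names are only placeholders, and "windows-1252" is a guess you'd need to verify against the real DBF export:

    # Rough Python equivalent of:
    #   iconv -f WINDOWS-1252 -t UTF-8 dump.csv > dump.utf8.csv
    # "windows-1252" is only a guess at the source encoding.
    with open("dump.csv", "rb") as src, \
         open("dump.utf8.csv", "w", encoding="utf-8", newline="") as dst:
        for raw in src:
            # Strict decoding: the first byte that doesn't fit the guessed
            # encoding raises an error instead of being silently mangled.
            dst.write(raw.decode("windows-1252"))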

You have two options for cleaning up such mis-encoded data. One way is to do the cleanup on the CSV data - say, using Python and the `csv' module to load it, test it, and make whatever repairs you deem necessary on a record-by-record basis before writing out a fixed CSV file.
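
A rough outline of that kind of cleanup pass is below; fix_record() is a hypothetical placeholder for whatever repairs your data turns out to need, and the file names are made up:

    import csv

    def fix_record(row):
        # Hypothetical repairs - trim whitespace, drop junk fields,
        # whatever inspection of the bad records shows is needed.
        return [field.strip() for field in row]

    with open("dump.utf8.csv", newline="", encoding="utf-8") as src, \
         open("dump.clean.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(fix_record(row))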

The other way is to load the data into PostgreSQL as `bytea' or using the SQL_ASCII encoding, then do the cleanup in PostgreSQL. Personally I suspect a little Python cleanup script is easiest. The "dumb" way would be to try str.decode(...) and, if it throws an exception, catch it and try the next encoding in your list of suspects. Of course, you can do much smarter things than that with a little bit more work and some examination of the faulty data.
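
That "dumb" approach might look something like this (sketched in Python 3, where decode() lives on bytes rather than str; the list of suspect encodings is a guess you'd replace with your own candidates):

    # Try each suspect encoding in turn and keep the first clean decode.
    # Note that latin-1 accepts every possible byte, so it acts as a
    # catch-all rather than a real test - put it last, if at all.
    SUSPECTS = ["utf-8", "windows-1252", "latin-1"]

    def decode_guess(raw):
        for enc in SUSPECTS:
            try:
                return raw.decode(enc)
            except UnicodeDecodeError:
                continue
        # Nothing fit; this record needs a closer look by hand.
        raise ValueError("no suspect encoding fits: %r" % raw)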

Once you have CSV input data of a known and correct encoding (say, UTF-8) you can just "set client_encoding" as appropriate and \copy from psql.
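
If you end up driving the load from a script rather than an interactive psql session, the same two steps look roughly like this with psycopg2 - the connection string and table name are placeholders, and this is only a sketch of one way to do it:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
    conn.set_client_encoding("UTF8")        # same effect as "set client_encoding"
    with conn, conn.cursor() as cur, \
         open("dump.clean.csv", encoding="utf-8") as f:
        # Like psql's \copy: the file is read client-side and streamed
        # to the server over the connection.
        cur.copy_expert("COPY mytable FROM STDIN WITH CSV", f)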

--
Craig Ringer

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
