Search Postgresql Archives

Re: How to remove non-UTF values from a table?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Phoenix Kiula wrote:
On Tue, Dec 15, 2009 at 9:26 PM, Howard Cole <howardnews@xxxxxxxxxxxxx> wrote:
Phoenix Kiula wrote:
An easy question for some I hope.

I have a DB from 8.2 days that when I now dump and try to take into
the 8.3.7, it gives me errors about utf-8 stuff.

I tried searching this list's archives but could not come up with an
answer.

Google returns some sites like these:
http://sniptools.com/databases/finding-non-utf8-values-in-postgresql -
but I'm not clear on how to use them.

Following the SQL on this site I could identify some columns that
contain text like this:

   "Évolution générale de la situation démographique"

So my guess is that the non-English characters were originally not
getting written in proper utf-8 variants.

Is there any SQL possibility to find these columns and replace them
with utf-8 equivalents using some postgresql commands? Couldn't find
anything in the "Strings functions" (chapter 9 of manual).

We're on CentOS.

Thanks!

My recommendation would be to install the iconv utility and run it on a
plain text (pg_dump -Fp) backup as suggested in the google article - and
then reimport the clean UTF-8.

I am surprised that you managed to install the original backup on 8.3
because it seems to be much more strict on encoding - Unless your database
is not in UTF-8?



Thanks Howard.

I ran the SQL and it finds anything that has non-English characters.
For example:


http://www.amazon.co.jp/%E3%83%A4%E3%83%9E%E3%83%80%E9%9B%BB%E6%A9%9F%E3%81%AE%E5%93%81%E6%A0%BC%E2%80%95No-1%E4%BC%81%E6%A5%AD%E3%81%AE%E6%BF%80%E5%AE%89%E5%93%B2%E5%AD%A6-%E7%AB%8B%E7%9F%B3-%E6%B3%B0%E5%89%87/dp/406214378X/ref=sr_1_1?ie=UTF8&s=books&qid=1199212694&sr=8-1


Part of this URL is actually in Japanese, but when I paste it in this
email it comes up with all these percentage signs. I suppose this is
"url encoded".

Shouldn't this be valid UTF-8? How does PG calculate if something is
not valid UTF-8?

Thanks.

I am sure that postgres uses standard test to see if text is valid UTF8, however I guess you will only get a warning when you try to read or write the data, and the warning only appears in the log as far as I am aware. Encoding checking has become more strict with each upgrade to postgres, this is why you are seeing errors as you import into 8.3 from an 8.2 backup.

I still maintain that you should check and fix using iconv rather than SQL. Read your distro notes on how to install iconv (if it isn't already installed) and run it on your plain text backup. You can then pinpoint individual changes using diff if you want to find it in your exising 8.2 database. Better still, just fix using Iconv then import the clean data into 8.3 or 8.4

Howard Cole
www.selestial.com


--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux