Peter C. Lai wrote: > The doublequotes isn't UTF8 it's people copying and pasting from Microsoft > stuff, which is WIN-1252. So try to use that with iconv instead of utf8 > > On 2010-08-16 12:40:03PM -0500, Karl Denninger wrote: > >> So I have myself a nice pickle here..... >> >> I've got a database which was originally created with SQL_ASCII for the >> encoding (anything goes text fields) >> >> Unfortunately, I have a bunch of data that was encoded in UTF-8 that's >> in an RSS feed that I need to load into said database. iconv barfs all >> over this file in an attempt to turn it into ISO-8859 (which is what the >> web application currently attached to that database is emitting and >> collecting.) It appears the problem is (mostly) things like the >> stylized double-quotes. >> >> So I figured I'd go the other way, and convert what I have now in the >> tables into UTF8. >> >> Well, except that doesn't work either. >> >> ticker=# select convert_to(subject, 'utf8') from post where >> ordinal='2098167'; >> convert_to >> ---------------------------------------------------------------- >> 1%: the interest rate on IBM\222s most recent three-year bond. >> (1 row) >> >> \222 is the correct code point for the styled single apostrophe that is >> in that place in ISO-8859-1 in the source. However, the UTF prefix is >> missing, as are the other two code-point characters (that is, I got the >> code point but not the other two bytes that should be in front of it. >> And if I set the code page on the web site to UTF-8, and also set the >> encoding on the SQL session to UTF-8 I don't get the three-byte code - I >> just get the one byte. That's a bust. >> >> There are TWO fields in this database that need converted. I presumed I >> could do something like this: >> >> # update post set field1 = convert_to(field1, 'utf8'); >> >> It runs to completion without complaint but produces the above. No good. >> >> So.... is there a way to do this? I do NOT want to dump, iconv on the >> dumped file, then reload the database if I can possibly avoid it. Yes, >> I know that will work (I've tested it on my development box), but being >> able to do this "hot" would be DRAMATICALLY preferred. If I do the >> iconv route I am basically rebuilding the entire database with all >> that's involved in doing so in terms of downtime and such. VERY >> undesirable. >> >> (Equally-valid would be a way to convert UTF-8 to ISO-8859-1 (Latin1), >> thereby allowing me to convert the incoming data stream to what's >> already in the system, but thus far I've found no joy on that at all.) >> >> Ideas? >> >> -- Karl >> No, the problem is that the existing data in the database and the web app that are using it are both ISO-8859-1. The issue is that the data I need to load INTO the system is in UTF-8 (and really is), and is full of three-byte escapes. iconv barfs all over it trying to go to ISO-8859-1, so I can't convert the INCOMING data to what's in the system now. It also won't convert it to Windows-1252 (which is kind of a superset of ISO-8859, and thus the server might not toss on it too badly.) FS/karl:~/tmp> iconv -f UTF-8 -t WINDOWS-1252 <rss-marketticker.php >/dev/null iconv: (stdin):2766:6125: cannot convert FS/karl:~/tmp> iconv -f UTF-8 -t ISO-8859-1 <rss-marketticker.php >/dev/null iconv: (stdin):222:391: cannot convert The data set that is incoming is some ~30,000 records. The data in the system already is ~2m records. Obviously, if I can convert the incoming data that's the better choice, but it appears I can't. Thus the desire to go the other way - turn the existing data in the database into UTF-8, which is probably desirable in the future anyway. I can iconv a pg_dumpall of the database TO UTF-8 and it succeeds (at least in theory), but the database into which I want to load this data set is HUGE and dumping and reloading it isn't on my "A" list of things to do. If I have to I have to - but if I can avoid this I certainly WANT TO. What I don't understand is why the Postgres function "convert_to()" doesn't emit the THREE BYTE sequence, and only emits the codepoint. -- Karl
begin:vcard fn:Karl Denninger n:Denninger;Karl email;internet:karl@xxxxxxxxxxxxx x-mozilla-html:TRUE version:2.1 end:vcard
-- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general