MySQL UTF-8 vs Extended ASCII

"Richard Lynch" <ceo@xxxxxxxxx> · Mon, 11 Jun 2007 18:31:20 -0500 (CDT)

This may actually be a MySQL question...

Or not.

I'm scraping about 55,000 pages from a website into a MySQL database.

Some of these pages have "extended ASCII" values in their content, or,
in some cases, just plain junk ASCII values, as far as I can tell.

For example, decimal 163 (0xA3) is sometimes used to represent the UK
pound sign (£).

Unfortunately, when I insert/update the text into the database, the
text is chopped off, as far as I can tell, at any extended ASCII
value.
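
To make the symptom concrete, here is a minimal sketch of what I suspect
is going on: a lone byte 163 is not a valid UTF-8 sequence, so a
connection that expects utf8 may stop at it. The sample string and the
is_valid_utf8() helper are made up for illustration; the check relies on
preg_match() with the /u modifier rejecting input that is not valid UTF-8.

```php
<?php
// Hypothetical sample: "Price: " followed by raw byte 163 (0xA3) and "5".
$raw = "Price: " . chr(163) . "5";

// Illustrative helper (not part of PHP): preg_match() with the /u modifier
// returns false when the subject is not valid UTF-8, and 1 when the empty
// pattern matches a valid string.
function is_valid_utf8($s) {
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8($raw));                // bool(false)

// The same text with the pound sign encoded as UTF-8 (bytes 0xC2 0xA3).
var_dump(is_valid_utf8("Price: \xC2\xA35"));  // bool(true)
```

If that is the cause, the text would need converting to UTF-8 before the
insert, which in turn requires knowing the source charset.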

Now, I've set things up for UTF-8, expressly to avoid this kind of
problem, I thought:

  var_dump(mysql_get_server_info($connection));
  var_dump(mysql_get_client_info());
  var_dump(mysql_client_encoding($connection));

string(10) "5.0.26-log"
string(6) "5.0.26"
string(4) "utf8"

I'm open to any advice about the correct solution to convert "extended
ASCII" as typed in emails by tens of thousands of users on diverse
systems, from diverse countries...

Note that "extended ASCII" is inconsistent from Microsoft to Apple to
Unix to ..., as far as I understand it, so I really can't be sure
which charset the original user was using.
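
As a rough illustration of that ambiguity, the sketch below reads the
same byte 163 under three single-byte charsets using iconv(). The three
charsets are picked only as examples; they are an assumption for the
demonstration, not a claim about what the scraped pages actually use.

```php
<?php
// The same raw byte, decimal 163 (0xA3), read under three single-byte charsets.
// The charset list is illustrative only, not a guess about the scraped pages.
$byte = chr(163);
$decoded = array();
foreach (array('ISO-8859-1', 'ISO-8859-2', 'KOI8-R') as $charset) {
    // iconv() converts from the assumed source charset to UTF-8.
    $decoded[$charset] = iconv($charset, 'UTF-8', $byte);
}

// Print each result with its UTF-8 bytes in hex for comparison.
foreach ($decoded as $charset => $utf8) {
    echo $charset, ' => ', $utf8, ' (', bin2hex($utf8), ")\n";
}
```

ISO-8859-1 yields the pound sign, ISO-8859-2 yields Ł, and KOI8-R yields
ё, so the right conversion depends entirely on guessing the source
charset correctly.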

Of course, for now, I'll just change any extended ASCII into space and
move on with life...  But that is not what I would like to end up with
in the long run.
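
The stopgap could look something like this minimal sketch, assuming the
scraped text is in a variable such as $raw (a made-up sample here):

```php
<?php
// Hypothetical sample with one raw high byte (decimal 163).
$raw = "Price: " . chr(163) . "5";

// Replace every byte in the range 0x80-0xFF with a space.
// No /u modifier: the pattern works on bytes, not UTF-8 characters.
$clean = preg_replace('/[\x80-\xFF]/', ' ', $raw);

var_dump($clean); // string(9) "Price:  5"
```

Note that this also wipes out any text that was already valid UTF-8
beyond plain ASCII, since every byte of a multi-byte character falls in
that range.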

And why is MySQL not just taking these extended ASCII chars in the
first place?  Seems to me that UTF-8 encoding should accept them, no?

Disclaimer: I am so NOT hip to this encoding stuff...
