Re: MySQL UTF-8 vs Extended ASCII

The underlying problem is that there is no one "Extended ASCII".  There are a 
dozen or so different schemes for the 128-255 range, all of them 
incompatible.  That's why Unicode now exists. :-)

I've been tracking this issue myself on my blog, which may be of background 
use for you (especially some of the pages I link, which give more theoretical 
background):

http://www.garfieldtech.com/blog/stupid-quotes
http://www.garfieldtech.com/blog/more-on-stupid-quotes
http://www.garfieldtech.com/blog/unicode-8-vs-16

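Odds are that the pages you're scraping are labeled as ISO 8859-1, but actually 
contain Windows-1252.  If you send Windows-1252 bytes to the database as UTF-8 
without converting them first, the characters in the extended range (128-255) 
will get mangled, because those bytes on their own are not valid UTF-8.  That 
is probably also why your text gets chopped off at the first such character.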
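
To make that concrete, here is a minimal sketch, assuming the mbstring 
extension is loaded.  The variable names are only for illustration.  It shows 
that a lone pound-sign byte (decimal 163, hex A3) is not valid UTF-8, and that 
byte 0x93 converts differently depending on whether you treat it as 
ISO 8859-1 or as Windows-1252:

```php
<?php
// Decimal 163 (0xA3) is the pound sign in both ISO 8859-1 and Windows-1252.
$pound = "\xA3";

// On its own, that byte is not a valid UTF-8 sequence.
$poundIsUtf8 = mb_check_encoding($pound, 'UTF-8');

// Converted properly, it becomes the two-byte UTF-8 sequence C2 A3.
$poundFixed = mb_convert_encoding($pound, 'UTF-8', 'Windows-1252');

// 0x93 is where the two encodings disagree: a C1 control character in
// ISO 8859-1, but a curly opening double quote (U+201C) in Windows-1252.
$quote    = "\x93";
$asLatin1 = mb_convert_encoding($quote, 'UTF-8', 'ISO-8859-1');
$asCp1252 = mb_convert_encoding($quote, 'UTF-8', 'Windows-1252');

var_dump($poundIsUtf8);
echo bin2hex($poundFixed), "\n";
echo bin2hex($asLatin1), "\n";
echo bin2hex($asCp1252), "\n";
```

If a page claims ISO 8859-1 but the curly quotes only come out right with the 
Windows-1252 conversion, the page is really Windows-1252.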

You may find this useful:
http://us.php.net/manual/en/function.mb-convert-encoding.php

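Given a list of candidate source encodings, it is reasonably good at guessing 
the incoming character set, at least the one time we used it at work.  No 
detector can tell ISO 8859-1 from Windows-1252 for certain, though, so check 
a sample of the results.  Try running all pages/strings through it to convert 
everything to UTF-8 before sending them to the database.  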
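
One way to do that, as a rough sketch rather than a definitive recipe: leave 
strings that are already valid UTF-8 alone, and convert everything else from 
a fallback encoding.  The helper name to_utf8() and the choice of 
Windows-1252 as the fallback are my assumptions, not something 
the scraped site guarantees.  It needs the mbstring extension.

```php
<?php
/**
 * Hypothetical helper: return $raw as UTF-8.
 *
 * Strings that already validate as UTF-8 are returned unchanged.  Anything
 * else is assumed to be in $fallback (Windows-1252 by default, which is
 * only a guess about the source pages) and converted from that.
 */
function to_utf8(string $raw, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

// A pound sign stored as the single byte 0xA3, followed by "100".
$clean = to_utf8("\xA3" . "100");
echo bin2hex($clean), "\n";
```

Checking for valid UTF-8 first matters because converting text that is 
already UTF-8 a second time is exactly how you end up with double-encoded 
junk.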

You should also make sure that not only is the MySQL connection set to UTF-8, 
the tables and columns themselves are, too.  MySQL lets you vary the encoding 
by table and field, so you should check to make sure that *everything* is 
UTF-8 explicitly.
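
As a sketch of what "check everything" could look like, the hypothetical 
helper below only builds the SQL statements to set the connection encoding, 
inspect each column's collation, and convert a table.  The function names and 
the table name `pages` are made up for illustration.  You would run the 
statements through whatever database API you already use, and the convert 
statement rewrites existing data, so try it on a copy of the table first.

```php
<?php
// Quote a table name as a MySQL identifier, doubling any embedded backticks.
function quote_identifier(string $name): string
{
    return '`' . str_replace('`', '``', $name) . '`';
}

/**
 * Hypothetical helper: build the statements for a UTF-8 audit of one table.
 * $charset is inserted as-is, so it must come from trusted code, not user
 * input.  On MySQL 5.5.3 and later, 'utf8mb4' is the full UTF-8 charset.
 */
function utf8_audit_statements(string $table, string $charset = 'utf8'): array
{
    $t = quote_identifier($table);
    return [
        // Tell the server what encoding this connection sends and expects.
        'connection' => "SET NAMES $charset",
        // The Collation column of the result shows each column's setting.
        'inspect'    => "SHOW FULL COLUMNS FROM $t",
        // Changes the table default and converts its text columns.
        'convert'    => "ALTER TABLE $t CONVERT TO CHARACTER SET $charset",
    ];
}

foreach (utf8_audit_statements('pages') as $label => $sql) {
    echo $label, ': ', $sql, "\n";
}
```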

Cheers.

PS: The free copy of php|architect from php|tek had a really good article on 
Unicode.  Even if you don't read anything else from that issue, read 
that. :-)

On Monday 11 June 2007, Richard Lynch wrote:
> This may actually be a MySQL question...
>
> Or not.
>
> I'm scraping about 55,000 pages from a website into a MySQL database.
>
> Some of these pages have "extended ASCII" values in their content, or,
> in some cases, just plain junk ASCII values, as far as I can tell.
>
> For example, decimal 163 is sometimes used to represent the UK
> monetary symbol for a Pound.
>
> Unfortunately, when I insert/update the text into the database, the
> text is chopped off, as far as I can tell, at any extended ASCII
> value.
>
> Now, I've set things up for UTF-8, expressly to avoid this kind of
> problem, I thought:
>
>   var_dump(mysql_get_server_info($connection));
>   var_dump(mysql_get_client_info());
>   var_dump(mysql_client_encoding($connection));
>
> string(10) "5.0.26-log"
> string(6) "5.0.26"
> string(4) "utf8"
>
> I'm open to any advice about the correct solution to convert "extended
> ASCII" as typed in emails by tens of thousands of users on diverse
> systems, from diverse countries...
>
> Note that "extended ASCII" is inconsistent from Microsoft to Apple to
> Unix to ..., as far as I understand it, so I really can't be sure
> which charset the original user was using.
>
> Of course, for now, I'll just change any extended ASCII into space and
> move on with life...  But that is not what I would like to end up with
> in the long run.
>
> And why is MySQL not just taking these extended ASCII chars in the
> first place?  Seems to me that UTF-8 encoding should accept them, no?
>
> Disclaimer: I am so NOT hip to this encoding stuff...
>
> --
> Some people have a "gift" link here.
> Know what I want?
> I want you to buy a CD from some indie artist.
> http://cdbaby.com/browse/from/lynch
> Yeah, I get a buck. So?


-- 
Larry Garfield			AIM: LOLG42
larry@xxxxxxxxxxxxxxxx		ICQ: 6817012

"If nature has made any one thing less susceptible than all others of 
exclusive property, it is the action of the thinking power called an idea, 
which an individual may exclusively possess as long as he keeps it to 
himself; but the moment it is divulged, it forces itself into the possession 
of every one, and the receiver cannot dispossess himself of it."  -- Thomas 
Jefferson

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

