The underlying problem is that there is no one "Extended ASCII". There are a
dozen or so different schemes for the 128-255 range, all of them incompatible.
That's why Unicode now exists. :-)

I've been tracking this issue myself on my blog, which may be of background
use for you (especially some of the pages I link, which give more theoretical
background):

http://www.garfieldtech.com/blog/stupid-quotes
http://www.garfieldtech.com/blog/more-on-stupid-quotes
http://www.garfieldtech.com/blog/unicode-8-vs-16

Odds are that the pages you're scraping are declared as ISO 8859-1, but
actually contain Windows-1252. If you store Windows-1252 bytes as UTF-8
without converting them first, the characters in the extended range (128-255)
will get mangled, because a lone byte such as 163 is not a valid UTF-8
sequence. That is most likely why MySQL chops your text at that point.

You may find this useful:

http://us.php.net/manual/en/function.mb-convert-encoding.php

Given a list of candidate encodings, it is frequently reasonably good at
guessing the incoming character set, at least the one time we used it at work.
Try running all pages/strings through it to convert everything to UTF-8 before
sending them to the database.

You should also make sure that not only is the MySQL connection set to UTF-8,
but the tables and columns themselves are, too. MySQL lets you vary the
encoding by table and field, so you should check to make sure that
*everything* is UTF-8 explicitly.

Cheers.

PS: The free copy of php|architect from php|tek had a really good article on
Unicode. Even if you don't read anything else from that issue, read that. :-)

On Monday 11 June 2007, Richard Lynch wrote:
> This may actually be a MySQL question...
>
> Or not.
>
> I'm scraping about 55,000 pages from a website into a MySQL database.
>
> Some of these pages have "extended ASCII" values in their content, or,
> in some cases, just plain junk ASCII values, as far as I can tell.
>
> For example, decimal 163 is sometimes used to represent the UK
> monetary symbol for a Pound.
>
> Unfortunately, when I insert/update the text into the database, the
> text is chopped off, as far as I can tell, at any extended ASCII
> value.
>
> Now, I've set things up for UTF-8, expressly to avoid this kind of
> problem, I thought:
>
> var_dump(mysql_get_server_info($connection));
> var_dump(mysql_get_client_info());
> var_dump(mysql_client_encoding($connection));
>
> string(10) "5.0.26-log"
> string(6) "5.0.26"
> string(4) "utf8"
>
> I'm open to any advice about the correct solution to convert "extended
> ASCII" as typed in emails by tens of thousands of users on diverse
> systems, from diverse countries...
>
> Note that "extended ASCII" is inconsistent from Microsoft to Apple to
> Unix to ..., as far as I understand it, so I really can't be sure
> which charset the original user was using.
>
> Of course, for now, I'll just change any extended ASCII into space and
> move on with life... But that is not what I would like to end up with
> in the long run.
>
> And why is MySQL not just taking these extended ASCII chars in the
> first place? Seems to me that UTF-8 encoding should accept them, no?
>
> Disclaimer: I am so NOT hip to this encoding stuff...
>
> --
> Some people have a "gift" link here.
> Know what I want?
> I want you to buy a CD from some indie artist.
> http://cdbaby.com/browse/from/lynch
> Yeah, I get a buck. So?

--
Larry Garfield          AIM: LOLG42
larry@xxxxxxxxxxxxxxxx  ICQ: 6817012

"If nature has made any one thing less susceptible than all others of
exclusive property, it is the action of the thinking power called an idea,
which an individual may exclusively possess as long as he keeps it to himself;
but the moment it is divulged, it forces itself into the possession of every
one, and the receiver cannot dispossess himself of it." -- Thomas Jefferson

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
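
For anyone following the thread, here is a minimal sketch of the conversion
step discussed in the reply above: run each scraped string through
mb_convert_encoding() before it goes to the database. The helper name
to_utf8() and the sample string are hypothetical, not from the thread. The
sketch assumes the mbstring extension is loaded and, more importantly, that
any input that is not already valid UTF-8 is Windows-1252, which will not be
true for every page.

```php
<?php
// Hypothetical helper (not from the original thread): return $raw as UTF-8.
function to_utf8(string $raw): string
{
    // Input that is already valid UTF-8 (plain ASCII included) passes through.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Assumption: everything else is Windows-1252. Bytes 0xA0-0xFF mean the
    // same thing in ISO 8859-1, so mislabeled ISO 8859-1 pages convert too.
    return mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
}

// Sample input: 0xA3 (decimal 163) is the pound sign in Windows-1252, and
// 0x93 / 0x94 are the curly "smart quotes" that ISO 8859-1 does not have.
$scraped = "Price: \xA3" . "5, \x93on sale\x94";

$utf8 = to_utf8($scraped);

// $utf8 is now safe to hand to a connection whose character set is utf8.
echo $utf8, "\n";
```

The pound sign byte 0xA3 comes out as the two-byte UTF-8 sequence 0xC2 0xA3.
One caveat of checking for valid UTF-8 first: a Windows-1252 string whose
extended bytes happen to form valid UTF-8 sequences would be passed through
unconverted. That is rare in real text, but it is a guess, not a guarantee.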