Re: Special HTML characters question.

"Richard Lynch" <ceo@xxxxxxxxx> · Tue, 23 Aug 2005 00:18:46 -0700 (PDT)

On Mon, August 22, 2005 6:29 am, Jay Paulson wrote:
> I have a problem that I'm sure some of you have run into before,
> therefore I hope you all know of an easy solution.  Some of my users
> are cutting and pasting text from Word into text fields that are being
> saved into a database then from that database being displayed on a web
> page.  The problem occurs when some special characters are being used.
> Double quotes, single quotes, and other characters like accents etc
> have the special html code like &quote; etc replacing the special
> characters.  What methods are being used to combat this issue?  Is
> there a solution out there to run text through some sort of filter
> before submitting it to the database to look for these special
> characters and then replacing them?

You are not alone.

There are innumerable User Contributed notes in the on-line manual on
this topic, scattered through various function pages.

Some of them have some rather nice functions for finding and replacing
all the Microsoft crap with something that will work in all browsers.

I would recommend doing this before you insert into the database.

Yes, it's generally a Bad Idea to munge data before inserting, but in
this case, you will never, ever, ever want those nasty characters
again.

They are Windows-1252 bytes in the 128-159 range, which ISO-8859-1
reserves for control characters, so they have no meaning on a page
served as ISO-8859-1.  MS IE "embrace and extend" (cough, cough) will
display them anyway, so all the MS Word, FrontPage, IE users tend to
use them, not realizing just how UGLY their site is for everybody else.

So unless you want to abandon all other browsers, and turn into a
Microsoft drone, you might as well replace them once, at insert, and
be done with it.

You don't need anything as fancy as a Regex, and the list of
characters has already been painstakingly built for you by people in
the User Contributed notes.

For that matter, they wrote the function you can copy/paste as well.

Here's the one I liked enough to steal:

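  // Replace Windows-1252 bytes 128-159 (MS Word "smart" characters)
  // with numeric HTML entities that any browser can render.
  function un_microsuck($text){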
    static $chars = array(
        128 => '&#8364;',
        130 => '&#8218;',
        131 => '&#402;',
        132 => '&#8222;',
        133 => '&#8230;',
        134 => '&#8224;',
        135 => '&#8225;',
        136 => '&#710;',
        137 => '&#8240;',
        138 => '&#352;',
        139 => '&#8249;',
        140 => '&#338;',
        142 => '&#381;',
        145 => '&#8216;',
        146 => '&#8217;',
        147 => '&#8220;',
        148 => '&#8221;',
        149 => '&#8226;',
        150 => '&#8211;',
        151 => '&#8212;',
        152 => '&#732;',
        153 => '&#8482;',
        154 => '&#353;',
        155 => '&#8250;',
        156 => '&#339;',
        158 => '&#382;',
        159 => '&#376;');
    $text = str_replace(array_map('chr', array_keys($chars)), $chars, $text);
    return $text;
  }

That's not going to be TOO horribly slow unless you've got a TON of
data to run through.

Feel free to benchmark it and see.
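
If you do want numbers, a rough timing loop is probably enough.  This is
only a sketch: bench_seconds() and two_quote_replace() are names I made
up for illustration, the two-entry replacement is a cut-down stand-in for
the full function above, and the sample text is invented.

```php
<?php
// Hypothetical helper: time $iterations calls of $fn on $text, in seconds.
function bench_seconds($fn, $text, $iterations) {
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        call_user_func($fn, $text);
    }
    return microtime(true) - $start;
}

// Stand-in for un_microsuck() with only two entries, to keep this short.
function two_quote_replace($text) {
    return str_replace(
        array(chr(147), chr(148)),
        array('&#8220;', '&#8221;'),
        $text
    );
}

// Invented sample: a "smart quoted" phrase repeated 1000 times.
$sample = str_repeat(chr(147) . 'quoted' . chr(148) . ' plain text ', 1000);
$elapsed = bench_seconds('two_quote_replace', $sample, 100);
printf("100 runs: %.4f seconds\n", $elapsed);
```

Swap in the real function name and a chunk of your own data to get a
figure that means something for your site.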

I never have, since I'm only handling one or two INPUT fields from
copy/paste at a time by a site administrator.

If it's "too slow" you might be able to tweak it so the array_map()
result is built once and kept in a static variable inside the function,
instead of being rebuilt on every call, to get a bit more speed out of
it.
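
One way that caching could look, as a sketch only.  The name
un_microsuck_cached() is mine, not from the manual notes; the table is
the same one as above.  The search and replace arrays are built on the
first call and reused after that.

```php
<?php
function un_microsuck_cached($text) {
    // Built on the first call, then reused on every later call.
    static $search = null;
    static $replace = null;
    if ($search === null) {
        $chars = array(
            128 => '&#8364;', 130 => '&#8218;', 131 => '&#402;',
            132 => '&#8222;', 133 => '&#8230;', 134 => '&#8224;',
            135 => '&#8225;', 136 => '&#710;',  137 => '&#8240;',
            138 => '&#352;',  139 => '&#8249;', 140 => '&#338;',
            142 => '&#381;',  145 => '&#8216;', 146 => '&#8217;',
            147 => '&#8220;', 148 => '&#8221;', 149 => '&#8226;',
            150 => '&#8211;', 151 => '&#8212;', 152 => '&#732;',
            153 => '&#8482;', 154 => '&#353;',  155 => '&#8250;',
            156 => '&#339;',  158 => '&#382;',  159 => '&#376;');
        // Byte values become one-character strings to search for.
        $search = array_map('chr', array_keys($chars));
        $replace = array_values($chars);
    }
    return str_replace($search, $replace, $text);
}

// Prints: &#8220;Hi&#8221;&#8230;
echo un_microsuck_cached(chr(147) . 'Hi' . chr(148) . chr(133)), "\n";
```

Whether this is measurably faster depends on how often you call it per
request; for one or two fields it likely makes no visible difference.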

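Everything else from MS Word copy/paste can be handled by
htmlentities() upon OUTPUT to the browser.  Just make sure it does not
re-encode the "&" in the entities you already stored (if your PHP has
the double_encode argument, passing false turns that off).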
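
A sketch of the output side, under a couple of assumptions: the stored
text is single-byte ISO-8859-1, it already contains the numeric entities
from the replacement step, and your PHP is new enough (5.2.3 or later)
to have the fourth double_encode argument.  show_text() is a made-up
name for illustration.

```php
<?php
// Hypothetical helper: escape stored text for display in a page.
// The final "false" keeps existing entities such as &#8217; intact
// instead of turning their "&" into "&amp;".
function show_text($stored) {
    return htmlentities($stored, ENT_QUOTES, 'ISO-8859-1', false);
}

// Prints: caf&eacute; &#8217;s &lt;b&gt;menu&lt;/b&gt;
echo show_text("caf\xE9 &#8217;s <b>menu</b>"), "\n";
```

Without that last argument the stored entities would show up literally
in the browser as "&#8217;" text rather than as the character.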

You may need to re-name the function for inclusion in a day-job
corporate project/product... :-)
