validating and sanitizing input string encoding

Tom Worster <fsb@xxxxxxxxxx> · Fri, 27 Mar 2009 15:56:31 -0400

the article at http://devlog.info/2008/08/24/php-and-unicode-utf-8, among
other web pages, suggests checking for valid utf-8 string encoding using
(strlen($str) && !preg_match('/^.{1}/us', $str)). however, another article,
http://www.phpwact.org/php/i18n/charsets, says this cannot be trusted. i
work exclusively with mbstring environments so i could use
mb_check_encoding().
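
for reference, a minimal sketch of what that check looks like, assuming the
mbstring extension is loaded. the sample byte strings are made up for
illustration; only the first should pass:

```php
<?php
// illustrative inputs (assumptions, not taken from either article)
$samples = array(
    'valid'     => "caf\xC3\xA9",   // e-acute as a proper two-byte sequence
    'truncated' => "caf\xC3",       // lead byte with no continuation byte
    'overlong'  => "\xC0\xAF",      // overlong encoding of "/"
);

foreach ($samples as $label => $s) {
    // mb_check_encoding() returns true only for well-formed input
    echo $label, ': ', mb_check_encoding($s, 'UTF-8') ? 'ok' : 'bad', "\n";
}
```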

which leads to the question of what to do if mb_check_encoding() indicates
bad input?

i don't want to throw the form back to the user because most of my users
will not be able to rectify the input. errors in the data are undesirable,
of course, but in my application, not disastrous. so i'm inclined to the
approach mentioned here:
http://blog.liip.ch/archive/2005/01/24/how-to-get-rid-of-invalid-utf-8-characters.html,
i.e. iconv("UTF-8","UTF-8//IGNORE",$t), which will quietly
eliminate badly formed characters and move on (iconv will throw a notice on
bad utf-8).
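
a rough sketch of that approach on a single string, assuming both the iconv
and mbstring extensions are available. strip_bad_utf8() is a hypothetical
name, and the mb_convert_encoding() fallback is my own assumption in case
iconv returns false instead of the stripped string; it is not part of the
linked article:

```php
<?php
// hypothetical helper: returns $s with malformed utf-8 removed or replaced
function strip_bad_utf8($s) {
    // @ hides the notice iconv raises on malformed input
    $out = @iconv('UTF-8', 'UTF-8//IGNORE', $s);
    if ($out === false || !mb_check_encoding($out, 'UTF-8')) {
        // assumption: fall back to mbstring, which substitutes a
        // placeholder character for bad sequences instead of dropping them
        $out = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    }
    return $out;
}

$dirty = "caf\xC3\xA9 \xC3\x28 ok";   // the second \xC3 has no valid continuation
$clean = strip_bad_utf8($dirty);
var_dump(mb_check_encoding($clean, 'UTF-8'));
```

whether the bad byte is dropped or replaced depends on which branch runs, so
the exact cleaned string may differ between systems.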

so i'm considering using a function like this:

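// recursively strip malformed utf-8 from a string or (nested) array, in place
function clean_input(&$a) {
    if (is_array($a)) {
        foreach ($a as &$v) {
            clean_input($v);
        }
        unset($v); // drop the reference left behind by foreach
    } elseif (is_string($a) && !mb_check_encoding($a, 'UTF-8')) {
        // only touch strings that fail the check; iconv raises a notice here
        $a = iconv('UTF-8', 'UTF-8//IGNORE', $a);
    }
}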

and calling it on $_POST or $_GET as appropriate at the top of any script
that uses those superglobals.

it seems a bit lazy to me but that's my nature and i think this might be
good enough. any thoughts?
