PLEASE help, this is driving me crazy - is saveHTML() etc not UTF-8 capable?

mike <mike503@xxxxxxxxx> · Tue, 17 Feb 2009 11:40:07 -0800

Pardon the messy code, but I had this working like a charm. Then I
tried it on some Russian content and it broke. The input was UTF-8
encoded Russian text; the output was unintelligible garbage.

I found a PHP bug from years ago that sounded related but the user had
a workaround.

Note that none of the functions appear to break the encoding. It is
->saveHTML() that doesn't seem to work (I also tried saveXML(), and
that did not work either).

I am totally open to dropping PHP's DOM extension and using another
library. Basically, I just want to traverse the DOM and pick out all <a
href> and <img src> values, and possibly any other external references
in the documents, so I can run them through some link examination. I
figured I might have to fall back to a regexp, but PHP's DOM was so good
with even partial and malformed HTML that I was excited at how easy this
was...
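
Roughly, the traversal I have in mind could look like the sketch below.
collect_references() is a made-up name, and the tag/attribute pairs are
only a guess at what I would end up checking; it is an illustration, not
code I am running:

```php
<?php
// Hypothetical helper: gather href/src values from an HTML string.
function collect_references($html) {
    $dom = new DOMDocument();
    // @ hides the warnings libxml raises for partial or malformed HTML.
    @$dom->loadHTML($html);

    $refs = array();
    // Tag name => attribute that holds the external reference.
    $targets = array('a' => 'href', 'img' => 'src');
    foreach ($targets as $tagName => $attr) {
        foreach ($dom->getElementsByTagName($tagName) as $node) {
            // Skip tags that lack the attribute, e.g. <a name="...">.
            if ($node->hasAttribute($attr)) {
                $refs[] = $node->getAttribute($attr);
            }
        }
    }
    return $refs;
}
```

More pairs (say 'link' => 'href') could be added to $targets later. What
I actually have right now rewrites the href values in place: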

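// Function name is a placeholder for this post.
function rewrite_links($string) {
        $dom = new DOMDocument;
        // Must be set before loading to have any effect.
        $dom->preserveWhiteSpace = false;
        // @ suppresses the warnings libxml emits for malformed HTML.
        @$dom->loadHTML($string);
        $links = $dom->getElementsByTagName('a');
        foreach ($links as $tag) {
                $before = $tag->getAttribute('href');
                // strip_chars(), map_url() and fix_link() are my own helpers.
                $after = strip_chars($before);
                $after = map_url($after);
                $after = fix_link($after);
                if ($after != false) {
                        echo "\tBEFORE: $before\n";
                        echo "\tAFTER : $after\n\n";
                        // setAttribute() overwrites the existing value.
                        $tag->setAttribute('href', $after);
                }
        }
        return $dom->saveHTML();
}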

I tried things like this:

new DomDocument('1.0', 'UTF-8');

as well as setting the encoding on $dom directly, e.g. $dom->encoding =
'utf-8' or something similar (I tried so many variations that I cannot
remember them all).
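
A stripped-down sketch of the round trip, to show how the output can be
checked. The sample string and the $survived variable are made up for
illustration, and the file is assumed to be saved as UTF-8:

```php
<?php
// Made-up sample input: a UTF-8 fragment with no charset declaration.
$string = "<p>Привет</p>";

$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($string);
$out = $dom->saveHTML();

// True only if the Cyrillic text came back byte-for-byte.
$survived = (strpos($out, 'Привет') !== false);
var_dump($survived);
echo $out;
```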

Anyone have any ideas?

As long as it can read in the string (which is, and should always be,
UTF-8) and spit out UTF-8, I can make sure that any of my functions
that handle the data are UTF-8 safe...

Thanks
