Re: PLEASE help, this is driving me crazy - is saveHTML() etc not UTF-8 capable?

On Tue, Feb 17, 2009 at 12:40 PM, mike <mike503@xxxxxxxxx> wrote:

> Pardon the messy code, but I got this working like a charm. Then I
> went to try it on some Russian content and it broke. The inbound was
> utf-8 encoded Russian characters, output was something else
> unintelligible.
>
> I found a PHP bug from years ago that sounded related but the user had
> a workaround.
>
> Note that it does not appear that any of the functions break the
> encoding - it is the ->saveHTML() call that doesn't seem to work (I also
> tried saveXML() and it did not work either).
>
> I am totally up for changing out using php's DOM and using another
> library, basically I just want to traverse the DOM and pick out all <a
> href> and <img src> and possibly any other external references in the
> documents so I can run them through some link examination and such. I
> figured I may have to fall back to a regexp, but PHP's DOM was so good
> with even partial and malformed HTML, I was excited at how easy this
> was...
>
>        $dom = new DOMDocument;
>        // must be set before loading to have any effect
>        $dom->preserveWhiteSpace = false;
>        // @ silences the warnings raised by malformed HTML
>        @$dom->loadHTML($string);
>        $links = $dom->getElementsByTagName('a');
>        foreach($links as $tag) {
>                $before = $tag->getAttribute('href');
>                $after = strip_chars($before);
>                $after = map_url($after);
>                $after = fix_link($after);
>                if($after != false) {
>                        echo "\tBEFORE: $before\n";
>                        echo "\tAFTER : $after\n\n";
>                        // setAttribute() overwrites the existing value
>                        $tag->setAttribute('href', $after);
>                }
>        }
>        return $dom->saveHTML();
> }
>
> I tried things like this:
>
> new DomDocument('1.0', 'UTF-8');
>
> as well as encoding options for $dom like $dom->encoding = 'utf-8' or
> something (I tried so many variations I cannot remember anymore)
>
> Anyone have any ideas?
>
> As long as it can read in the string (which is and should always be
> UTF-8) and spit out UTF-8, I can make sure any of my functions are
> UTF-8 safe that handle the data...


From the manual on DOM:

*Note*: DOM extension uses UTF-8 encoding. Use utf8_encode()
<http://us.php.net/manual/en/function.utf8-encode.php> and utf8_decode()
<http://us.php.net/manual/en/function.utf8-decode.php> to work with texts in
ISO-8859-1 encoding, or Iconv <http://us.php.net/manual/en/ref.iconv.php>
for other encodings.
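
The note covers the DOM's internal encoding, but not how loadHTML() decides
what encoding the input is in. A rough sketch of one commonly used workaround
follows. It assumes that loadHTML() falls back to ISO-8859-1 when the markup
declares no charset, and that prepending an XML encoding declaration makes the
parser read the bytes as UTF-8. The helper name extract_references() and its
return shape are made up for illustration and are not from the original post.

```php
<?php
// Hypothetical helper: collect <a href> and <img src> values from a UTF-8
// HTML string. Each entry is [tag name, URL, text content].
function extract_references(string $html): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    // Assumption: with no charset declared, the parser guesses ISO-8859-1,
    // so hint UTF-8 by prepending an encoding declaration.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $refs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $refs[] = ['a', $a->getAttribute('href'), $a->textContent];
        }
    }
    foreach ($dom->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            $refs[] = ['img', $img->getAttribute('src'), ''];
        }
    }
    return $refs;
}
```

Values read back through getAttribute() and textContent are UTF-8 strings, so
the link-checking functions can work on them directly. If the document is
written back out with saveHTML(), the result may still depend on whether the
markup carries a charset declaration, so it is worth checking the output on a
sample of the Russian content.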

-nathan
