On Tue, Feb 17, 2009 at 12:40 PM, mike <mike503@xxxxxxxxx> wrote:

> Pardon the messy code, but I got this working like a charm. Then I
> went to try it on some Russian content and it broke. The inbound was
> utf-8 encoded Russian characters, output was something else
> unintelligible.
>
> I found a PHP bug from years ago that sounded related but the user had
> a workaround.
>
> Note that it does not appear that any of the functions break the
> encoding - it is the ->saveHTML() that doesn't seem to work (I also
> tried saveXML() and it did not work either).
>
> I am totally up for changing out using php's DOM and using another
> library, basically I just want to traverse the DOM and pick out all <a
> href> and <img src> and possibly any other external references in the
> documents so I can run them through some link examination and such. I
> figured I may have to fall back to a regexp, but PHP's DOM was so good
> with even partial and malformed HTML, I was excited at how easy this
> was...
>
> $dom = new domDocument;
> @$dom->loadHTML($string);
> $dom->preserveWhiteSpace = false;
> $links = $dom->getElementsByTagName('a');
> foreach($links as $tag) {
>     $before = $tag->getAttribute('href');
>     $after = strip_chars($before);
>     $after = map_url($after);
>     $after = fix_link($after);
>     if($after != false) {
>         echo "\tBEFORE: $before\n";
>         echo "\tAFTER : $after\n\n";
>         $tag->removeAttribute('href');
>         $tag->setAttribute('href', $after);
>     }
> }
> return $dom->saveHTML();
> }
>
> I tried things like this:
>
> new DomDocument('1.0', 'UTF-8');
>
> as well as encoding options for $dom like $dom->encoding = 'utf-8' or
> something (I tried so many variations I cannot remember anymore)
>
> Anyone have any ideas?
>
> As long as it can read in the string (which is and should always be
> UTF-8) and spit out UTF-8, I can make sure any of my functions are
> UTF-8 safe that handle the data...

From the manual on DOM:

*Note*: DOM extension uses UTF-8 encoding.
Use utf8_encode() <http://us.php.net/manual/en/function.utf8-encode.php> and utf8_decode() <http://us.php.net/manual/en/function.utf8-decode.php> to work with texts in ISO-8859-1 encoding or Iconv <http://us.php.net/manual/en/ref.iconv.php> for other encodings.

-nathan
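
Since the input in the question is already UTF-8, converting it may not be the whole story. Here is a minimal sketch of one commonly used workaround, offered as an assumption rather than something confirmed in this thread: loadHTML() is given no charset hint for a bare fragment, so the sketch prepends a Content-Type meta tag before parsing. The helper name load_utf8_html() and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper (not from the thread): tell the HTML parser that the
// input is UTF-8 by prepending a Content-Type meta tag before parsing.
function load_utf8_html($html)
{
    $dom = new DOMDocument();

    // Collect parse warnings for malformed HTML instead of silencing with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Sample input (assumed): a UTF-8 fragment with Russian link text.
$dom = load_utf8_html('<p><a href="http://example.com/">Привет</a></p>');

foreach ($dom->getElementsByTagName('a') as $tag) {
    // Attribute and text values come back from DOM as UTF-8 strings.
    echo $tag->getAttribute('href'), ' ', $tag->textContent, "\n";
}
```

The prepended meta element becomes part of the parsed document, so it should also appear in whatever saveHTML() returns; remove it afterwards if that matters for the output.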