i tried that kind of stuff - it did not seem to work. i will try again...

if anyone has any ideas, e.g. "use iconv to convert to A, then use DOM
stuff, then use iconv to move it back to UTF8..." etc., i am all ears.

On Tue, Feb 17, 2009 at 12:46 PM, Nathan Nobbe <quickshiftin@xxxxxxxxx> wrote:
> On Tue, Feb 17, 2009 at 12:40 PM, mike <mike503@xxxxxxxxx> wrote:
>>
>> Pardon the messy code, but I got this working like a charm. Then I
>> went to try it on some Russian content and it broke. The inbound was
>> utf-8 encoded Russian characters, output was something else
>> unintelligible.
>>
>> I found a PHP bug from years ago that sounded related but the user had
>> a workaround.
>>
>> Note that it does not appear that any of the functions break the
>> encoding - it is the ->saveHTML() that doesn't seem to work (I also
>> tried saveXML() and it did not work either).
>>
>> I am totally up for changing out using php's DOM and using another
>> library, basically I just want to traverse the DOM and pick out all <a
>> href> and <img src> and possibly any other external references in the
>> documents so I can run them through some link examination and such. I
>> figured I may have to fall back to a regexp, but PHP's DOM was so good
>> with even partial and malformed HTML, I was excited at how easy this
>> was...
>>
>> $dom = new domDocument;
>> @$dom->loadHTML($string);
>> $dom->preserveWhiteSpace = false;
>> $links = $dom->getElementsByTagName('a');
>> foreach($links as $tag) {
>>     $before = $tag->getAttribute('href');
>>     $after = strip_chars($before);
>>     $after = map_url($after);
>>     $after = fix_link($after);
>>     if($after != false) {
>>         echo "\tBEFORE: $before\n";
>>         echo "\tAFTER : $after\n\n";
>>         $tag->removeAttribute('href');
>>         $tag->setAttribute('href', $after);
>>     }
>> }
>> return $dom->saveHTML();
>> }
>>
>> I tried things like this:
>>
>> new DomDocument('1.0', 'UTF-8');
>>
>> as well as encoding options for $dom like $dom->encoding = 'utf-8' or
>> something (I tried so many variations I cannot remember anymore)
>>
>> Anyone have any ideas?
>>
>> As long as it can read in the string (which is and should always be
>> UTF-8) and spit out UTF-8, I can make sure any of my functions are
>> UTF-8 safe that handle the data...
>
> from the manual on DOM,
>
> Note: DOM extension uses UTF-8 encoding. Use utf8_encode() and utf8_decode()
> to work with texts in ISO-8859-1 encoding or Iconv for other encodings.
>
> -nathan
>
>
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
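
One idea along the lines asked for above, as a rough sketch rather than a confirmed fix for this case: when the input carries no charset declaration, loadHTML() tends to fall back to ISO-8859-1, so UTF-8 Russian text can get misread on the way in, before saveHTML() is ever called. A common workaround is to prepend a charset hint before parsing. The helper name collect_links() and the sample markup below are made up for illustration and are not from the thread; the sketch also assumes the input is a UTF-8 fragment with no charset declaration of its own.

```php
<?php
// Sketch only: collect_links() is a hypothetical helper, not code from the thread.
function collect_links($html)
{
    $dom = new DOMDocument();

    // Assumption: $html is UTF-8 and declares no charset itself.
    // The meta tag tells the underlying HTML parser to read the bytes as UTF-8
    // instead of guessing a single-byte encoding.
    $hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    // @ suppresses warnings about malformed markup, as in the original code.
    @$dom->loadHTML($hint . $html);

    $found = array();
    foreach ($dom->getElementsByTagName('a') as $tag) {
        $found[] = array(
            'href' => $tag->getAttribute('href'),
            'text' => $tag->textContent, // DOM hands strings back as UTF-8
        );
    }
    return $found;
}

// Sample input is invented for the example.
$links = collect_links('<p><a href="/novosti">Привет, мир</a></p>');
print_r($links);
```

If the link text comes back intact here, the same hint can be tried in front of $string in the rewriting loop. Whether saveHTML() then emits raw UTF-8 or numeric entities may depend on the libxml version in use, so that part is worth checking separately.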