On Wed, Sep 10, 2008 at 10:35 AM, Tim Gustafson <tjg@xxxxxxxxxxxx> wrote:

> Nathan,
>
> Thanks for your help on this.
>
> I actually need to do this a different way I think though.  The problem is
> that I'm not just replacing a text entity with a link entity.  For example,
> consider this paragraph:
>
> <p>For information, please contact tjg@xxxxxxxxxxxxx</p>
>
> In this case, I want "tjg@xxxxxxxxxxxx" to be a link, but not the rest of
> the paragraph.  That means that the <p> entity has to be split into three
> separate entities - one DOMText for "For information, please contact ", one
> DOMEntity node for tjg@xxxxxxxxxxxx, and one DOMText node for ".".
>
> This seems doable with the DOM model, but complicated.  I'm thinking regular
> expressions might be the way to go again. :\

so use some regex :D

that's the only way i know of to determine if DOMText nodes contain email address(es) as substrings while retaining one's sanity...

i got it working, again by modifying the code from my original post and dropping in an additional clause which uses a regex to determine if there is an email address embedded in a DOMText node. it checks to see if the whole thing is an email address first, 'cause i think that's a little optimization, but it could be omitted.

here's the output of the script now (notice i changed the input text to incorporate the new issue):

nathan@devel ~/domIterator/initialTests $ php testDom.php
IN:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Test<br/><h2><b>quickshiftin@xxxxxxxxx</b></h2><p>text that we dont want to turn into a link..
quickshiftin@xxxxxxxxx</p><a name="bar">stuff inside the link</a>Foo<p>care</p><p>yoyser</p></body></html>

OUT:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Test<br/><h2><b><a href="mailto:quickshiftin@xxxxxxxxx">quickshiftin@xxxxxxxxx</a></b></h2><p>text that we dont want to turn into a link.. <a href="mailto:quickshiftin@xxxxxxxxx">quickshiftin@xxxxxxxxx</a></p><a name="bar">stuff inside the link</a>Foo<p>care</p><p>yoyser</p></body></html>

and here is the code; sorry for the lengthy post fellas, i just want to post all of it rather than attempting to illustrate only the segments i've changed:

<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><h2><b>quickshiftin@xxxxxxxxx</b></h2><p>text that we dont want to turn into a link.. quickshiftin@xxxxxxxxx</p><a name="bar">stuff inside the link</a>Foo<p>care</p><p>yoyser</p></body></html>');
echo 'IN:' . PHP_EOL . $doc->saveXML() . PHP_EOL;
findTextNodes($doc->getElementsByTagName('*'), 'convertToLinkIfNecc');
echo 'OUT: ' . PHP_EOL . $doc->saveXML() . PHP_EOL;

/**
 * run through a DOMNodeList, looking for text nodes.  apply a callback to
 * all such text nodes that are encountered
 */
function findTextNodes(DOMNodeList $nodesToSearch, $callback) {
    /// DOMNodeLists are live and the callback modifies the tree, so iterate over snapshots
    foreach(iterator_to_array($nodesToSearch) as $curNode) {
        if($curNode->hasChildNodes())
            foreach(iterator_to_array($curNode->childNodes) as $curChild)
                if($curChild instanceof DOMText)
                    call_user_func($callback, $curNode, $curChild);
    }
}

/**
 * determine if a node should be modified, by checking to see if a child is a text node
 * and the text looks like an email address.
 * call a subordinate function to convert the text node into a mailto anchor DOMElement
 */
function convertToLinkIfNecc(DOMElement $textContainer, DOMText $textNode) {
    /// per original request dont bother w/ a tags
    if(strtolower($textContainer->nodeName) === 'a')
        return;

    if(filter_var($textNode->nodeValue, FILTER_VALIDATE_EMAIL) !== false) {
        convertMailtoToAnchor($textContainer, $textNode);
    } else {
        /// lets see if there are email addresses buried in this text node
        /// regex taken from: http://www.regular-expressions.info/email.html
        /// preg_match_all so that every address is found, not just the first one
        preg_match_all('/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i', $textNode->nodeValue, $matches);
        if(count($matches[0]) > 0)
            rebuildTextNodeWithEmailAddrs($textContainer, $textNode, $matches[0]);
    }
}

/**
 * given a DOMText instance w/ one or more embedded email addresses, construct
 * a new set of nodes that contain the original text along w/ anchors for
 * all the bare email addresses
 */
function rebuildTextNodeWithEmailAddrs(DOMElement $textContainer, DOMText $textNode, array $emailAddrs) {
    $eltTokens = array();   /// the text / email tokens, in document order
    $nodeOrder = array();   /// parallel array: 't' for a text token, 'e' for an email addr
    $origText = $textNode->nodeValue;
    foreach($emailAddrs as $curAddr) {
        $startPos = strpos($origText, $curAddr);        // start pos of cur email
        $txtBuff = substr($origText, 0, $startPos);     // text ahead of the email; may be empty
        if($txtBuff !== '') {
            $eltTokens[] = $txtBuff;
            $nodeOrder[] = 't';     // indicate this token is a textNode
        }
        $eltTokens[] = $curAddr;
        $nodeOrder[] = 'e';         // indicate this token is an email addr
        $origText = substr($origText, $startPos + strlen($curAddr));
    }
    /// keep whatever text follows the last email address
    if((string) $origText !== '') {
        $eltTokens[] = $origText;
        $nodeOrder[] = 't';
    }

    /// now that we have the tokens, drop the replacements in ahead of the orig DOMText
    /// (so they keep its position among any siblings) and then delete it
    $ownerDoc = $textContainer->ownerDocument;
    foreach($eltTokens as $tokenIndex => $curToken) {
        $newText = $textContainer->insertBefore($ownerDoc->createTextNode($curToken), $textNode);
        if($nodeOrder[$tokenIndex] == 'e')
            convertMailtoToAnchor($textContainer, $newText);    // swaps the new text node for an anchor
    }
    $textContainer->removeChild($textNode);
}

/**
 * modify a DOMElement that has a DOMText node as a child; create a DOMElement
 * that represents an a tag, and set the
 * value and href attribute, so that it
 * acts as a 'mailto' link
 * @param $shouldReplaceChild boolean if true; replace $textNode by new node, otherwise append new node to $textContainer
 */
function convertMailtoToAnchor(DOMElement $textContainer, DOMText $textNode, $shouldReplaceChild=true) {
    $newNode = new DOMElement('a', $textNode->nodeValue);
    if($shouldReplaceChild)
        $textContainer->replaceChild($newNode, $textNode);
    else
        $textContainer->appendChild($newNode);
    /// set the href once the anchor is in the document
    $newNode->setAttribute('href', "mailto:{$textNode->nodeValue}");
}

essentially, what we do when encountering a DOMText that contains embedded email addresses is tokenize it, storing everything that's not an email address, and then the email addresses; so we have an array that looks like

{ some text that could be empty , emailAddr1@xxxxxxxx , more non-email text that could be empty , anotherEmail@xxxxxxxx, ... }

then we swap the original DOMText child node for new children, which are either DOMText instances or our souped-up DOMElement anchor tags for the email addresses.

-nathan
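
p.s. the tokenizing step described above could probably be sketched more compactly with preg_split(). this is only a rough sketch under a couple of assumptions: the helper name tokenizeEmails() is made up for illustration and isn't part of the script, the sample address is a placeholder, and the pattern is the same one from regular-expressions.info wrapped in a capture group.

```php
<?php
/**
 * hypothetical helper (not part of the script above): split a string into
 * text / email tokens, tagged 't' or 'e', in their original order
 */
function tokenizeEmails($text) {
    // the capture group makes PREG_SPLIT_DELIM_CAPTURE hand back the matched
    // addresses as well as the text between them
    $pattern = '/\b([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i';
    $pieces = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $tokens = array();
    foreach ($pieces as $index => $piece) {
        if ($index % 2 === 1) {
            // odd positions hold the captured delimiters, i.e. the addresses
            $tokens[] = array('e', $piece);
        } elseif ($piece !== '') {
            // even positions hold the surrounding text; skip empty runs
            $tokens[] = array('t', $piece);
        }
    }
    return $tokens;
}

print_r(tokenizeEmails('For information, please contact someone@example.com.'));
```

each token could then be turned into a DOMText or an anchor the same way the script does it; the only thing this changes is how the text gets cut up, and it keeps any text that trails the last address.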