Re: Using DOM textContent Property

On Wed, Sep 10, 2008 at 10:35 AM, Tim Gustafson <tjg@xxxxxxxxxxxx> wrote:

> Nathan,
>
> Thanks for your help on this.
>
> I actually need to do this a different way I think though.  The problem is
> that I'm not just replacing a text entity with a link entity.  For example,
> consider this paragraph:
>
> <p>For information, please contact tjg@xxxxxxxxxxxxx</p>
>
> In this case, I want "tjg@xxxxxxxxxxxx" to be a link, but not the rest of
> the paragraph.  That means that the <p> entity has to be split into three
> separate entities - one DOMText for "For information, please contact ", one
> DOMEntity node for tjg@xxxxxxxxxxxx, and one DOMText node for ".".
>
> This seems doable with the DOM model, but complicated.  I'm thinking regular
> expressions might be the way to go again.  :\


so use some regex :D  that's the only way i know of to determine whether DOMText
nodes contain email address(es) as substrings while retaining one's sanity...
i got it working, again by modifying the code from my original post and
dropping in an additional clause which uses regex to determine if there
is an email address embedded in a DOMText node.  it first checks whether
the whole thing is an email address, since i think that's a little optimization,
but that check could be omitted.  here's the output of the script now (notice i
changed the input text to incorporate the new issue):

nathan@devel ~/domIterator/initialTests $ php testDom.php
IN:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Test<br/><h2><b>quickshiftin@xxxxxxxxx</b></h2><p>text that we
dont want to turn into a link.. quickshiftin@xxxxxxxxx</p><a
name="bar">stuff inside the
link</a>Foo<p>care</p><p>yoyser</p></body></html>

OUT:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Test<br/><h2><b><a href="mailto:quickshiftin@xxxxxxxxx">quickshiftin@xxxxxxxxx</a></b></h2><p>text
that we dont want to turn into a link..
<a href="mailto:quickshiftin@xxxxxxxxx">quickshiftin@xxxxxxxxx</a></p><a
name="bar">stuff inside the
link</a>Foo<p>care</p><p>yoyser</p></body></html>

and here is the code; sorry for the lengthy post fellas, i just want to post
all of it rather than just attempting to illustrate the segments i've
changed:

<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body>Test<br><h2><b>quickshiftin@xxxxxxxxx</b></h2>'
    . '<p>text that we dont want to turn into a link.. quickshiftin@xxxxxxxxx</p>'
    . '<a name="bar">stuff inside the link</a>Foo<p>care</p><p>yoyser</p></body></html>'
);
echo 'IN:' . PHP_EOL . $doc->saveXML() . PHP_EOL;
findTextNodes($doc->getElementsByTagName('*'), 'convertToLinkIfNecc');
echo 'OUT: ' .  PHP_EOL . $doc->saveXML() . PHP_EOL;

/**
 * run through a DOMNodeList, looking for text nodes.  apply a callback to
 * all such text nodes that are encountered
 */
function findTextNodes(DOMNodeList $nodesToSearch, $callback) {
    /// DOMNodeList is live and the callback edits the tree, so iterate over
    /// array snapshots rather than over the lists themselves
    foreach(iterator_to_array($nodesToSearch) as $curNode) {
        if(!$curNode->hasChildNodes())
            continue;
        foreach(iterator_to_array($curNode->childNodes) as $curChild)
            if($curChild instanceof DOMText)
                call_user_func($callback, $curNode, $curChild);
    }
}

/**
 * determine if a node should be modified, by checking to see if a child is a
 * text node and the text looks like (or contains) an email address.
 * call a subordinate function to convert the text into mailto anchor
 * DOMElement(s)
 */
function convertToLinkIfNecc(DOMElement $textContainer, DOMText $textNode) {
    /// per original request dont bother w/ a tags
    if(strtolower($textContainer->nodeName) === 'a')
        return;
    if(filter_var($textNode->wholeText, FILTER_VALIDATE_EMAIL) !== false) {
        convertMailtoToAnchor($textContainer, $textNode);
    } else { /// lets see if theres an email buried in this text node
        /// regex taken from: http://www.regular-expressions.info/email.html
        /// preg_match_all so every address in the node is found, not just the first
        preg_match_all('/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i',
            $textNode->wholeText, $matches);
        if(count($matches[0]) > 0)
            rebuildTextNodeWithEmailAddrs($textContainer, $textNode, $matches[0]);
    }
}

/**
 * given a DOMText instance w/ one or more email addresses, construct
 * a new set of nodes that contain the original text along w/ anchors for
 * all the bare email addresses
 */
function rebuildTextNodeWithEmailAddrs(DOMElement $textContainer, DOMText $textNode,
        array $emailAddrs) {
    $eltTokens = array();    // text and email tokens, in document order
    $nodeOrder = array();    // parallel array: 't' = text node, 'e' = email addr
    /// construct array of elements
    $origText = $textNode->wholeText;
    foreach($emailAddrs as $curAddr) {
        $startPos = strpos($origText, $curAddr);    // start pos of cur email
        if($startPos === false)
            continue;    // addr not found in the remaining text; skip it
        if($startPos > 0) {    // non-empty text ahead of the email
            $eltTokens[] = substr($origText, 0, $startPos);
            $nodeOrder[] = 't';    // indicate this token is a textNode
        }
        $eltTokens[] = $curAddr;
        $nodeOrder[] = 'e';    // indicate this token is an email addr
        /// cast because substr() returns false at end of string before PHP 8
        $origText = (string) substr($origText, $startPos + strlen($curAddr));
    }
    if($origText !== '') {    // keep whatever trails the last email
        $eltTokens[] = $origText;
        $nodeOrder[] = 't';
    }
    /// now that we have the tokens, insert the replacements ahead of the orig
    /// DOMText (so sibling order is preserved) and then delete the orig
    foreach($eltTokens as $tokenIndex => $curToken) {
        $newText = $textContainer->insertBefore(new DOMText($curToken), $textNode);
        if($nodeOrder[$tokenIndex] == 'e')
            convertMailtoToAnchor($textContainer, $newText);
    }
    $textContainer->removeChild($textNode);
}

/**
 * modify a DOMElement that has a DOMText node as a child; create a DOMElement
 * that represents an a tag, and set the value and href attribute, so that it
 * acts as a 'mailto' link
 * @param $shouldReplaceChild boolean if true, replace $textNode with the new
 *        node; otherwise append the new node to $textContainer
 */
function convertMailtoToAnchor(DOMElement $textContainer, DOMText $textNode,
        $shouldReplaceChild=true) {
    $newNode = new DOMElement('a', $textNode->nodeValue);
    if($shouldReplaceChild)
        $textContainer->replaceChild($newNode, $textNode);
    else
        $textContainer->appendChild($newNode);
    /// href is set after attaching; the PHP docs describe a DOMElement made
    /// via the constructor as read only until it is added to a document
    $newNode->setAttribute('href', "mailto:{$textNode->nodeValue}");
}

essentially, what we do when encountering a DOMText that contains embedded
email addresses is tokenize its contents, storing everything that's not
an email address alongside the email addresses, in order; so we have an array
that looks like
{ some text that could be empty , emailAddr1@xxxxxxxx , more non-email text
that could be empty , anotherEmail@xxxxxxxx, ... }
then we swap the original DOMText child node for new children, which are
either DOMText instances or our souped-up DOMElement anchor tags for the
email addresses.
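
the tokenizing step on its own could also be sketched with preg_split(), which
returns the text and the addresses already interleaved. this is only a rough
sketch, not the code from the script: tokenizeEmailText() is a made-up helper
name, and the example.com address is a stand-in since the real addresses are
masked in the archive. it reuses the same regex, wrapped in a capture group so
that PREG_SPLIT_DELIM_CAPTURE keeps the addresses in the result.

```php
<?php
/**
 * hypothetical helper (not part of the script above): split a string into
 * ordered tokens, each array('t', text) or array('e', email address)
 */
function tokenizeEmailText($text) {
    // same pattern as the script, wrapped in ( ) so the delimiters are kept
    $pattern = '/(\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)/i';
    $pieces = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $tokens = array();
    foreach ($pieces as $index => $piece) {
        if ($index % 2 === 1) {
            // odd positions hold the captured delimiters, i.e. the addresses
            $tokens[] = array('e', $piece);
        } elseif ($piece !== '') {
            // even positions hold the text in between; skip empty stretches
            $tokens[] = array('t', $piece);
        }
    }
    return $tokens;
}

// usage: example.com is an assumed, illustrative address
foreach (tokenizeEmailText('For information, please contact tjg@example.com.') as $token) {
    echo $token[0] . ': ' . $token[1] . PHP_EOL;
}
```

each 't' token would become a DOMText and each 'e' token a mailto anchor, same
as in the loop at the end of rebuildTextNodeWithEmailAddrs().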

-nathan
