Re: HTML text extraction

Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> · Tue, 18 Aug 2009 09:41:07 +0100



On Tue, 2009-08-18 at 01:37 -0700, leledumbo wrote:
> Usually, a website gives a preview of its articles by extracting the first
> few characters. This is easy if the article is plain text, but what if
> it's HTML? For instance, if I have the full text:
> 
> <p>
>   bla bla bla
>   <ul>
>     <li>item 1</li>
>     <li>item 2</li>
>     <li>item 3</li>
>   </ul>
> </p>
> 
> and I take the first 40 characters, it would result in:
> 
> <p>
>   bla bla bla
>   <ul>
>     <li>item
> 
> As you can see, the tags are left unclosed, which might break the markup
> that follows the preview on the page. I need a way to solve this problem.
> 
You could do a couple of things:

      * Extract all the content and use strip_tags() to remove the HTML
        markup before truncating. In the example you gave, the result
        might look a bit odd, because the list items end up as one run
        of plain text.
      * Load the extracted content into the DOM and grab the textual
        content you need from the node values. That way the character
        limit applies to the visible text only, and with a bit of work
        you can preserve the original markup too, with every tag
        properly closed.
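
To make the two options above a bit more concrete, here is a rough sketch of each. The function names (plain_preview, html_preview, truncate_children) are made up for illustration; only strip_tags(), html_entity_decode() and the DOM classes are real PHP. It assumes the article is UTF-8 and that counting the characters inside text nodes, whitespace included, is good enough for a preview.

```php
<?php
// Sketch only: plain_preview(), html_preview() and truncate_children()
// are hypothetical helper names, not built-in PHP functions.

// Option 1: drop the markup, then cut the text.
function plain_preview(string $html, int $limit): string
{
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    $text = trim(preg_replace('/\s+/', ' ', $text)); // collapse whitespace runs
    // substr() counts bytes; swap in mb_substr() for multibyte text.
    return substr($text, 0, $limit);
}

// Option 2: cut the text inside the DOM so every tag stays closed.
function truncate_children(DOMNode $node, int &$remaining): void
{
    // Copy the list first: childNodes is live and changes as nodes are removed.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);          // budget spent, drop the rest
        } elseif ($child instanceof DOMText) {
            $len = $child->length;
            if ($len > $remaining) {
                // Keep only the first $remaining characters of this text node.
                $child->deleteData($remaining, $len - $remaining);
                $remaining = 0;
            } else {
                $remaining -= $len;
            }
        } elseif ($child instanceof DOMElement) {
            truncate_children($child, $remaining);
        }
    }
}

function html_preview(string $html, int $limit): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);            // tolerate sloppy markup
    // The wrapper <div> gives a known root; the XML declaration is a
    // common workaround to make loadHTML() read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();

    $root = $doc->getElementsByTagName('div')->item(0);
    truncate_children($root, $limit);

    // Serialise the wrapper's children, leaving the wrapper itself out.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

You would call it as something like echo html_preview($article, 40); where $article holds the full HTML. Bear in mind that the parser may rearrange invalid nesting (a list inside a paragraph, as in your example), so the markup you get back can differ slightly from what went in, but it should always be balanced.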

Thanks,
Ash
http://www.ashleysheridan.co.uk

