Re: strip tags but preserve title attributes

Andrew Ballard <aballard@xxxxxxxxx> · Tue, 15 Dec 2009 09:32:08 -0500

On Mon, Dec 14, 2009 at 6:43 PM, Ashley Sheridan
<ash@xxxxxxxxxxxxxxxxxxxx> wrote:
> I'm looking for a way to strip HTML tags out of some text content
> (sourced from a web page) to leave just the text which I'll be running
> some basic analysis on. The thing is, I want to preserve text that is in
> alt and title attributes. I can't use any DOM functions, as I can't
> guarantee that the content will be valid XHTML, although it should be
> valid HTML.
>
> I'm happy doing this with string functions and regular expressions, but
> I was wondering if something for this already existed? The server I plan
> on putting this on does not have access to the shell (although it is a
> Linux server) so I won't be able to have Lynx or Elinks parse the
> content for me either :(
>
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk
>

Are you sure you can't use DOM? DOMDocument::loadHTML() is meant
specifically for parsing HTML, and according to the manual the source
"does not have to be well-formed to load."

http://www.php.net/manual/en/domdocument.loadhtml.php
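
For what it's worth, here is a rough sketch of how that could look for
your case. It is untested against real-world pages, and the function name
text_with_alt_and_title() is just something I made up for illustration.
It assumes the dom extension is enabled and that the input is ASCII or
declares its own charset, since loadHTML() otherwise guesses the encoding.

```php
<?php
// Hypothetical helper: returns the visible text of an HTML fragment
// together with the values of any alt and title attributes.
function text_with_alt_and_title($html)
{
    // Nothing to parse; loadHTML() rejects an empty string.
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();

    // Collect libxml's complaints about sloppy markup instead of
    // letting them surface as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Text nodes outside <script> and <style>, plus every alt and
    // title attribute node.
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]'
           . ' | //@alt | //@title';

    $pieces = array();
    foreach ($xpath->query($query) as $node) {
        $value = trim($node->nodeValue);
        if ($value !== '') {
            $pieces[] = $value;
        }
    }

    return implode(' ', $pieces);
}
```

Calling it on something like
'<p>Hello <img src="a.png" alt="A cat"><b>unclosed' should give you a
plain string containing "Hello", "A cat" and "unclosed" with no markup
left in it. Whether the parser copes with the pages you actually have is
something you would need to check yourself.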

If that doesn't work, you might look at Zend_Filter_StripTags in Zend
Framework. I don't know if it will do exactly what you're after, but it
seems to be more flexible than PHP's built-in strip_tags() function.

Andrew

-- 
PHP General Mailing List (http://www.php.net/)