strip tags but preserve title attributes

Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> · Mon, 14 Dec 2009 23:43:27 +0000

I'm looking for a way to strip HTML tags out of some text content
(sourced from a web page) to leave just the text which I'll be running
some basic analysis on. The thing is, I want to preserve text that is in
alt and title attributes. I can't use any DOM functions, as I can't
guarantee that the content will be valid XHTML, although it should be
valid HTML.

I'm happy doing this with string functions and regular expressions, but
I was wondering whether something for this already exists. I won't have
shell access on the server I plan on putting this on (although it is a
Linux server), so I won't be able to have Lynx or Elinks parse the
content for me either :(
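
To make the requirement concrete, here is a rough sketch of the behaviour
being asked for. It is written in Python purely for illustration (the
language is an assumption, not something the question specifies) and uses
the standard library's tolerant html.parser tokenizer rather than a strict
XML/DOM parser, so unclosed tags do not stop it. The names TextWithAttrs and
strip_tags_keep_attrs are hypothetical, and dropping the contents of script
and style elements is an extra assumption.

```python
from html.parser import HTMLParser


class TextWithAttrs(HTMLParser):
    """Collect text nodes plus the values of alt and title attributes."""

    KEEP_ATTRS = ("alt", "title")    # attributes whose text should survive
    SKIP_TAGS = ("script", "style")  # assumption: their contents are unwanted

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # text fragments in document order
        self._skip_depth = 0   # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        # Tag and attribute names arrive lowercased; values arrive unescaped.
        for name, value in attrs:
            if name in self.KEEP_ATTRS and value:
                self.parts.append(value)

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags_keep_attrs(html):
    """Return the text of `html` with tags removed and alt/title text kept."""
    parser = TextWithAttrs()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    # Join fragments with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    sample = '<p>Hello <img src="a.png" alt="a cat" title="Tom"> world</p>'
    print(strip_tags_keep_attrs(sample))
```

Fragments are joined with a space, so text split by inline tags in the
middle of a word (for example <b>H</b>ello) would come out as two tokens.
Whether that matters depends on the analysis being run on the result.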

Thanks,
Ash
http://www.ashleysheridan.co.uk