On Mon, Dec 14, 2009 at 6:43 PM, Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> wrote:
> I'm looking for a way to strip HTML tags out of some text content
> (sourced from a web page) to leave just the text which I'll be running
> some basic analysis on. The thing is, I want to preserve text that is in
> alt and title attributes. I can't use any DOM functions, as I can't
> guarantee that the content will be valid XHTML, although it should be
> valid HTML.
>
> I'm happy doing this with string functions and regular expressions, but
> I was wondering if something for this already existed? The server I plan
> on putting this on does not have access to the shell (although it is a
> Linux server) so I won't be able to have Lynx or Elinks parse the
> content for me either :(
>
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk
>

Are you sure you can't use DOM? It has a function specifically for parsing
HTML that "does not have to be well-formed to load."

http://www.php.net/manual/en/domdocument.loadhtml.php

If that doesn't work, you might look at Zend_Filter_StripTags in ZF. I don't
know if it will do exactly what you're after, but it seems to be more
flexible than the strip_tags function built into PHP.

Andrew
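
For what it's worth, here is a rough sketch of how the DOMDocument::loadHTML route might look for this use case. The function names extractTextWithAltAndTitle and collectText are made up for illustration and do not come from PHP or ZF. The sketch assumes PHP 7.1 or later (for the type declarations), that script and style contents should be skipped, and that the input is ASCII or declares its own charset, since loadHTML otherwise guesses the encoding.

```php
<?php
// Hypothetical helper (not part of PHP or ZF): returns the visible text of an
// HTML fragment plus the values of any alt and title attributes, in document order.
function extractTextWithAltAndTitle(string $html): string
{
    // Collect libxml warnings internally so sloppy markup does not emit PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return '';
    }

    $pieces = [];
    collectText($doc, $pieces);

    // Collapse runs of whitespace so the result is easier to analyse.
    return trim((string) preg_replace('/\s+/', ' ', implode(' ', $pieces)));
}

// Hypothetical recursive walker: appends text nodes and alt/title values to $pieces.
function collectText(DOMNode $node, array &$pieces): void
{
    if ($node instanceof DOMText) {
        $pieces[] = $node->nodeValue;
        return;
    }

    if ($node instanceof DOMElement) {
        $tag = strtolower($node->nodeName);
        if ($tag === 'script' || $tag === 'style') {
            return; // assumption: code and CSS are not wanted in the analysis
        }
        foreach (['alt', 'title'] as $attribute) {
            if ($node->hasAttribute($attribute)) {
                $pieces[] = $node->getAttribute($attribute);
            }
        }
    }

    // Walk children via firstChild/nextSibling; this is null-safe for
    // node types that cannot have children (doctype, comments).
    for ($child = $node->firstChild; $child !== null; $child = $child->nextSibling) {
        collectText($child, $pieces);
    }
}

$sample = '<p>Hello <img src="a.png" alt="a cat" title="Tom"> world</p>'
        . '<script>var x = 1;</script>';
echo extractTextWithAltAndTitle($sample), "\n"; // prints "Hello a cat Tom world"
```

Note that the contents of a page's <title> element are ordinary text nodes, so they end up in the output too. Whether that is wanted depends on the analysis.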