Stut wrote:
> Edward Kay wrote:
...
>>
>> A few thoughts I've had on this problem:
>>
>> Assuming it is well formed HTML, you could use a stack. Parse the string
>> putting all opening tags on the stack and then removing them when the
>> close tag is found. This will leave you with all the un-closed tags and
>> in the correct order.
>>
>> Remember that some tags can be self closing (<br /> etc).

I'd rather not go down this route, I think, because I feel it's too error
prone given the kind of crap one might expect as input (for instance)

>>
>> Could you use one of the XML extensions to build a DOM of the snippet?
>> I've not used them much myself, but I'm guessing they could handle most
>> of the tag stripping and matching...

Indeed, I am leaning towards an XML extension and/or Tidy with regard to
finding some kind of solution.

>
> I'm sure Tidy could be employed to do this job. Grab your target length
> of text, backtrack until you find < or >. If it's a < then chop that bit
> off. Then give it to Tidy to fix the HTML. That should close off any
> open tags and give you a properly formed snippet.
>
> Or you could implement the same functionality yourself quite easily, and
> it doesn't need to be well-formed HTML but you would need to check for
> tags that don't need closing (img, br, etc).

Stut, I'm going to go out on a limb and say 'bullshit' to those comments, no
personal offence meant (I mostly value your input *a lot*, but I guess we
can't agree all the time):

1. Simply backtracking to a '<' or '>' doesn't account for unescaped output
   within the content pieces/parts of the html, e.g.:

   <P>look this here character '<' amounts to invalid html<p>

2. Backtracking in the manner you describe doesn't account for a large amount
   of html at the beginning of the string. Remember that the aim of the game
   is to truncate the user-visible content of the html string in question.
   The comment below would be required to be part of the output (think html
   comment spam, which is quite effective and which some clients state as a
   requirement):

   <!-- imagine this comment is 1000s of chars long --><P>look at this</P>

You say 'you could implement the same functionality yourself quite easily'.
If that were true, then why does nobody have a ready-made solution? It's not
as if the requirement is 'way out there' as far as web development goes.

I don't think creating routines for doing advanced manipulation of XML/HTML
strings 'properly' is something very simple, which is attested to by the
relative complexity of libtidy and libxml2.

I'm not looking for a hack job here. I need something that can be relied on
to output html strings that will not break the validity (& layout) of a page,
regardless of the quality of the input.

The more I think about it, the more I'm inclined to think that Tidy is about
to become a close friend. :-)

>
> -Stut
>

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
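
As a rough illustration of the requirement argued in the message above (count only user-visible text, pass comments through untouched, cope with a stray '<', and close whatever is still open), here is a minimal sketch. It is written in Python with the standard library's `html.parser` purely for illustration; it is not Tidy, not one of the PHP XML extensions, and not anything proposed in the thread. The names `VisibleTextTruncator`, `truncate_html` and `VOID_TAGS` are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (assumed list, based on HTML void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleTextTruncator(HTMLParser):
    """Hypothetical helper: copies markup through, but stops once
    `limit` characters of visible text have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit   # visible characters still allowed
        self.out = []            # output fragments
        self.open_tags = []      # stack of tags awaiting a close
        self.done = False        # True once the text budget is spent

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br />: emit it, nothing to track.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Ignore stray close tags; otherwise unwind the stack to the match.
        if self.done or tag not in self.open_tags:
            return
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_comment(self, data):
        # Comments cost nothing against the visible-text budget.
        if not self.done:
            self.out.append(f"<!--{data}-->")

    def handle_data(self, data):
        if self.done:
            return
        piece = data[:self.remaining]
        # Re-escape so a bare '<' or '&' in the text cannot break the page.
        self.out.append(escape(piece, quote=False))
        self.remaining -= len(piece)
        if self.remaining <= 0:
            self.done = True


def truncate_html(html, limit):
    """Return `html` cut to `limit` visible characters, with open tags closed."""
    parser = VisibleTextTruncator(limit)
    parser.feed(html)
    parser.close()
    closing = "".join(f"</{t}>" for t in reversed(parser.open_tags))
    return "".join(parser.out) + closing


if __name__ == "__main__":
    print(truncate_html("<!-- imagine 1000s of chars --><P>look at this</P>", 4))
    print(truncate_html("<P>look this here character '<' amounts to invalid html<p>", 30))
```

Because only the text handed to `handle_data` is counted, a long leading comment survives in full, which a plain "cut at N characters and backtrack" approach would not manage. The sketch makes no attempt to repair badly nested input the way Tidy does, and it would also count the contents of `<script>` or `<style>` elements as visible text, so treat it as a starting point only.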