Stut wrote:
> Jochem Maas wrote:
>> Stut wrote:
>>> I'm sure Tidy could be employed to do this job. Grab your target length
>>> of text, backtrack until you find < or >. If it's a < then chop that bit
>>> off. Then give it to Tidy to fix the HTML. That should close off any
>>> open tags and give you a properly formed snippet.
>>>
>>> Or you could implement the same functionality yourself quite easily, and
>>> it doesn't need to be well-formed HTML but you would need to check for
>>> tags that don't need closing (img, br, etc).
>>
>> Stut, I'm going to go out on a limb and say 'bullshit' to those comments,
>> no personal offence meant (I mostly value your input *a lot*, but I
>> guess we can't agree all the time):
>
> None taken,

good.

> but I think you're making this into a bigger problem than it
> is.

okay.

>
>> 1. simply backtracking to a '<' or '>' doesn't account for unescaped
>> output within the content pieces/parts of the html, e.g.:
>>
>> <P>look this here character '<' amounts to invalid html<p>
>
> If the content is not escaped then it's not going to get displayed
> properly by the browser, so why worry about it?

because plenty of (most?) browsers do display it 'properly' even though the
actual HTML is invalid - most of my clients use that POS called IE - which
does an amazing job at turning dogshit tagsoup (especially M$ proprietary
generated cruft - think FrontPage, Word, Publisher etc) into something
'pretty' in the browser window (and usually displaying what the user/client
expected ... we'll not go into the flipside where valid HTML is borked when
output :-/ )

the client cares that things always display correctly - regardless of the
cruft that marketing drone X input into the [internal company] backend
system .. when it's 10-30K euros in turnover a day that's understandable.

>
>> 2. backtracking in the manner you describe doesn't account for a large
>> amount of html at the beginning of the string - remember that the aim
>> of the game is to truncate the user visible content of the html string
>> in question (the comment below would be required to be part of the
>> output - think html comment spam, which is quite effective and some
>> clients state as a requirement):
>>
>> <!-- imagine this comment is 1000s of chars long --><P>look at
>> this</P>
>
> Comments are easily removed before snippet generation with a simple regex.
>
>> you say 'you could implement the same functionality yourself quite
>> easily' - if that was true then why does nobody have a readymade
>> solution - it's not as if the requirement is 'way out there' as far as
>> web development goes. I don't think creating routines for doing
>> advanced manipulation of XML/HTML strings 'properly' is something very
>> simple, something that can be attested to by the relative complexity
>> of libtidy and libxml2.
>>
>> I'm not looking for a hack job here, I need something that can be
>> relied on to output html strings that will not break the validity
>> (& layout) of a page regardless of the quality of input.
>>
>> the more I think about it the more I'm inclined to think that Tidy is
>> about to become a close friend. :-)
>
> If I were you I would definitely be looking at Tidy because it's been
> well tested in the real world and will save a huge amount of effort.
> However, I don't think writing code to do this amounts to a hack. The
> rules of markup are very clear and with a little thought you could
> easily handle the vast majority of possible situations.

probably, but 'vast majority' != 'ALL' - and I guarantee marketing drone X
will break the system before I can turn my back, Murphy's law and all that!

1 word, short word, 2 syllables, first syllable sounds like 'sky'.
:-)

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
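For anyone who wants to see what 'truncate the user visible content' means in practice, here is a rough sketch of the idea argued over in this thread: count only visible text towards the limit, pass tags and comments through untouched, skip tags that never need closing (img, br, etc), and close whatever is still open at the cut. It is written in Python rather than PHP purely so it runs on the standard library alone; the names truncate_html, _Truncator and VOID_TAGS are made up for illustration and none of this is Tidy's API. It also ignores script/style content and does not try to validate the input, so treat it as a starting point and not as a replacement for Tidy.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (assumed list; extend as needed).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class _Truncator(HTMLParser):
    """Copy markup through, but stop once `limit` visible characters are emitted."""

    def __init__(self, limit):
        # convert_charrefs=True hands us decoded text, so "&amp;" counts as 1 char.
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.out = []          # pieces of the output string
        self.open_tags = []    # stack of tags still waiting for a close
        self.done = False      # set once the visible-text limit is reached

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit as-is, nothing to close later.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return  # past the cut, or a stray close tag: ignore it
        # Close everything up to and including the matching open tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_comment(self, data):
        # Comments are kept but do not count towards the visible length.
        if not self.done:
            self.out.append("<!--%s-->" % data)

    def handle_data(self, data):
        if self.done:
            return
        if len(data) >= self.remaining:
            data = data[:self.remaining]
            self.done = True
        self.remaining -= len(data)
        # Re-escape so a literal '<' or '&' in the text stays valid markup.
        self.out.append(escape(data, quote=False))


def truncate_html(html, limit):
    """Return `html` cut to at most `limit` visible characters, tags balanced."""
    parser = _Truncator(limit)
    parser.feed(html)
    parser.close()
    # Close whatever is still open at the cut point, innermost first.
    tail = "".join("</%s>" % t for t in reversed(parser.open_tags))
    return "".join(parser.out) + tail
```

With the comment-spam example from above, truncate_html('<!-- spam --><p>look at <b>this here</b> text</p>', 12) gives '<!-- spam --><p>look at <b>this</b></p>': the comment survives, only the 12 visible characters count, and both open tags get closed. Whether the result is good enough for marketing drone X's input is exactly the 'vast majority' != 'ALL' question, so running the output through Tidy afterwards still seems prudent.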