Jochem Maas wrote:
Stut wrote:
I'm sure Tidy could be employed to do this job. Grab your target length
of text, backtrack until you find < or >. If it's a < then chop that bit
off. Then give it to Tidy to fix the HTML. That should close off any
open tags and give you a properly formed snippet.
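The backtracking step described above can be sketched roughly as follows. This is only an illustration in Python; the function name chop_partial_tag is made up for the example, and the hand-off to Tidy is not shown.

```python
def chop_partial_tag(html: str, limit: int) -> str:
    """Cut html to at most `limit` characters without ending inside a tag."""
    piece = html[:limit]
    last_open = piece.rfind("<")
    last_close = piece.rfind(">")
    # A '<' with no '>' after it means the cut landed inside a tag,
    # so drop everything from that '<' onwards.
    if last_open > last_close:
        piece = piece[:last_open]
    return piece
```

The result can still contain unclosed elements; closing those is the part left to Tidy.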
Or you could implement the same functionality yourself quite easily, and
it doesn't need to be well-formed HTML but you would need to check for
tags that don't need closing (img, br, etc).
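A rough sketch of the do-it-yourself variant, in Python for illustration: track open tags on a stack, skip the ones that never take a closing tag, and append whatever is still open. The names close_open_tags, VOID_TAGS and TAG_RE are invented for this example, the tag list is partial, and the regex assumes no '>' appears inside attribute values.

```python
import re

# Elements that never take a closing tag (a partial list, for illustration).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr",
             "img", "input", "link", "meta", "source", "wbr"}

# Matches opening, closing and self-closed tags; assumes attribute values
# contain no '>' characters.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>")


def close_open_tags(snippet: str) -> str:
    """Append closing tags for every element still open in snippet."""
    stack = []
    for closing, name, self_closed in TAG_RE.findall(snippet):
        name = name.lower()
        if name in VOID_TAGS or self_closed:
            continue
        if closing:
            # Pop back to the matching opener, if there is one;
            # stray closing tags are ignored.
            if name in stack:
                while stack.pop() != name:
                    pass
        else:
            stack.append(name)
    # Close the remaining elements, innermost first.
    return snippet + "".join(f"</{name}>" for name in reversed(stack))
```

It expects a snippet that does not end in the middle of a tag, so the backtracking step has to run first.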
Stut, I'm going to go out on a limb and say 'bullshit' to those comments,
no personal offence meant (I mostly value your input *a lot*, but I guess we
can't agree all the time):
None taken, but I think you're making this into a bigger problem than it is.
1. simply backtracking to a '<' or '>' doesn't account for unescaped output
within the content pieces/parts of the HTML, e.g.:
<P>look this here character '<' amounts to invalid html</P>
If the content is not escaped then it's not going to get displayed
properly by the browser, so why worry about it?
2. backtracking in the manner you describe doesn't account for a large amount of
HTML at the beginning of the string - remember that the aim of the game is
to truncate the user-visible content of the HTML string in question
(the comment below would be required to be part of the output - think HTML comment
spam, which is quite effective and which some clients state as a requirement):
<!-- imagine this comment is 1000s of chars long --><P>look at this</P>
Comments are easily removed before snippet generation with a simple regex.
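For illustration, a comment-stripping regex of the kind meant here might look like this in Python. The name strip_comments is hypothetical, and the pattern assumes every comment is properly terminated with '-->'.

```python
import re

# Non-greedy match so that each comment is removed separately;
# DOTALL lets a comment span several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html: str) -> str:
    """Remove HTML comments before the snippet is generated."""
    return COMMENT_RE.sub("", html)
```

An unterminated comment is left untouched by this pattern, so that case would still need separate handling.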
you say 'you could implement the same functionality yourself quite easily' - if
that were true then why does nobody have a ready-made solution? it's not as if the
requirement is 'way out there' as far as web development goes. I don't think creating routines
for doing advanced manipulation of XML/HTML strings 'properly' is something very simple,
something that can be attested to by the relative complexity of libtidy and libxml2.
I'm not looking for a hack job here, I need something that can be relied on to output
html strings that will not break the validity (& layout) of a page regardless of the
quality of input.
the more I think about it the more I'm inclined to think that Tidy is about to become
a close friend. :-)
If I were you I would definitely be looking at Tidy because it's been
well tested in the real world and will save a huge amount of effort.
However, I don't think writing code to do this amounts to a hack. The
rules of markup are very clear and with a little thought you could
easily handle the vast majority of possible situations.
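As a sketch of what handling the common cases could look like, the following Python example counts only the visible text against the limit, drops comments, and closes whatever is still open at the cut. It relies on the standard library's html.parser; the names truncate_visible, _Truncator and VOID_TAGS are invented for the example, and it is an illustration under those assumptions rather than a tested solution (script and style contents, for instance, are counted as text).

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, for illustration).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr",
             "img", "input", "link", "meta", "source", "wbr"}


class _Truncator(HTMLParser):
    def __init__(self, limit: int):
        super().__init__(convert_charrefs=True)
        self.remaining = limit   # visible characters still allowed
        self.out = []            # pieces of the output snippet
        self.stack = []          # names of currently open elements

    def handle_starttag(self, tag, attrs):
        if self.remaining <= 0:
            return
        # Re-emit the tag exactly as it appeared in the input.
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        if self.remaining > 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.remaining <= 0 or tag not in self.stack:
            return
        # Close everything up to and including the matching opener.
        while True:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.remaining <= 0:
            return
        piece = data[:self.remaining]
        self.remaining -= len(piece)
        # Text arrives unescaped, so escape it again on the way out.
        self.out.append(escape(piece, quote=False))


def truncate_visible(html: str, limit: int) -> str:
    """Keep at most `limit` characters of visible text, with tags closed."""
    parser = _Truncator(limit)
    parser.feed(html)
    parser.close()
    closers = "".join(f"</{tag}>" for tag in reversed(parser.stack))
    return "".join(parser.out) + closers
```

Because comments never reach the output and never count against the limit, a long comment at the start of the string no longer eats the whole budget.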
The most hack-like way would be to purely ensure that you don't break in
the middle of an actual tag (whether it be an opening or closing tag),
and then wrap it in another tag and let the browser take care of closing
unclosed tags.
-Stut
--
http://stut.net/
--
PHP General Mailing List (http://www.php.net/)