Re: generating an html intro text ...

Jochem Maas <jochem@xxxxxxxxxxxxx> · Mon, 18 Jun 2007 15:51:35 +0200

Stut wrote:
> Edward Kay wrote:

...

>>
>> A few thoughts I've had on this problem:
>>
>> Assuming it is well formed HTML, you could use a stack. Parse the string
>> putting all opening tags on the stack and then removing them when the
>> close tag is found. This will leave you with all the un-closed tags and
>> in the correct order.
>>
>> Remember that some tags can be self closing (<br /> etc).

I'd rather not go down this route, I think, because I feel it's too
error prone given, for instance, the kind of crap one might expect as input.

>>
>> Could you use one of the XML extensions to build a DOM of the snippet?
>> I've not used them much myself, but I'm guessing they could handle most
>> of the tag stripping and matching...

indeed, I am leaning towards an XML extension and/or Tidy as the basis for
some kind of solution.

> 
> I'm sure Tidy could be employed to do this job. Grab your target length
> of text, backtrack until you find < or >. If it's a < then chop that bit
> off. Then give it to Tidy to fix the HTML. That should close off any
> open tags and give you a properly formed snippet.
> 
> Or you could implement the same functionality yourself quite easily, and
> it doesn't need to be well-formed HTML but you would need to check for
> tags that don't need closing (img, br, etc).

Stut, I'm going to go out on a limb and say 'bullshit' to those comments,
no personal offence meant (I mostly value your input *a lot*, but I guess we
can't agree all the time):

1. simply backtracking to a '<' or '>' doesn't account for unescaped output
within the content pieces/parts of the html, e.g.:

	<P>look this here character '<' amounts to invalid html</P>

2. backtracking in the manner you describe does not account for a large amount of
html at the beginning of the string - remember that the aim of the game is
to truncate the user-visible content of the html string in question
(the comment below would be required to be part of the output - think html comment
spam, which is quite effective and which some clients state as a requirement):

	<!-- imagine this comment is 1000s of chars long --><P>look at this</P>
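
To make both objections concrete, here is a rough sketch of the backtracking
idea as I read it. It is written in Python purely for brevity; the name
naive_truncate is made up for illustration and this is my interpretation of
the suggestion, not anyone's actual code:

```python
def naive_truncate(html, limit):
    """Cut at `limit` characters, then backtrack to the nearest '<' or '>'.

    If the nearest one is a '<', the cut is assumed to have landed inside a
    tag, so everything from that '<' onwards is chopped off.
    """
    cut = html[:limit]
    last_open = cut.rfind("<")
    last_close = cut.rfind(">")
    if last_open > last_close:
        # Looks like a half-cut tag, but it may just be a literal '<' in the text.
        return cut[:last_open]
    return cut


# Objection 1: a literal '<' in the content is mistaken for the start of a tag,
# so the visible text after it is thrown away.
stray = "<P>look this here character '<' amounts to invalid html</P>"
print(repr(naive_truncate(stray, 40)))

# Objection 2: the limit is spent on markup the user never sees, so no
# visible text survives at all.
spam = "<!-- " + "x" * 1000 + " --><P>look at this</P>"
print(repr(naive_truncate(spam, 40)))
```

In the first case the result stops dead at the quote before the stray '<';
in the second it is an empty string, because the whole budget went on the
comment.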

you say 'you could implement the same functionality yourself quite easily' - if
that were true then why does nobody have a ready-made solution? it's not as if the
requirement is 'way out there' as far as web development goes. I don't think creating routines
for doing advanced manipulation of XML/HTML strings 'properly' is something very simple,
something that can be attested to by the relative complexity of libtidy and libxml2.

I'm not looking for a hack job here, I need something that can be relied on to output
html strings that will not break the validity (& layout) of a page regardless of the
quality of input.
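
For what it's worth, the shape of what I mean by a parser-driven approach is
roughly the following. This is only a sketch under assumptions: it uses
Python's standard html.parser as a stand-in for Tidy or libxml2 (it is
neither), the names IntroBuilder, make_intro and VOID_TAGS are invented for
illustration, and it makes no attempt to deal with script/style content or
badly nested tags:

```python
from html import escape
from html.parser import HTMLParser

# Tags that never take a closing tag (partial list, for illustration only).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class IntroBuilder(HTMLParser):
    """Copies markup through, but counts only user-visible text."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit   # visible characters still allowed
        self.out = []            # pieces of the output string
        self.open_tags = []      # stack of tags awaiting a close
        self.done = False        # set once the text budget is used up

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return  # ignore stray closing tags
        # Close everything up to and including the matching tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_comment(self, data):
        # Comments are kept in the output but cost nothing.
        if not self.done:
            self.out.append("<!--%s-->" % data)

    def handle_data(self, data):
        if self.done:
            return
        if len(data) >= self.remaining:
            data = data[:self.remaining]
            self.done = True
        self.remaining -= len(data)
        # Re-escape so a literal '<' in the text cannot break the page.
        self.out.append(escape(data, quote=False))

    def result(self):
        # Close whatever is still open, innermost first.
        closers = "".join("</%s>" % t for t in reversed(self.open_tags))
        return "".join(self.out) + closers


def make_intro(html, limit):
    builder = IntroBuilder(limit)
    builder.feed(html)
    builder.close()
    return builder.result()


print(make_intro("<!-- spam --><p>look at this</p>", 7))
```

The point of the design is that the length limit is applied to text nodes
only, comments pass through uncounted, and open tags are closed from a stack
kept by the parser rather than guessed at by scanning for '<' and '>'.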

the more I think about it the more I'm inclined to think that Tidy is about to become
a close friend. :-)

> 
> -Stut
> 

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php