RE: generating an html intro text ...

"Edward Kay" <edward@xxxxxxxxxx> · Mon, 18 Jun 2007 14:07:28 +0100

> -----Original Message-----
> From: Jochem Maas [mailto:jochem@xxxxxxxxxxxxx]
> Sent: 18 June 2007 13:18
> To: tedd
> Cc: [php] PHP General List
> Subject: Re:  generating an html intro text ...
>
>
> tedd wrote:
> > At 11:39 AM +0200 6/14/07, Jochem Maas wrote:
> >> original string:
> >>
>
> ...
> >
> > The problem as I see it is covering all the possibilities that may occur
> > even if the text is well formed. Like what if someone introduces a span
> > that sets a color for a paragraph, such as:
> >
> > <span style="color: yellow;">Dolore magna aliquam erat volutpat ut wisi
> > enim ad minim veniam quis nostrud. Consectetuer adipiscing elit sed diam
> > nonummy nibh euismod tincidunt ut laoreet exerci tation ullamcorper
> > suscipit lobortis! <b>Decima eodem modo </b>typi qui nunc nobis videntur
> > parum clari fiant sollemnes in.</span>
> >
> > And the </b> tag as well as the </span> tag is outside the 256 limit?
> >
> > You would have to search out and pull in all closing tags.
> >
> > So, I guess an algorithm could be:
>
> roughly speaking yes this is what it would do, except:
>
> >
> > First, grab 256 characters -- The string. If The string is shorter, then
> > quit.
>
> the algo should only be counting 'content characters', i.e.
> anything that is
> html markup should not go towards the string length count,
> additionally html entities
> such as '&amp;' should be considered as a single character.
>
> >
> > Second, determine what tags are not closed.
> >
> > Third, create closing tags and add them to the end of The string (in
> > proper order).
> >
> > Fourth, then remove the same number of non-html characters from the end
> > of The string.
>
> what the code should do (more or less) is quite clear - writing something
> flexible & robust to actually do it (and do it fast) is quite
> another matter.
>
> I have been looking at Edward Vermillon's code but I suspect that
> what he sent
> me is not quite what I'm looking for for a number of reasons:
>
> 1. it deals primarily with custom bbcode like markup
> 2. I have a couple of doubts about the handling of html entities
> 3. performance
>
> that said I still have to look at it in depth before making any real
> conclusions as to its viability (and/or the possibility to rework the
> code to fit my needs).
>
> I'm also looking at an alternative whereby I go through the
> string and truncate it at the character (or characters that
> represent an html entity) that represents the Nth 'content character'
> and then feed the truncated string to the Tidy extension and let it
> figure out the html cleaning part ... does anyone have experience
> using Tidy to clean (make valid) html snippets that they would
> like to share?
>
>

A few thoughts I've had on this problem:

Assuming the HTML is well formed, you could use a stack. Parse the string,
pushing each opening tag onto the stack and popping it when its closing tag
is found. Whatever is left on the stack when you reach the cut-off point is
the set of unclosed tags, and popping them gives the order in which to close
them.

Remember that some tags are self-closing (<br /> etc.) and should never go
on the stack.
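
To make the stack idea concrete, here is a minimal sketch. It is written in
Python purely for illustration (the same logic should port to PHP), and the
names OpenTagTracker, closing_tags and VOID_TAGS are made up for this
example. It assumes the input is well formed up to the cut-off point, and
the list of void elements is only a partial one.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, an assumption).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class OpenTagTracker(HTMLParser):
    """Remember which tags are still open after parsing a snippet."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        # Push every opening tag except void elements such as <br>.
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # <br /> style tags open and close at once: nothing to push.
        pass

    def handle_endtag(self, tag):
        # Well-formed input: the closing tag matches the top of the stack.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


def closing_tags(snippet):
    """Return the closing tags needed to balance a truncated snippet."""
    tracker = OpenTagTracker()
    tracker.feed(snippet)
    tracker.close()
    # Close the innermost (most recently opened) tag first.
    return "".join("</%s>" % tag for tag in reversed(tracker.stack))
```

Appending the result of closing_tags() to the truncated string would balance
it. Input that is not well formed would need extra handling that this sketch
does not attempt.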

Could you use one of the XML extensions to build a DOM of the snippet? I've
not used them much myself, but I'm guessing they could handle most of the
tag stripping and matching...
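
As a rough illustration of the DOM route, the sketch below parses the snippet
into a tree, keeps only the first N characters of text, and lets the
serialiser write the closing tags. It uses Python's xml.dom.minidom purely
for illustration; PHP's DOM extension would play the same role. The function
name truncate_markup and the <root> wrapper are made up for this example.
It assumes the snippet is well-formed XHTML and uses only the five
predefined XML entities (something like &nbsp; would make this parser fail).

```python
from xml.dom import minidom


def truncate_markup(snippet, limit):
    """Keep the first `limit` text characters of a well-formed snippet."""
    # Wrap in a hypothetical <root> element so there is a single root node.
    doc = minidom.parseString("<root>%s</root>" % snippet)
    remaining = limit

    def walk(node):
        nonlocal remaining
        for child in list(node.childNodes):
            if remaining <= 0:
                # Budget used up: drop everything after the cut point.
                node.removeChild(child)
            elif child.nodeType == child.TEXT_NODE:
                # Entities are already decoded, so '&amp;' counts as one.
                child.data = child.data[:remaining]
                remaining -= len(child.data)
            elif child.nodeType == child.ELEMENT_NODE:
                walk(child)

    walk(doc.documentElement)
    # Serialising the tree emits every closing tag in the right order.
    return "".join(c.toxml() for c in doc.documentElement.childNodes)
```

Because only text nodes are counted, markup does not eat into the character
budget, and an entity counts as the single character it stands for.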

HTH,
Edward
