Re: generating an html intro text ...

Edward Kay wrote:

-----Original Message-----
From: Jochem Maas [mailto:jochem@xxxxxxxxxxxxx]
Sent: 18 June 2007 13:18
To: tedd
Cc: [php] PHP General List
Subject: Re:  generating an html intro text ...


tedd wrote:
At 11:39 AM +0200 6/14/07, Jochem Maas wrote:
original string:

...
The problem as I see it is covering all the possibilities that may occur
even if the text is well formed. Like what if someone introduces a span
that sets a color for a paragraph, such as:

<span style="color: yellow;">Dolore magna aliquam erat volutpat ut wisi enim
ad minim veniam quis nostrud. Consectetuer adipiscing elit sed diam
nonummy nibh euismod tincidunt ut laoreet exerci tation ullamcorper
suscipit lobortis! <b>Decima eodem modo </b>typi qui nunc nobis videntur
parum clari fiant sollemnes in.</span>

And what if the </b> tag as well as the </span> tag fall outside the 256 limit?

You would have to search out and pull in all closing tags.

So, I guess an algorithm could be:
roughly speaking, yes, this is what it would do, except:

First, grab 256 characters -- The string. If The string is shorter, then
quit.
the algo should only be counting 'content characters', i.e. anything that
is html markup should not go towards the string length count; additionally,
html entities such as '&amp;' should be considered as a single character.

Second, determine what tags are not closed.

Third, create closing tags and add them to the end of The string (in
proper order).

Fourth, then remove the same number of non-html characters from the end
of The string.
what the code should do (more or less) is quite clear - writing something
flexible & robust to actually do it (and do it fast) is quite another
matter.

I have been looking at Edward Vermillon's code, but I suspect that what he
sent me is not quite what I'm looking for, for a number of reasons:

1. it deals primarily with custom bbcode like markup
2. I have a couple of doubts about the handling of html entities
3. performance

that said, I still have to look at it in depth before drawing any real
conclusions as to its viability (and/or the possibility of reworking the
code to fit my needs).

I'm also looking at an alternative whereby I go through the string,
truncate it at the character (or the characters that represent an html
entity) that represents the Nth 'content character', and then feed the
truncated string to the Tidy extension and let it figure out the html
cleaning part ... does anyone have experience using Tidy to clean (make
valid) html snippets that they would like to share?
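
For illustration only, the truncation step described above might be sketched roughly as follows. This is not code from the thread: the function name truncateContent() is made up, and the sketch assumes a single-byte encoding and that a '<' in the input only ever starts a tag. Markup is copied through without being counted, and an entity such as '&amp;' counts as one character.

```php
<?php
// Hypothetical helper, not code from this thread.
// Assumes a single-byte encoding and that '<' only ever starts a tag.
function truncateContent(string $html, int $limit): string
{
    // Split the input into whole tags, whole entities, or single characters.
    $pattern = '/<[^>]*>|&(?:#\d+|#x[0-9a-f]+|[a-z][a-z0-9]*);|./is';
    preg_match_all($pattern, $html, $matches);

    $out = '';
    $count = 0; // number of content characters kept so far
    foreach ($matches[0] as $token) {
        // A token longer than one character that starts with '<' is a tag.
        $isTag = strlen($token) > 1 && $token[0] === '<';
        if (!$isTag) {
            if ($count === $limit) {
                break; // the next content character would exceed the limit
            }
            $count++; // an entity is one token, so it counts once
        }
        $out .= $token; // tags are copied through without being counted
    }
    return $out;
}

echo truncateContent('<p>Tom &amp; <b>Jerry</b> run</p>', 7), "\n";
```

Tags that directly follow the last kept character are still copied, so an adjacent closing tag is picked up for free, but an adjacent opening tag is too. The result is generally not valid html on its own and still needs its open tags closed, by Tidy or otherwise.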


A few thoughts I've had on this problem:

Assuming it is well formed HTML, you could use a stack. Parse the string
putting all opening tags on the stack and then removing them when the close
tag is found. This will leave you with all the un-closed tags and in the
correct order.

Remember that some tags can be self closing (<br /> etc).
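
A minimal sketch of the stack idea, assuming well-formed nesting as stated above. The function name closingTagsFor() and the list of tags that never take a closing tag are illustrative choices, not something taken from this thread.

```php
<?php
// Hypothetical helper, not code from this thread. Assumes well-formed nesting.
function closingTagsFor(string $html): string
{
    // Elements that never take a closing tag (assumed list, extend as needed).
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    // Capture: optional leading slash, tag name, optional trailing slash.
    $pattern = '/<(\/?)([a-z][a-z0-9]*)\b[^>]*?(\/?)>/i';
    preg_match_all($pattern, $html, $tags, PREG_SET_ORDER);

    $stack = [];
    foreach ($tags as [, $slash, $name, $selfClose]) {
        $name = strtolower($name);
        if ($selfClose === '/' || in_array($name, $void, true)) {
            continue; // <br />, <img ...> and friends never go on the stack
        }
        if ($slash === '/') {
            // Closing tag: pop only if it matches the innermost open tag.
            if (end($stack) === $name) {
                array_pop($stack);
            }
        } else {
            $stack[] = $name; // opening tag
        }
    }

    // Whatever is left is unclosed; close innermost first.
    $closing = '';
    while ($stack) {
        $closing .= '</' . array_pop($stack) . '>';
    }
    return $closing;
}

echo closingTagsFor('<p><span style="color: yellow;">Dolore <b>magna<br />'), "\n";
```

Appending the returned string to the truncated snippet closes the open tags in the correct order. A closing tag that does not match the top of the stack is simply ignored here, which is only reasonable under the well-formed assumption.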

Could you use one of the XML extensions to build a DOM of the snippet? I've
not used them much myself, but I'm guessing they could handle most of the
tag stripping and matching...
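
As a rough sketch of the DOM route, assuming PHP's DOM extension is available: load the snippet with DOMDocument::loadHTML(), which tolerates unclosed tags, then serialise the children of the generated body element. The function name closeWithDom() is made up for illustration.

```php
<?php
// Hypothetical helper, not code from this thread; needs PHP's DOM extension.
function closeWithDom(string $snippet): string
{
    if ($snippet === '' || !class_exists('DOMDocument')) {
        return $snippet; // nothing to do, or DOM extension not loaded
    }

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($snippet);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser wraps the fragment in html/body; only the body content is wanted.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $snippet;
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child); // serialise each node with its tags closed
    }
    return trim($out);
}

echo closeWithDom('<p>Some <b>bold text'), "\n";
```

Two caveats: the parser may restructure the markup (for example by wrapping bare text in a paragraph), and loadHTML() does not assume UTF-8 by default, so non-ASCII input needs extra care.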

I'm sure Tidy could be employed to do this job. Grab your target length of text, backtrack until you find < or >. If it's a < then chop that bit off. Then give it to Tidy to fix the HTML. That should close off any open tags and give you a properly formed snippet.
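
Something along these lines, as a rough sketch. The function names chopPartialTag() and repairSnippet() are made up, and the Tidy step assumes the tidy extension is loaded; without it the snippet is returned unrepaired.

```php
<?php
// Hypothetical helpers, not code from this thread.

// Cut to $length bytes, then drop a tag that was sliced in half, if any.
function chopPartialTag(string $html, int $length): string
{
    $cut = substr($html, 0, $length);
    $lastOpen = strrpos($cut, '<');
    $lastClose = strrpos($cut, '>');
    // A '<' with no '>' after it means the cut landed inside a tag.
    if ($lastOpen !== false && ($lastClose === false || $lastOpen > $lastClose)) {
        $cut = substr($cut, 0, $lastOpen);
    }
    return $cut;
}

// Hand the snippet to Tidy, asking for the body content only.
function repairSnippet(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html; // tidy extension not loaded: leave the snippet as it is
    }
    $fixed = tidy_repair_string($html, ['show-body-only' => true], 'utf8');
    return $fixed === false ? $html : trim($fixed);
}

$snippet = chopPartialTag('<p>Hello <b>world</b></p>', 11); // '<p>Hello '
echo repairSnippet($snippet), "\n";
```

Note that this cuts by raw length, so markup counts towards the limit, and a cut could still land in the middle of an entity such as '&amp;'.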

Or you could implement the same functionality yourself quite easily, and it doesn't need to be well-formed HTML but you would need to check for tags that don't need closing (img, br, etc).

-Stut

--
http://stut.net/

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

