Re: generating an html intro text ...

Edward Vermillion <evermillion@xxxxxxxxxxxx> · Thu, 14 Jun 2007 06:14:36 -0500

On Jun 14, 2007, at 4:39 AM, Jochem Maas wrote:

hi list,

having searched and not found anything useful, I was wondering if anyone
here had a decent routine for doing the following:

given a relatively long text containing html I need to generate an
'intro' version of this string containing a given number of display
characters (e.g. 256) that still contains the relevant valid html ...
basically I'm looking for something that does content truncation but
takes into account possible html and htmlentities that may be part of
the content.

an example (chances are what I'm asking is not wholly clear):

original string:

	"My name is <b>charlie brown</b><i>!</i> &amp; I'm a little odd.";

shorten text (32 'letters' required):

	"My name is <b>charlie brown</b><i>!</i> &amp; I'm ";

the 32 'letter' length should therefore ignore the B and I tags and
treat the &amp; as a single letter ... additionally when truncation
occurs within a set of html tags the resulting string should have all
the open html tags properly closed.

this is not as simple as it may first seem, I could probably do it but I
foresee it taking quite some time (which I don't have ... let's all sing
'deadline' together shall we ;-)), in the past I have attempted such a
routine but always ended up doing something much simpler (using
strip_tags(), etc) due to time constraints.
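
As a rough illustration of that simpler route, the sketch below drops all
markup and cuts the remaining text. The function name plain_intro() is
made up for this example, and it assumes PHP 7+ and UTF-8 input; the
formatting is lost, which is exactly the drawback described above.

```php
<?php
// Hypothetical helper (not a built-in): plain-text intro without markup.
function plain_intro(string $html, int $limit): string
{
    // Remove tags, then decode entities so "&amp;" counts as one character.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // Take the first $limit characters (not bytes) using a UTF-8 aware regex.
    if (!preg_match('/^.{0,' . $limit . '}/us', $text, $m)) {
        return ''; // invalid UTF-8 or regex failure
    }

    // Re-encode so the result is safe to echo into a page.
    return htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
}
```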

I figure I'm not the only one who has had the requirement to do sensible
truncation of html content, and I'm hoping someone might have a routine
or know where I can find one.

apologies if I have not been searching well enough - part of my problem
is likely to be that I don't really know what search terms to use :-/

anyway if anyone has any solid code or knows of any I'd be very grateful.

kind regards,
Jochem

I just wrote a fairly simple routine to do this with BB style tags a  
few weeks ago. I'm not sure if it could be adapted for real html or  
not. Basically it does a character by character check of the text and  
keeps track of the opening and closing tags and only counts the  
content. So it could be extremely inefficient for large text blocks,  
although profiling a few tests on a very quiet development server  
didn't look too bad.

There's no entity checking, and odd nested tags  
(<tag1>blah<tag2>blah</tag1></tag2>) just get closed at the point the  
oddity is discovered, which could mean that the summary looks  
different from the actual text.
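
For real html, the same idea might look roughly like the sketch below. It
is not the BB-tag routine mentioned above, only an illustration under
stated assumptions: the function name html_intro() and the short list of
void elements are made up for this example, PHP 7+ and UTF-8 input are
assumed, and the input is expected to be reasonably well formed (a
literal "<" written as &lt;, no ">" inside attribute values). Unlike the
routine described above, it treats an entity as a single letter. Wrongly
nested tags are closed at the point the mismatch is found.

```php
<?php
// Hypothetical helper: truncate $html to $limit visible characters
// while keeping the tags balanced.
function html_intro(string $html, int $limit): string
{
    // Elements that never take a closing tag (deliberately incomplete).
    $void = ['br', 'hr', 'img', 'input', 'meta', 'link'];

    // Split into tags, entities, and single UTF-8 characters.
    $pattern = '/<[^>]+>|&(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);|./isu';
    preg_match_all($pattern, $html, $m);

    $out = '';
    $open = [];  // stack of currently open tag names
    $count = 0;  // visible characters emitted so far

    foreach ($m[0] as $tok) {
        if ($count >= $limit) {
            break;
        }
        if ($tok[0] !== '<' || strlen($tok) === 1) {
            // Visible text: one character or one entity counts as one letter.
            $out .= $tok;
            $count++;
            continue;
        }
        if (!preg_match('~^<(/?)([a-z][a-z0-9]*)~i', $tok, $t)) {
            continue; // comments, doctypes etc. are dropped in this sketch
        }
        $name = strtolower($t[2]);
        if ($t[1] === '/') {
            if (!in_array($name, $open, true)) {
                continue; // stray closing tag, ignore it
            }
            // Close everything opened since the matching opening tag.
            do {
                $top = array_pop($open);
                $out .= "</$top>";
            } while ($top !== $name);
        } else {
            $out .= $tok;
            if (!in_array($name, $void, true) && substr($tok, -2) !== '/>') {
                $open[] = $name;
            }
        }
    }

    // Close whatever is still open, innermost first.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

echo html_intro('<b>HELLO</b>, my name is charlie brown', 8), "\n";
// prints: <b>HELLO</b>, m
```

Because every character is handled as a separate token, this is no
faster than a hand-written character loop, so the same caveat about
large text blocks applies.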

If you can't find anything else, and you think this might be useful  
to you, let me know and I can send you what I have.

Ed
