Re: generating an html intro text ...

"Daniel Brown" <parasane@xxxxxxxxx> · Thu, 14 Jun 2007 13:04:57 -0400

On 6/14/07, Edward Vermillion <evermillion@xxxxxxxxxxxx> wrote:

On Jun 14, 2007, at 4:39 AM, Jochem Maas wrote:

> hi list,
>
> having searched and not found anything useful I was wondering if
> anyone here had a decent routine for doing the following:
>
> given a relatively long text containing html I need to generate
> an 'intro' version of this string containing a given number of
> display characters (e.g. 256) that still contains the relevant
> valid html ... basically I'm looking for something that does content
> truncation but takes into account possible html and htmlentities
> that may be part of the content.
>
> an example (chances are what I'm asking is not wholly clear):
>
> original string:
>
>       "<b>HELLO</b>, my name is charlie brown<i>!</i> &amp; I'm a little odd.";
>
> shorten text (32 'letters' required):
>
>       "My name is <b>charlie brown</b><i>!</i> &amp; I'm ";
>
> the 32 'letter' length should therefore ignore the B and I tags and
> treat the &amp; as a single letter ... additionally when truncation
> occurs within a set of html tags the resulting string should have all
> the open html tags properly closed.
>
> this is not as simple as it may first seem, I could probably do it
> but I foresee it taking quite some time (which I don't have ... let's
> all sing 'deadline' together shall we ;-)), in the past I have
> attempted such a routine but always ended up doing something much
> simpler (using strip_tags(), etc) due to time constraints.
>
> I figure I'm not the only one who has had the requirement to do
> sensible truncation of html content, and I'm hoping someone might
> have a routine or know where I can find one.
>
> apologies if I have not been searching well enough - part of my
> problem is likely to be that I don't really know what search terms
> to use :-/
>
> anyway if anyone has any solid code or knows of any I'd be very
> grateful.
>
> kind regards,
> Jochem

I just wrote a fairly simple routine to do this with BB style tags a
few weeks ago. I'm not sure if it could be adapted for real html or
not. Basically it does a character by character check of the text and
keeps track of the opening and closing tags and only counts the
content. So it could be extremely inefficient for large text blocks,
although profiling a few tests on a very quiet development server
didn't look too bad.

There's no entity checking, and odd nested tags
(<tag1>blah<tag2>blah</tag1></tag2>) just get closed at the point the
oddity is discovered, which could mean that the summary looks
different from the actual text.

If you can't find anything else, and you think this might be useful
to you, let me know and I can send you what I have.

Ed

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

   Okay, I took a couple of hours to throw something together, but my
testing was at the CLI level, so online it looks a little strange.  I
also added more than what you were asking for, but I may decide to use
this myself some day, so it made sense to just add it now.

   Original code is also available on my site:
       http://www.pilotpig.net/code-library/strcnt.php
       http://www.pilotpig.net/code-library/strcnt-source.php

   Here's the code:

<?php
// strcnt.php
#
# Daniel P. Brown (parasane@xxxxxxxxx) for Jochem Maas on the PHP-General list.
# 14 June, 2007 - 10:17 EDT
#

/*********************************************************************************\
*                                                                                 *
* NOTE: Only works with PHP >= 4.3.3, according to http://www.php.net/str_replace *
*       I also may have gone a bit over what you asked, by adding a few           *
*       other functions into the mix.  Delete what you don't want to use.         *
*                                                                                 *
\*********************************************************************************/

function sanitize($str,$tags) {
   preg_match_all($tags,$str,$html_chars);
   return $html_chars[0];
}

// It *WAS* just going to be to count characters.... I swear!
// So that's the default.
function htmlstr($str,$mode='1') {
   /*
       MODES:
           1       Return count after removing HTML
           2       Return total count of HTML code, including angle brackets
           3       Return actual sanitized text
           4       Return difference in length between original (with HTML)
                   and sanitized (without HTML)
           5       Return count after removing HTML and translating HTML
                   special characters
           6       Return difference in length between original (with HTML
                   + chars) and sanitized (without)

       NOTES:
           There is no equivalent to Mode 3 to remove HTML special
           characters, because either:
               a.) It will be displayed on the web and translated in
                   the browser.
               b.) It will be displayed via another medium, and I
                   don't feel like translating everything!

           At one point, this had private and recursive function calls,
               but these were removed when I realized it broke.

           Recursive functions were re-added once I figured out what
           I screwed up.
   */

   $len = strlen($str);
   $open = "/(<([\w]+)[^>]*>)/";
   $close = "/(<(\/([\w]+)[^>]*)>)/";
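   // Match one entity such as &amp; or &#39; and stop at its semicolon;
   // a trailing [^>]* would swallow the rest of the tag-free text.
   $spec = "/&#?\w+;/";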

   switch($mode) {
       case '1':
           // This will return the count after removing the HTML
           return strlen(str_replace(
               array_merge(sanitize($str,$open),sanitize($str,$close)),'',$str));
       case '2':
           // This will return the total count of HTML code characters,
           // including the angle brackets
           return strlen(implode('',
               array_merge(sanitize($str,$open),sanitize($str,$close))));
       case '3':
           // This will return the actual sanitized text, HTML-less
           return str_replace(
               array_merge(sanitize($str,$open),sanitize($str,$close)),'',$str);
       case '4':
           // This will return the difference in length between the
           // original (with HTML) and the non-HTML
           return ($len - strlen(str_replace(
               array_merge(sanitize($str,$open),sanitize($str,$close)),'',$str)));
       case '5':
           // This will count the length of a string with all HTML special
           // characters translated (one character each), code removed
           return strlen(str_replace(
               sanitize(htmlstr($str,3),$spec)," ",htmlstr($str,3)));
       case '6':
           // This will return the difference between the original
           // length and the length of Mode 5
           return ($len - htmlstr($str,5));
   }
}

$str =<<<EOS
<H1>Hello!</H1><BR />
This is going to be a test of the <I>string-minus-HTML</I>
function file I'm attempting to create for a member of the
&quot;PHP General mailing list.&quot;  PHP is located here:
<a href="http://www.php.net/">http://www.php.net/</a>.  It
is a <B><I>very</I></B> powerful &amp; extensible language!<BR />
<BR />
<BLOCKQUOTE><< That's all, folks!</BLOCKQUOTE>
EOS;

echo "This is the original string:\n";
echo $str."\n";
echo "\n";

echo "This is the original string count:\n";
echo strlen($str)."\n";
echo "\n";

echo "This is the total length of the string with HTML tags removed:\n";
echo htmlstr($str)."\n";
echo "\n";

echo "This is the total count of characters used in the HTML tags:\n";
echo htmlstr($str,'2')."\n";
echo "\n";

echo "This is the actual sanitized text:\n";
echo htmlstr($str,'3')."\n";
echo "\n";

echo "This is the difference between string lengths with and without HTML tags:\n";
echo htmlstr($str,'4')."\n";
echo "\n";

echo "This is the length of the string with ALL HTML either stripped or translated:\n";
echo htmlstr($str,'5')."\n";
echo "\n";

echo "This is the difference in length between the original string and the COMPLETELY sanitized and translated string:\n";
echo htmlstr($str,'6')."\n";
echo "\n";
?>
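
   The script above counts characters but does not truncate.  For the
truncation step itself, here is a minimal sketch along the lines of the
token-by-token approach Ed describes.  It assumes PHP 7+ and UTF-8
input; the function name html_intro and the short list of void
elements are hypothetical, not from this thread, and HTML comments or
scripts are not handled:

```php
<?php
// Hypothetical helper, not from the thread. Assumes PHP 7+ and UTF-8 input.
function html_intro(string $html, int $limit): string
{
    // Elements that never get a closing tag (partial list, an assumption).
    $void = ['br', 'hr', 'img', 'input', 'meta', 'link'];

    // Split into tokens: a tag, an entity, or one single character.
    if (!preg_match_all('/<\/?\w+[^<>]*>|&#?\w+;|./su', $html, $m)) {
        return '';
    }

    $out   = '';
    $open  = [];  // stack of tag names still waiting for their close
    $count = 0;   // visible characters emitted so far

    foreach ($m[0] as $tok) {
        $isTag   = preg_match('/^<(\/?)(\w+)[^<>]*>$/u', $tok, $t) === 1;
        $isClose = $isTag && $t[1] === '/';

        // Once the limit is hit, only let closing tags through.
        if ($count >= $limit && !$isClose) {
            break;
        }

        if (!$isTag) {
            // A text character or an entity counts as one visible character.
            $out .= $tok;
            $count++;
            continue;
        }

        $name = strtolower($t[2]);
        if ($isClose) {
            // Keep a close only if it matches the innermost open tag.
            if (end($open) === $name) {
                array_pop($open);
                $out .= $tok;
            }
        } else {
            $out .= $tok;
            // Remember the tag unless it is void or self-closed.
            if (!in_array($name, $void, true) && substr($tok, -2) !== '/>') {
                $open[] = $name;
            }
        }
    }

    // Close whatever is still open, innermost first.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}
```

   With Jochem's original string and a limit of 32, this returns
"<b>HELLO</b>, my name is charlie brown<i>!</i>".  Closing tags that
do not match the innermost open tag are simply dropped, which is one
possible policy for badly nested markup, not the only one.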

   Hope it helps someone.... maybe even you, Jochem!
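
   As a rough cross-check on modes 3 and 5, PHP's built-in strip_tags()
and html_entity_decode() cover similar ground.  A small sketch,
assuming PHP 7+ and UTF-8 input; visible_length is a hypothetical name,
and strip_tags() may treat stray '<' characters differently from the
regexes in the script, so the two counts need not agree on every input:

```php
<?php
// Hypothetical helper, not from the thread: visible length via built-ins.
function visible_length(string $html): int
{
    // strip_tags() drops the markup; html_entity_decode() turns each
    // entity such as &amp; into the single character it stands for.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Count code points rather than bytes (assumes valid UTF-8 input).
    return (int) preg_match_all('/./su', $text);
}
```

   For example, visible_length('<b>Tom</b> &amp; Jerry') gives 11,
the length of "Tom & Jerry".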

--
Daniel P. Brown
[office] (570-) 587-7080 Ext. 272
[mobile] (570-) 766-8107
