Re: getting content excerpts from the database

Ashley Sheridan wrote:
> I've been thinking about this problem for a little while, and the thing
> is, I can think of ways of doing it, but they're not very nice, and I
> don't think they're going to be fast.
> 
> Basically, I have a load of HTML-formatted content in a database that
> gets displayed on the site. It's part of a rudimentary CMS.
> 
> Currently, the titles for each article are displayed on a page, and each
> title links to the full article. However, that leaves me with a page
> which is essentially a list of links, and that's not ideal for SEO. What
> I wanted to do to enhance the page is to have a short excerpt of x
> number of words/characters beneath each article title. The idea being
> that search engines will find the page as more than a link farm, and
> visitors won't have to just rely on the title alone for the content.
> 
> Here's the rub though. As the content is in HTML form, I can't just grab
> the first 100 characters and display them as that could leave an open
> tag without a closing one, potentially breaking the page. I could use
> strip_tags on the 100-character excerpt, but what if the excerpt itself
> broke a tag in half (i.e. <acronym title="something"> could become
> <acron )
> 
> The only solutions I can see are:
> 
> 
>       * retrieve the entire article, perform a strip_tags and then take
>         the excerpt
>       * use a regex inside of mysql to pull out only the text
> 
> 
> The thing is, neither of these seems particularly pretty, and I am sure
> there's a better way, but it's too early in the week for my brain to be
> fully functional I think!
> 
> Does anyone have any ideas about what I could do, or do you think I'm
> seeing problems where there are none?
> 
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk
> 

/**
 * creates an abstract from any string, a nice one that stops at a full
 * stop (after 120 chars) or at the end of a word (after 140 chars),
 * and never runs past 180 chars.
 *
 */
function createAbstract( $string )
{
	$lines = explode( "\n" , $string );
	if( count($lines) > 1 && strlen($lines[0]) > 140 ) {
		$string = $lines[0];
	}
	if( strlen($string) < 180 ) return $string;
	$string = substr( $string , 0 , 180);
	$chars = str_split( $string );
	$string = '';
	foreach( $chars as $char ) {
		$string .= $char;
		if( $char == '.' && strlen($string) > 120 ) {
			return $string;
		}
	}
	$string = '';
	foreach( $chars as $char ) {
		$string .= $char;
		if( $char == ' ' && strlen($string) > 140 ) {
			return trim( $string ) . '...';
		}
	}
	return $string;
}

/**
 * given HTML (or a fragment), tidy it into usable HTML
 * and strip it back to text, new lines intact
 *
 */
function htmlToText( $html )
{
  // normalise ampersands so each one is encoded exactly once
  $html = str_replace( '&' , '&amp;' , str_replace( '&amp;' , '&' , $html ) );
  $config = array(
    'clean' => true,
    'drop-proprietary-attributes' => true,
    'output-xhtml' => true,
    'show-body-only' => true,
    'word-2000' => true,
    'wrap' => '0'
    );
  $tidy = new tidy();
  $tidy->parseString($html, $config, 'utf8');
  $tidy->cleanRepair();
  $html = tidy_get_output($tidy);
  // was reading the undefined $text here; the tidied output is in $html
  $text = str_replace( '&' , '&amp;' , str_replace( '&amp;' , '&' , $html ) );
  return strip_tags($text);
}

Using those two together, i.e. createAbstract( htmlToText( $html ) ),
should do it; they're pretty basic and could do with a tidy-up, but they
get the job done (you'll probably want to change the 140 chars to
something different).
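
If the tidy extension isn't available, here is a minimal sketch of the
same idea in one step: strip the tags first, then cut at a word
boundary. It assumes the content is valid UTF-8; excerptFromHtml and its
$max parameter are made-up names for illustration, not part of the two
functions above.

```php
<?php
/**
 * Hypothetical helper: strip the markup first, then cut the plain text
 * at a word boundary. Assumes valid UTF-8 input.
 */
function excerptFromHtml( $html , $max = 140 )
{
	// strip tags before truncating, so no tag can be cut in half
	$text = html_entity_decode( strip_tags( $html ) , ENT_QUOTES , 'UTF-8' );
	// collapse the runs of whitespace left behind by removed tags
	$text = trim( preg_replace( '/\s+/u' , ' ' , $text ) );
	// longest prefix of at most $max characters that ends at a space
	// or at the end of the text
	if( !preg_match( '/^.{0,' . (int) $max . '}(?= |$)/us' , $text , $m ) ) {
		// no word boundary within $max characters: return the text uncut
		return $text;
	}
	// only add the ellipsis if something was actually cut off
	return $m[0] === $text ? $text : $m[0] . '...';
}

// the excerpt is plain text, so escape it again on output
$html = '<p>Some <acronym title="x">HTML</acronym> content</p>';
echo htmlspecialchars( excerptFromHtml( $html , 12 ) , ENT_QUOTES , 'UTF-8' );
```

This still means pulling the whole article out of the database, but
strip_tags on the full text is cheap compared to the query itself.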

Best,

Nathan

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

