Re: getting content excerpts from the database

Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> · Mon, 26 Apr 2010 12:23:53 +0100

On Mon, 2010-04-26 at 13:20 +0200, Peter Lind wrote:

> On 26 April 2010 12:52, Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> wrote:
> > I've been thinking about this problem for a little while, and the thing
> > is, I can think of ways of doing it, but they're not very nice, and I
> > don't think they're going to be fast.
> >
> > Basically, I have a load of HTML formatted content in a database that
> > gets displayed on the site. It's part of a rudimentary CMS.
> >
> > Currently, the titles for each article are displayed on a page, and each
> > title links to the full article. However, that leaves me with a page
> > which is essentially a list of links, and that's not ideal for SEO. What
> > I wanted to do to enhance the page is to have a short excerpt of x
> > number of words/characters beneath each article title. The idea being
> > that search engines will find the page as more than a link farm, and
> > visitors won't have to just rely on the title alone for the content.
> >
> > Here's the rub though. As the content is in HTML form, I can't just grab
> > the first 100 characters and display them, as that could leave an open
> > tag without a closing one, potentially breaking the page. I could use
> > strip_tags on the 100-character excerpt, but what if the excerpt itself
> > broke a tag in half (e.g. <acronym title="something"> could become
> > <acron)?
> >
> > The only solutions I can see are:
> >
> >
> >      * retrieve the entire article, perform a strip_tags and then take
> >        the excerpt
> >      * use a regex inside of mysql to pull out only the text
> >
> >
> > The thing is, neither of these seems particularly pretty, and I am sure
> > there's a better way, but it's too early in the week for my brain to be
> > fully functional I think!
> >
> > Does anyone have any ideas about what I could do, or do you think I'm
> > seeing problems where there are none?
> 
> Use htmltidy or htmlpurifier to clean up things. I.e. grab the amount
> of content you want, then use one of the tools to repair and clean the
> html.
> 
> Regards
> Peter
> 

Would that work on content that stopped mid-tag? Assuming the original
copy is:

<p>This is some sentence, with an <abbr title="Abbreviation">abbr</abbr>
in the middle of it.</p>

If I was asking for only the first 50 characters, I'd get this:

<p>This is some sentence, with an <abbr title="Abb

Would either htmltidy or htmlpurifier be able to handle that? I don't
mind whether it tries to repair the tag or remove it completely, as long
as it does something to it.
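
Whatever the tools do with a half-finished tag, one way to sidestep the
question is to trim any unterminated tag off the end of the fragment
before passing it on. The following is only a rough sketch: the helper
name drop_partial_tag is made up for illustration, the regex assumes
attribute values never contain a literal ">", and substr() assumes
single-byte text (mb_substr() would be the safer choice for UTF-8).

```php
<?php
// Hypothetical helper (not part of htmltidy or htmlpurifier): removes an
// unterminated tag left at the end of a truncated HTML fragment.
function drop_partial_tag(string $fragment): string
{
    // A '<' with no '>' between it and the end of the string means the
    // cut landed inside a tag, so everything from that '<' onwards goes.
    // Assumption: attribute values do not contain a literal '>'.
    return preg_replace('/<[^>]*$/', '', $fragment) ?? $fragment;
}

$original = '<p>This is some sentence, with an '
          . '<abbr title="Abbreviation">abbr</abbr> in the middle of it.</p>';

// First 50 characters: ends in the broken '<abbr title="Abb'.
$cut = substr($original, 0, 50);

// Broken tag removed; the opening <p> is still unclosed at this point.
$clean = drop_partial_tag($cut);

echo $clean, "\n";              // fragment to hand to a repair tool
echo strip_tags($clean), "\n";  // or a plain-text excerpt with no markup
```

After this step the fragment holds only complete tags, so a repair tool
has just the unclosed <p> left to deal with, and strip_tags() can no
longer be tripped up by half a tag.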

Thanks,
Ash
http://www.ashleysheridan.co.uk