Re: getting content excerpts from the database

Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> · Mon, 26 Apr 2010 12:54:12 +0100

On Mon, 2010-04-26 at 07:58 -0400, Phpster wrote:

> 
> On Apr 26, 2010, at 7:23 AM, Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> wrote:
> 
> > On Mon, 2010-04-26 at 13:20 +0200, Peter Lind wrote:
> >
> >> On 26 April 2010 12:52, Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> wrote:
> >>> I've been thinking about this problem for a little while, and the thing
> >>> is, I can think of ways of doing it, but they're not very nice, and I
> >>> don't think they're going to be fast.
> >>>
> >>> Basically, I have a load of HTML-formatted content in a database that
> >>> gets displayed on the site. It's part of a rudimentary CMS.
> >>>
> >>> Currently, the titles for each article are displayed on a page, and each
> >>> title links to the full article. However, that leaves me with a page
> >>> which is essentially a list of links, and that's not ideal for SEO. What
> >>> I wanted to do to enhance the page is to have a short excerpt of x
> >>> number of words/characters beneath each article title. The idea being
> >>> that search engines will see the page as more than a link farm, and
> >>> visitors won't have to just rely on the title alone for the content.
> >>>
> >>> Here's the rub though. As the content is in HTML form, I can't just grab
> >>> the first 100 characters and display them, as that could leave an open
> >>> tag without a closing one, potentially breaking the page. I could use
> >>> strip_tags on the 100-character excerpt, but what if the excerpt itself
> >>> broke a tag in half (e.g. <acronym title="something"> could become
> >>> <acron)?
> >>>
> >>> The only solutions I can see are:
> >>>
> >>>
> >>>     * retrieve the entire article, perform a strip_tags and then take
> >>>       the excerpt
> >>>     * use a regex inside of mysql to pull out only the text
> >>>
> >>>
> >>> The thing is, neither of these seems particularly pretty, and I am sure
> >>> there's a better way, but it's too early in the week for my brain to be
> >>> fully functional I think!
> >>>
> >>> Does anyone have any ideas about what I could do, or do you think I'm
> >>> seeing problems where there are none?
> >>
> >> Use htmltidy or htmlpurifier to clean up things. I.e. grab the amount
> >> of content you want, then use one of the tools to repair and clean the
> >> HTML.
> >>
> >> Regards
> >> Peter
> >>
> >> -- 
> >> <hype>
> >> WWW: http://plphp.dk / http://plind.dk
> >> LinkedIn: http://www.linkedin.com/in/plind
> >> Flickr: http://www.flickr.com/photos/fake51
> >> BeWelcome: Fake51
> >> Couchsurfing: Fake51
> >> </hype>
> >>
> >
> >
> > Would that work on content that stopped mid-tag? Assuming the original
> > copy is:
> >
> > <p>This is some sentence, with an <abbr title="Abbreviation">abbr</abbr>
> > in the middle of it.</p>
> >
> > If I was asking for only the first 50 characters, I'd get this:
> >
> > <p>This is some sentence, with an <abbr title="Abb
> >
> > Would either htmltidy or htmlpurifier be able to handle that? I don't
> > mind whether it tries to repair the tag or remove it completely, as long
> > as it does something to it.
> >
> > Thanks,
> > Ash
> > http://www.ashleysheridan.co.uk
> >
> >
> 
> When looking at the performance side of things, couldn't you add another
> column to the table and do this work to tidy / strip tags during the
> insert going forward?
> 
> Any existing data would need a one-time script to clean / tidy it. You
> could run this on a nightly cron (depending on how much data there is)
> until the new column is filled with clean data.
> 
> Bastien
> 

That's not a bad idea actually, I hadn't thought of it! I'm kicking
myself now, because it's such an obvious solution!
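
For the archive, here is a rough sketch of how the excerpt for that extra
column might be built at save time. It is only an illustration under some
assumptions: the function name make_excerpt is made up, it strips tags
from the whole article before cutting (so the cut can never land inside a
tag), and it counts bytes rather than characters, so multibyte content
would probably want the mb_* equivalents instead.

```php
<?php
// Hypothetical helper (not from the thread): build a plain-text excerpt
// from HTML content, intended to be stored in a separate excerpt column.
function make_excerpt($html, $maxChars = 100)
{
    // Strip tags from the WHOLE article first, then decode entities so
    // that something like &amp; counts as one character, not five.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // Collapse the runs of whitespace that markup tends to leave behind.
    $text = trim(preg_replace('/\s+/', ' ', $text));

    // Short enough already: nothing to cut.
    if (strlen($text) <= $maxChars) {
        return $text;
    }

    // Look one character past the limit so that a word ending exactly
    // at the limit is kept, then back up to the last space.
    $window = substr($text, 0, $maxChars + 1);
    $lastSpace = strrpos($window, ' ');
    $cut = ($lastSpace === false)
        ? substr($text, 0, $maxChars)    // one very long word: hard cut
        : substr($text, 0, $lastSpace);  // cut on a word boundary

    // Drop trailing punctuation and mark the truncation.
    return rtrim($cut, ' ,;:.') . '...';
}
```

The same function could serve both the insert path and the one-time
script for existing rows. Because entities are decoded, the stored
excerpt would need to go through htmlspecialchars() when it is printed.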

Thanks,
Ash
http://www.ashleysheridan.co.uk