On Apr 26, 2010, at 7:23 AM, Ashley Sheridan
<ash@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Mon, 2010-04-26 at 13:20 +0200, Peter Lind wrote:
>
>> On 26 April 2010 12:52, Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx>
>> wrote:
>>> I've been thinking about this problem for a little while, and the
>>> thing is, I can think of ways of doing it, but they're not very
>>> nice, and I don't think they're going to be fast.
>>>
>>> Basically, I have a load of HTML-formatted content in a database
>>> that gets displayed on the site. It's part of a rudimentary CMS.
>>>
>>> Currently, the titles for each article are displayed on a page,
>>> and each title links to the full article. However, that leaves me
>>> with a page which is essentially a list of links, and that's not
>>> ideal for SEO. What I wanted to do to enhance the page is to have a
>>> short excerpt of x number of words/characters beneath each article
>>> title. The idea being that search engines will find the page as
>>> more than a link farm, and visitors won't have to just rely on the
>>> title alone for the content.
>>>
>>> Here's the rub though. As the content is in HTML form, I can't
>>> just grab the first 100 characters and display them, as that could
>>> leave an open tag without a closing one, potentially breaking the
>>> page. I could use strip_tags on the 100-character excerpt, but what
>>> if the excerpt itself broke a tag in half (e.g. <acronym
>>> title="something"> could become <acron)?
>>>
>>> The only solutions I can see are:
>>>
>>>
>>> * retrieve the entire article, perform a strip_tags and then take
>>>   the excerpt
>>> * use a regex inside of MySQL to pull out only the text
>>>
>>>
>>> The thing is, neither of these seems particularly pretty, and I am
>>> sure there's a better way, but it's too early in the week for my
>>> brain to be fully functional I think!
>>>
>>> Does anyone have any ideas about what I could do, or do you think
>>> I'm seeing problems where there are none?
>>
>> Use htmltidy or htmlpurifier to clean up things. I.e. grab the
>> amount of content you want, then use one of the tools to repair and
>> clean the html.
>>
>> Regards
>> Peter
>>
>> --
>> <hype>
>> WWW: http://plphp.dk / http://plind.dk
>> LinkedIn: http://www.linkedin.com/in/plind
>> Flickr: http://www.flickr.com/photos/fake51
>> BeWelcome: Fake51
>> Couchsurfing: Fake51
>> </hype>
>>
>
>
> Would that work on content that stopped mid-tag? Assuming the
> original copy is:
>
> <p>This is some sentence, with an
> <abbr title="Abbreviation">abbr</abbr> in the middle of it.</p>
>
> If I was asking for only the first 50 characters, I'd get this:
>
> <p>This is some sentence, with an <abbr title="Abb
>
> Would either htmltidy or htmlpurifier be able to handle that? I
> don't mind whether it tries to repair the tag or remove it
> completely, as long as it does something to it.
>
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk
>
>
Looking at the performance side of things, couldn't you add another
column to the table and do the work to tidy / strip the tags at insert
time going forward?
The existing rows would need a one-time script to clean / tidy them.
You could run this on a nightly cron (depending on how much data there
is) until the new column is filled with clean data.
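
A minimal sketch of that insert-time step, assuming PHP 7+ with the
mbstring extension and UTF-8 content. The function name make_excerpt()
and its 100-character default are hypothetical, not anything from the
existing CMS. It strips the tags from the whole article before cutting,
so no tag can end up split in half, and it cuts at a word boundary:

```php
<?php
// Hypothetical helper: build a plain-text excerpt from HTML content.
// Assumes UTF-8 input and the mbstring extension.
function make_excerpt(string $html, int $maxChars = 100): string
{
    // Put a space before every tag so block elements such as
    // <p>One</p><p>Two</p> do not run together as "OneTwo".
    // Trade-off: inline tags inside a word also gain a space.
    $text = strip_tags(str_replace('<', ' <', $html));

    // Turn entities (&amp;, &quot;, ...) back into characters so they
    // are counted as one character each and never cut in half.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse runs of ASCII whitespace and trim the ends.
    $text = trim(preg_replace('/\s+/', ' ', $text));

    if (mb_strlen($text, 'UTF-8') <= $maxChars) {
        return $text;
    }

    // Cut to the limit, then back up to the last space so the
    // excerpt does not end in the middle of a word.
    $cut = mb_substr($text, 0, $maxChars, 'UTF-8');
    $lastSpace = mb_strrpos($cut, ' ', 0, 'UTF-8');
    if ($lastSpace !== false) {
        $cut = mb_substr($cut, 0, $lastSpace, 'UTF-8');
    }

    return $cut . '...';
}
```

The result is decoded plain text, so it would need to go through
htmlspecialchars() when it is printed on the listing page. The same
function could serve both the insert code and the one-time script that
fills the new column for the existing rows.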
Bastien
Sent from my iPod