On Apr 26, 2010, at 7:23 AM, Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> wrote:
On Mon, 2010-04-26 at 13:20 +0200, Peter Lind wrote:
On 26 April 2010 12:52, Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> wrote:
I've been thinking about this problem for a little while, and the thing
is, I can think of ways of doing it, but they're not very nice, and I
don't think they're going to be fast.
Basically, I have a load of HTML formatted content in a database that
gets displayed onto the site. It's part of a rudimentary CMS.
Currently, the titles for each article are displayed on a page, and each
title links to the full article. However, that leaves me with a page
which is essentially a list of links, and that's not ideal for SEO. What
I wanted to do to enhance the page is to have a short excerpt of x
number of words/characters beneath each article title. The idea being
that search engines will find the page as more than a link farm, and
visitors won't have to just rely on the title alone for the content.
Here's the rub though. As the content is in HTML form, I can't just grab
the first 100 characters and display them, as that could leave an open
tag without a closing one, potentially breaking the page. I could use
strip_tags on the 100-character excerpt, but what if the excerpt itself
broke a tag in half (e.g. <acronym title="something"> could become
<acron)?
The only solutions I can see are:
  * retrieve the entire article, perform a strip_tags and then take
    the excerpt
  * use a regex inside of MySQL to pull out only the text
The thing is, neither of these seems particularly pretty, and I am sure
there's a better way, but it's too early in the week for my brain to be
fully functional I think!
Does anyone have any ideas about what I could do, or do you think I'm
seeing problems where there are none?
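For reference, the first option in the list above (strip the tags from
the whole article, then take the excerpt) can be sketched roughly as
follows. This is only an illustration: make_excerpt() is a hypothetical
helper name, the 100-character default simply mirrors the figure
mentioned above, and the type declarations assume PHP 7 or later.

```php
<?php
// Sketch only: make_excerpt() is a hypothetical helper, not part of PHP.
function make_excerpt(string $html, int $limit = 100): string
{
    // Remove the markup first, so the cut can never land inside a tag.
    $text = strip_tags($html);
    // Decode entities such as &amp; so they are not cut in half either.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace and trim the ends.
    $text = trim(preg_replace('/\s+/', ' ', $text));

    // Note: strlen/substr count bytes; for multibyte (UTF-8) content the
    // mb_* equivalents from the mbstring extension would be safer.
    if (strlen($text) <= $limit) {
        return $text;
    }

    // Cut at the limit, then back up to the last space to avoid half a word.
    $cut = substr($text, 0, $limit);
    $lastSpace = strrpos($cut, ' ');
    if ($lastSpace !== false) {
        $cut = substr($cut, 0, $lastSpace);
    }
    return $cut . '...';
}

$article = '<p>This is some sentence, with an '
    . '<abbr title="Abbreviation">abbr</abbr> in the middle of it.</p>';

// Escape again on output, because entities were decoded above.
echo htmlspecialchars(make_excerpt($article, 50)), "\n";
```

One caveat with this approach: strip_tags does not insert a space where
a block tag used to be, so text from adjacent paragraphs can end up
joined together unless the markup already has whitespace between them.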
Use htmltidy or htmlpurifier to clean things up, i.e. grab the amount
of content you want, then use one of the tools to repair and clean the
HTML.
Regards
Peter
Would that work on content that stopped mid-tag? Assuming the original
copy is:
<p>This is some sentence, with an <abbr title="Abbreviation">abbr</abbr>
in the middle of it.</p>
If I was asking for only the first 50 characters, I'd get this:
<p>This is some sentence, with an <abbr title="Abb
Would either htmltidy or htmlpurifier be able to handle that? I don't
mind whether it tries to repair the tag or remove it completely, as long
as it does something to it.
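Independently of how either tool treats a half-written tag (not
verified here), the dangling tag can be removed before the fragment is
passed on. A minimal sketch using the 50-character example above;
drop_partial_tag() is a hypothetical helper name, and the pattern
assumes no literal ">" appears inside an attribute value:

```php
<?php
// Hypothetical helper: removes an unterminated tag left at the end of a cut.
function drop_partial_tag(string $fragment): string
{
    // Match a "<" that is not followed by any ">" before the end of the string.
    return preg_replace('/<[^>]*$/', '', $fragment);
}

$original = '<p>This is some sentence, with an '
    . '<abbr title="Abbreviation">abbr</abbr> in the middle of it.</p>';

// The first 50 characters end in the broken tag: <abbr title="Abb
$cut = substr($original, 0, 50);

// The partial <abbr ...> is gone; the opening <p> is still unclosed.
$safe = drop_partial_tag($cut);

// Either repair $safe with a cleanup tool, or simply strip what is left.
echo strip_tags($safe), "\n";
```

This only deals with the tag broken at the cut point. Tags that were
opened earlier and never closed (the <p> here) still need a repair step
or strip_tags afterwards.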
Thanks,
Ash
http://www.ashleysheridan.co.uk
When looking at the performance side of things, couldn't you add
another column to the table and do this work to tidy / strip tags
during the insert going forward?
Any current data would need a one-time script to clean / tidy it. You
could run this on a nightly cron (depending on how much data there is)
until the new column is filled with clean data.
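A rough sketch of that idea, under several assumptions: the table and
column names (articles, body, excerpt) are made up, plain_excerpt() is
a placeholder for whichever tidy / strip step is chosen, and an
in-memory SQLite database (via the pdo_sqlite extension) stands in for
the real one so the example is self-contained.

```php
<?php
// Assumed schema: articles(id, body, excerpt). All names are hypothetical.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT, excerpt TEXT)');
$pdo->exec("INSERT INTO articles (body) VALUES ('<p>Old <b>article</b> stored earlier.</p>')");

// Placeholder excerpt logic; swap in the real tidy / strip step.
function plain_excerpt(string $html, int $limit = 100): string
{
    return substr(trim(strip_tags($html)), 0, $limit);
}

// Backfill: process a batch of rows that have no excerpt yet.
// Run repeatedly (e.g. from cron) until no rows are left.
$select = $pdo->prepare('SELECT id, body FROM articles WHERE excerpt IS NULL LIMIT :batch');
$select->bindValue(':batch', 500, PDO::PARAM_INT);
$select->execute();
$update = $pdo->prepare('UPDATE articles SET excerpt = :excerpt WHERE id = :id');
foreach ($select->fetchAll(PDO::FETCH_ASSOC) as $row) {
    $update->execute([':excerpt' => plain_excerpt($row['body']), ':id' => $row['id']]);
}

// Going forward: compute the excerpt once, at insert time.
$insert = $pdo->prepare('INSERT INTO articles (body, excerpt) VALUES (:body, :excerpt)');
$body = '<p>New article.</p>';
$insert->execute([':body' => $body, ':excerpt' => plain_excerpt($body)]);

foreach ($pdo->query('SELECT id, excerpt FROM articles') as $row) {
    echo $row['id'], ': ', $row['excerpt'], "\n";
}
```

The listing page then only needs to select the title and the excerpt
column, so no HTML processing happens at display time.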
Bastien