On Mon, 2010-04-26 at 07:58 -0400, Phpster wrote:

> On Apr 26, 2010, at 7:23 AM, Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> > On Mon, 2010-04-26 at 13:20 +0200, Peter Lind wrote:
> >
> >> On 26 April 2010 12:52, Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> wrote:
> >>> I've been thinking about this problem for a little while, and the thing
> >>> is, I can think of ways of doing it, but they're not very nice, and I
> >>> don't think they're going to be fast.
> >>>
> >>> Basically, I have a load of HTML formatted content in a database that
> >>> get displayed onto the site. It's part of a rudimentary CMS.
> >>>
> >>> Currently, the titles for each article are displayed on a page, and each
> >>> title links to the full article. However, that leaves me with a page
> >>> which is essentially a list of links, and that's not ideal for SEO. What
> >>> I wanted to do to enhance the page is to have a short excerpt of x
> >>> number of words/characters beneath each article title. The idea being
> >>> that search engines will find the page as more than a link farm, and
> >>> visitors won't have to just rely on the title alone for the content.
> >>>
> >>> Here's the rub though. As the content is in HTML form, I can't just grab
> >>> the first 100 characters and display them as that could leave an open
> >>> tag without a closing one, potentially breaking the page. I could use
> >>> strip_tags on the 100-character excerpt, but what if the excerpt itself
> >>> broke a tag in half (i.e. <acronym title="something"> could become
> >>> <acron )
> >>>
> >>> The only solutions I can see are:
> >>>
> >>>   * retrieve the entire article, perform a strip_tags and then take
> >>>     the excerpt
> >>>   * use a regex inside of mysql to pull out only the text
> >>>
> >>> The thing is, neither of these seems particularly pretty, and I am sure
> >>> there's a better way, but it's too early in the week for my brain to be
> >>> fully functional I think!
> >>>
> >>> Does anyone have any ideas about what I could do, or do you think I'm
> >>> seeing problems where there are none?
> >>
> >> Use htmltidy or htmlpurifier to clean up things. I.e. grab the amount
> >> of content you want, then use one of the tools to repair and clean the
> >> html.
> >>
> >> Regards
> >> Peter
> >>
> >> --
> >> <hype>
> >> WWW: http://plphp.dk / http://plind.dk
> >> LinkedIn: http://www.linkedin.com/in/plind
> >> Flickr: http://www.flickr.com/photos/fake51
> >> BeWelcome: Fake51
> >> Couchsurfing: Fake51
> >> </hype>
> >
> > Would that work on content that stopped mid-tag? Assuming the original
> > copy is:
> >
> > <p>This is some sentence, with an <abbr title="Abbreviation">abbr</abbr>
> > in the middle of it.</p>
> >
> > If I was asking for only the first 50 characters, I'd get this:
> >
> > <p>This is some sentence, with an <abbr title="Abb
> >
> > Would either htmltidy or htmlpurifier be able to handle that? I don't
> > mind whether it tries to repair the tag or remove it completely, as long
> > as it does something to it.
> >
> > Thanks,
> > Ash
> > http://www.ashleysheridan.co.uk
>
> When looking at the performance side of things, couldn't you add
> another column to the table and do this work to tidy / strip tags
> during the insert going forward?
>
> Any current data would need a one time script to clean / tidy the
> current data. you could run this on a nightly cron ( depending on how
> much data there is) until the new column is filled with clean data.
>
> Bastien
>
> Sent from my iPod

That's not a bad idea actually, I hadn't thought of it! I'm kicking
myself now, because it's such an obvious solution!

Thanks,
Ash
http://www.ashleysheridan.co.uk
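
For anyone finding this thread later, here is a rough sketch of the "strip the tags first, then take the excerpt" approach, which avoids ever cutting a tag in half. It is written in Python rather than PHP purely for illustration; in PHP the tag-stripping step would be strip_tags, as mentioned above. The function name make_excerpt, the max_chars parameter and the trailing "..." are assumptions of this sketch, not anything posted in the thread.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes only; tags and their attributes are discarded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def make_excerpt(html, max_chars=100):
    """Return a plain-text excerpt of at most max_chars characters
    (plus a trailing "..." when the text had to be shortened).

    Hypothetical helper: the tags are removed from the WHOLE article
    before truncating, so the cut can never land inside a tag.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()

    # Collapse runs of whitespace so the length limit counts real text.
    text = " ".join("".join(parser.parts).split())
    if len(text) <= max_chars:
        return text

    # Look one character past the limit so a word that ends exactly at
    # the limit is kept; otherwise back up to the last whole word.
    window = text[:max_chars + 1]
    if " " in window:
        cut = window.rsplit(" ", 1)[0]
    else:
        cut = text[:max_chars]  # one very long word: hard cut
    return cut + "..."


sample = ('<p>This is some sentence, with an '
          '<abbr title="Abbreviation">abbr</abbr> in the middle of it.</p>')
print(make_excerpt(sample, 30))
```

Following Bastien's suggestion, something like this would be called once when an article is saved, with the result stored in the extra column, so the listing page only ever reads the ready-made excerpt. Note that this sketch keeps the text inside script and style elements as well, so it assumes the stored articles do not contain any.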