On Mon, 2010-04-26 at 13:20 +0200, Peter Lind wrote:

> On 26 April 2010 12:52, Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> wrote:
> > I've been thinking about this problem for a little while, and the thing
> > is, I can think of ways of doing it, but they're not very nice, and I
> > don't think they're going to be fast.
> >
> > Basically, I have a load of HTML formatted content in a database that
> > get displayed onto the site. It's part of a rudimentary CMS.
> >
> > Currently, the titles for each article are displayed on a page, and each
> > title links to the full article. However, that leaves me with a page
> > which is essentially a list of links, and that's not ideal for SEO. What
> > I wanted to do to enhance the page is to have a short excerpt of x
> > number of words/characters beneath each article title. The idea being
> > that search engines will find the page as more than a link farm, and
> > visitors won't have to just rely on the title alone for the content.
> >
> > Here's the rub though. As the content is in HTML form, I can't just grab
> > the first 100 characters and display them as that could leave an open
> > tag without a closing one, potentially breaking the page. I could use
> > strip_tags on the 100-character excerpt, but what if the excerpt itself
> > broke a tag in half (i.e. <acronym title="something"> could become
> > <acron )
> >
> > The only solutions I can see are:
> >
> >       * retrieve the entire article, perform a strip_tags and then take
> >         the excerpt
> >       * use a regex inside of mysql to pull out only the text
> >
> > The thing is, neither of these seems particularly pretty, and I am sure
> > there's a better way, but it's too early in the week for my brain to be
> > fully functional I think!
> >
> > Does anyone have any ideas about what I could do, or do you think I'm
> > seeing problems where there are none?
>
> Use htmltidy or htmlpurifier to clean up things. I.e. grab the amount
> of content you want, then use one of the tools to repair and clean the
> html.
>
> Regards
> Peter
>
> --
> <hype>
> WWW: http://plphp.dk / http://plind.dk
> LinkedIn: http://www.linkedin.com/in/plind
> Flickr: http://www.flickr.com/photos/fake51
> BeWelcome: Fake51
> Couchsurfing: Fake51
> </hype>
>

Would that work on content that stopped mid-tag? Assuming the original
copy is:

<p>This is some sentence, with an <abbr
title="Abbreviation">abbr</abbr> in the middle of it.</p>

If I was asking for only the first 50 characters, I'd get this:

<p>This is some sentence, with an <abbr title="Abb

Would either htmltidy or htmlpurifier be able to handle that? I don't
mind whether it tries to repair the tag or remove it completely, as long
as it does something to it.

Thanks,
Ash
http://www.ashleysheridan.co.uk
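
For comparison, the first option from the original post (run strip_tags over the whole article, then take the excerpt) can be sketched roughly as below. Because the tags are removed before anything is cut, a tag can never be split in half. The function name `make_excerpt`, the 100-character default, and the choice to cut at a word boundary and append "..." are all assumptions made for illustration; nothing in the thread specifies them. The sketch also assumes the content is UTF-8.

```php
<?php
/**
 * Hypothetical helper (not from the thread): build a plain-text excerpt
 * from HTML by stripping tags from the WHOLE article before truncating.
 */
function make_excerpt(string $html, int $maxChars = 100): string
{
    // Remove all tags first, so truncation can never cut a tag in half.
    $text = strip_tags($html);

    // Decode entities so that e.g. "&amp;" counts as one character
    // and cannot be split in the middle either.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace left behind by removed block tags.
    // preg_replace returns null on invalid UTF-8; fall back to the input.
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? $text);

    // Take at most $maxChars characters, ending just before whitespace
    // or at the end of the text (a word boundary).
    $pattern = '/^.{0,' . $maxChars . '}(?=\s|$)/us';
    if (!preg_match($pattern, $text, $m) || $m[0] === '') {
        // First word is longer than the limit: fall back to a hard cut.
        preg_match('/^.{0,' . $maxChars . '}/us', $text, $m);
    }
    $excerpt = $m[0] ?? '';

    // Mark the excerpt as truncated only if something was dropped.
    $suffix = strlen($excerpt) < strlen($text) ? '...' : '';

    // Re-escape, since the excerpt is going back into an HTML page.
    return htmlspecialchars($excerpt, ENT_QUOTES, 'UTF-8') . $suffix;
}

$html = '<p>This is some sentence, with an '
      . '<abbr title="Abbreviation">abbr</abbr> in the middle of it.</p>';

echo make_excerpt($html, 50), "\n";
```

The trade-off is the one already raised in the post: the entire article has to be fetched and processed for every excerpt, so on a busy listing page it may be worth caching the result or storing the excerpt alongside the article when it is saved.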