Re: generating an html intro text ...

Jochem Maas <jochem@xxxxxxxxxxxxx> · Mon, 18 Jun 2007 16:18:56 +0200

Stut wrote:
> Jochem Maas wrote:
>> Stut wrote:
>>> I'm sure Tidy could be employed to do this job. Grab your target length
>>> of text, backtrack until you find < or >. If it's a < then chop that bit
>>> off. Then give it to Tidy to fix the HTML. That should close off any
>>> open tags and give you a properly formed snippet.
>>>
>>> Or you could implement the same functionality yourself quite easily, and
>>> it doesn't need to be well-formed HTML but you would need to check for
>>> tags that don't need closing (img, br, etc).
>>
>> Stut, I'm going to go out on a limb and say 'bullshit' to those comments,
>> no personal offence meant (I mostly value your input *a lot*, but I guess
>> we can't agree all the time):
> 
> None taken, 

good.

> but I think you're making this into a bigger problem than it
> is.

okay.

> 
>> 1. simply backtracking to a '<' or '>' doesn't account for unescaped
>> output within the content pieces/parts of the html, e.g.:
>>
>>     <P>look this here character '<' ammounts to invalid html<p>
> 
> If the content is not escaped then it's not going to get displayed
> properly by the browser, so why worry about it?

because plenty/most browsers do display it 'properly' even though the actual
HTML is invalid - most of my clients use that POS called IE - which does an amazing
job at turning dogshit tagsoup (especially M$ proprietary generated cruft - think FrontPage, Word,
Publisher etc) into something 'pretty' in the browser window (and usually displays what
the user/client expected) ... we'll not go into the flipside where valid HTML is borked when output :-/

the client cares that things always display correctly - regardless of the cruft that
marketing drone X input into the [internal company] backend system .. when it's 10-30K euros in turnover
a day that's understandable.

> 
>> 2. backtracking in the manner you describe does not account for a large
>> amount of html at the beginning of the string - remember that the aim of
>> the game is to truncate the user visible content of the html string in
>> question (the comment below would be required to be part of the output -
>> think html comment spam, which is quite effective and which some clients
>> state as a requirement):
>>
>>     <!-- imagine this comment is 1000s of chars long --><P>look at this</P>
> 
> Comments are easily removed before snippet generation with a simple regex.
> 
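
for reference, the 'simple regex' step can be sketched roughly as below. this is
Python rather than PHP, purely for brevity; `strip_comments` is a made-up name, and
the pattern assumes every comment is properly terminated by '-->'. it only helps
when the comments are not required in the output.

```python
import re

# Non-greedy match that also spans newlines; assumes each comment ends with '-->'.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html: str) -> str:
    """Remove HTML comments before the visible text is measured or truncated."""
    return COMMENT_RE.sub("", html)
```

an unterminated comment is left untouched by this pattern, which is one of the
cases a real parser would have to decide about.
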
>> you say 'you could implement the same functionality yourself quite
>> easily' - if that was true then why does nobody have a readymade solution?
>> it's not as if the requirement is 'way out there' as far as web
>> development goes. I don't think creating routines for doing advanced
>> manipulation of XML/HTML strings 'properly' is something very simple,
>> something that can be attested to by the relative complexity of libtidy
>> and libxml2.
>>
>> I'm not looking for a hack job here, I need something that can be relied
>> on to output html strings that will not break the validity (& layout) of
>> a page regardless of the quality of input.
>>
>> the more I think about it the more I'm inclined to think that Tidy is
>> about to become a close friend. :-)
> 
> If I were you I would definitely be looking at Tidy because it's been
> well tested in the real world and will save a huge amount of effort.
> However, I don't think writing code to do this amounts to a hack. The
> rules of markup are very clear and with a little thought you could
> easily handle the vast majority of possible situations.

probably, but 'vast majority' != 'ALL' - and I guarantee marketing drone X will
break the system before I can turn my back, Murphy's law and all that!
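
to make the 'do it yourself' option concrete, here is a minimal sketch of the idea
discussed in this thread: count only user visible characters, keep a stack of open
tags, skip the tags that don't need closing (img, br, etc), and close whatever is
still open at the cut-off. it is Python rather than PHP, built on the standard
html.parser module; `SnippetBuilder`, `snippet` and `VOID_TAGS` are names invented
for this example, and it makes no claim to handle ALL input the way Tidy would.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (the "img, br, etc" case).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SnippetBuilder(HTMLParser):
    """Copy markup through until `limit` visible characters have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit       # visible characters still allowed
        self.out = []            # output fragments, joined at the end
        self.open_tags = []      # stack of elements opened but not yet closed
        self.done = False        # True once the character budget is used up

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit as-is, nothing to track.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return  # stray closing tag: drop it
        # Close anything left open inside this element, then the element itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        piece = data[:self.limit]
        self.limit -= len(piece)
        # Re-escape so a raw '<' or '&' in the text cannot break the output.
        self.out.append(escape(piece, quote=False))
        if self.limit <= 0:
            self.done = True

    def handle_comment(self, data):
        # Comments are passed through and do not count as visible text.
        if not self.done:
            self.out.append("<!--%s-->" % data)


def snippet(html, limit):
    """Return `html` cut to at most `limit` visible characters, tags closed."""
    builder = SnippetBuilder(limit)
    builder.feed(html)
    builder.close()
    closers = "".join("</%s>" % t for t in reversed(builder.open_tags))
    return "".join(builder.out) + closers


print(snippet("<p>Hello <b>wonderful</b> world</p>", 9))
```

known gaps in this sketch: the contents of script and style elements are counted as
visible text, and how a stray '<' is tokenised is left entirely to the parser, which
is exactly the kind of input where 'vast majority' and 'ALL' part ways.
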

1 word, short word, 2 syllables, first syllable sounds like 'sky'. :-)

-- 
PHP General Mailing List (http://www.php.net/)