Re: html analyzer

Bill Guion <bguion@xxxxxxxxxxx> · Wed, 19 May 2010 13:24:35 -0400

At 12:30 AM +0200 5/19/10, Rene Veerman wrote:

Hi.

I'm trying to build an HTML analyzer that looks at natural-language words in HTML text.

I'd like to build a routine that walks through the HTML character by
character, but I'm not sure how to properly walk through escaped "
and ' characters in JavaScript or other embedded languages. Skipping
the first " and ' is no problem, but the escaped " and ' that follow
can get difficult, imo.

If you have any ideas on this I'd like to hear them.

--
---------------------------------
Greetings from Rene7705,

My free open source webcomponents:
  http://code.google.com/u/rene7705/
  http://mediabeez.ws/downloads (and demos)

http://www.facebook.com/rene7705
---------------------------------

Rene,

I agree with the previous post - what you want to do is non-trivial.
However, to address your question: one approach is to keep a single
quote flag (sqf) and a double quote flag (dqf). When you encounter
the first quote of a given type, set that flag. When you encounter the
second quote of the same type, clear the flag. At the end of the input,
both flags should be clear, or the HTML is malformed. You can also get
more sophisticated and verify that you do not encounter a single,
double, single sequence or a double, single, double sequence. That
means remembering which quote came first, second, and third - the
third should be the same as the second, for example.
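
As a rough illustration only, the flag idea might be sketched in Python
as below. The function name quotes_balanced is hypothetical, and two
things in it are assumptions beyond what is described above: a backslash
inside a quoted string escapes the next character (as in JavaScript
string literals), and while one flag is set the other kind of quote is
treated as ordinary text rather than toggling its own flag.

```python
def quotes_balanced(text):
    """Return True if every quote opened in text is closed again."""
    sqf = False  # single quote flag: currently inside '...'
    dqf = False  # double quote flag: currently inside "..."
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and (sqf or dqf):
            # Assumption: inside a quoted string, a backslash escapes the
            # next character, so skip the backslash and that character.
            i += 2
            continue
        if ch == "'" and not dqf:
            # A single quote only counts when we are not inside "...".
            sqf = not sqf
        elif ch == '"' and not sqf:
            # A double quote only counts when we are not inside '...'.
            dqf = not dqf
        i += 1
    # Both flags must be clear at the end, otherwise a quote was left open.
    return not (sqf or dqf)


if __name__ == "__main__":
    print(quotes_balanced(r'var s = "say \"hi\"";'))
    print(quotes_balanced('var s = "unterminated;'))
```

This only checks that quotes pair up. It does not know where a script
block or an attribute value starts and ends, so for real HTML it would
have to be applied to those pieces separately.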

     -----===== Bill =====-----
--

Don't find fault. Find a remedy. - Henry Ford

--
PHP General Mailing List (http://www.php.net/)