Re: Good HTML parser needed

On Tue, May 13, 2008 at 6:06 AM, Per Jessen <per@xxxxxxxxxxxx> wrote:
> Shelley wrote:
>
>> I want to know whether there are some good HTML parsers written in
>> PHP.
>>
>> That is,
>> the parser checks whether html tags like table, tr, td, div, dt, dl,
>> dd, script, ul, li, span, h1, h2, etc. are nested correctly.
>> If any tags not matched, just remove them.
>
> Except for the last part, any XML parser will do.  Sablotron, xalan,
> libxsl etc.
>
>
> /Per Jessen, Zürich
... except when the HTML is not well-formed XML, as I find is often
the case when accepting input from users. That "last part," as you
say, is kind of essential. It could be as simple as tags that don't
close in HTML (e.g. <img>, <br>, <hr>) or it could be something much
trickier to clean up such as mismatched tags, improper nesting,
missing closing tags (since some browsers are too forgiving of not
closing <td>, <li> or <option>), HTML entities that are not valid in
XML, etc. In these cases, the DOM-type parsers will usually choke. You
might be able to salvage something with the stream-based parsers like
SAX. (I've never tried it.)
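
To make the difference concrete, here is a rough sketch of the two approaches. It is written in Python rather than PHP purely for illustration, using only the standard library, and it only reports problems instead of removing tags. The names is_well_formed_xml, NestingChecker, check_nesting and the (incomplete) VOID_TAGS list are my own inventions for this example, not part of any library.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumption: a partial list of HTML tags that never take a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


def is_well_formed_xml(fragment):
    """Return True if a strict XML parser accepts the fragment."""
    try:
        # Wrap in a dummy root so fragments with several top-level
        # elements are not rejected for that reason alone.
        ET.fromstring("<root>" + fragment + "</root>")
        return True
    except ET.ParseError:
        return False


class NestingChecker(HTMLParser):
    """Stream-based check: track open tags on a stack, never abort."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems.append("unexpected </%s>" % tag)


def check_nesting(fragment):
    """Return a list of nesting problems found in an HTML fragment."""
    checker = NestingChecker()
    checker.feed(fragment)
    checker.close()
    return checker.problems + ["unclosed <%s>" % t for t in checker.stack]
```

The strict parser rejects "<p>line one<br>line two</p>" and "<p>a &nbsp; b</p>" outright, while the stream-based checker reads through both without complaint and still flags "<b><i>text</b></i>" as badly nested. Note that this sketch also reports an omitted </li> as unclosed, even though HTML allows it, so a real cleanup tool would need more rules than this.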
Andrew
