On Tue, May 13, 2008 at 6:06 AM, Per Jessen <per@xxxxxxxxxxxx> wrote:
> Shelley wrote:
>
>> I want to know whether there are some good HTML parsers written in
>> PHP.
>>
>> That is,
>> the parser checks whether html tags like table, tr, td, div, dt, dl,
>> dd, script, ul, li, span, h1, h2, etc. are nested correctly.
>> If any tags not matched, just remove them.
>
> Except for the last part, any XML parser will do. Sablotron, xalan,
> libxsl etc.
>
>
> /Per Jessen, Zürich

... except when the HTML is not well-formed XML, as I find is often
the case when accepting input from users. That "last part," as you
say, is kind of essential. It could be as simple as tags that don't
close in HTML (e.g. <img>, <br>, <hr>) or it could be something much
trickier to clean up such as mismatched tags, improper nesting,
missing closing tags (since some browsers are too forgiving of not
closing <td>, <li> or <option>), HTML entities that are not valid in
XML, etc. In these cases, the DOM-type parsers will usually choke. You
might be able to salvage something with the stream-based parsers like
SAX. (I've never tried it.)

Andrew
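
As a rough illustration of the stream-based idea above, here is a minimal sketch. It is written in Python rather than PHP and was not tested by anyone in this thread, so treat it as an assumption-laden example, not a recommended implementation. It uses Python's tolerant event-based `html.parser` as a stand-in for a SAX-style parser; `TagBalancer`, `clean`, `is_well_formed` and the `VOID_TAGS` list are hypothetical names made up for the example. The strict XML parser rejects the messy input outright, while the event handler drops stray closing tags, closes whatever was left open, and re-emits entities as plain characters.

```python
import xml.etree.ElementTree as ET
from html import escape
from html.parser import HTMLParser

# Tags that never take a closing tag in HTML (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagBalancer(HTMLParser):
    """Hypothetical helper: rebuilds markup from parser events so tags nest."""

    def __init__(self):
        # convert_charrefs=True delivers entities such as &nbsp; as characters.
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # stack of currently open (non-void) tags
        self.out = []        # pieces of the cleaned-up output

    def handle_starttag(self, tag, attrs):
        parts = []
        for name, value in attrs:
            parts.append(' %s="%s"' % (name, escape(value or "")))
        attr_text = "".join(parts)
        if tag in VOID_TAGS:
            # Emit void tags in self-closing form so XML parsers accept them.
            self.out.append("<%s%s />" % (tag, attr_text))
        else:
            self.out.append("<%s%s>" % (tag, attr_text))
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray or void closing tag: drop it
        # Close everything opened since the matching start tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def clean(fragment):
    """Return the fragment with tags balanced and stray closers removed."""
    parser = TagBalancer()
    parser.feed(fragment)
    parser.close()
    while parser.open_tags:  # close anything still open at end of input
        parser.out.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.out)


def is_well_formed(fragment):
    """True if the fragment parses as XML once wrapped in a single root."""
    try:
        ET.fromstring("<root>%s</root>" % fragment)
        return True
    except ET.ParseError:
        return False
```

For an input such as `<ul><li>one<li>two<br></ul></b>&nbsp;done`, `is_well_formed` fails on the raw text but passes on the result of `clean`. Note the limits of the sketch: it has no knowledge of HTML's implicit-close rules, so a second `<li>` ends up nested inside the first rather than beside it, and comments and doctypes are silently discarded.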