----- Original Message -----
From: "M5" <m5@xxxxxxxxxxxxxxxx>
To: <php-general@xxxxxxxxxxxxx>
Sent: Tuesday, February 27, 2007 6:47 PM
Subject: Extract printable text from web page using preg_match
I am trying to write a regex function to extract the readable
(visible, screen-rendered) portion of any web page. Specifically, I
only want the text between the <body> tags, excluding any <script> or
<style> tags within the document, also excluding comments. Has anyone
here seen such a regex? Is it possible to do in one expression?
...Rene
Though it might be possible to do this by handling the HTML document as a plain
string, the fact is that lots of HTML is not well formed. As suggested, using
strpos to find the start and end tags might be faster and easier than
a regex, but no browser will complain if it reaches the end of the document
without finding a closing body tag, so you cannot count on one being there.
The opening body tag might also carry lots of attributes, such as onLoad,
class, style, target ... you name it, you'll eventually find it. Unless you
know the 'quality' of the documents you plan to parse, I would suggest you
use a DOM parser. PHP ships with two alternatives, depending on the version
(the DOM XML extension in PHP 4, the DOM extension in PHP 5), which might or
might not be installed in your setup; otherwise, you'll have to google for
something you can use.
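
To give an idea of the DOM approach, here is a minimal sketch, assuming the
PHP 5+ DOM extension (DOMDocument and DOMXPath) is available. The function
name extractVisibleText is made up for this example, and joining the pieces
with single spaces is just one possible choice. It is not a complete
"what the browser renders" extractor (it knows nothing about CSS hiding, for
instance); it only skips script, style and comments as asked.

```php
<?php
// Hypothetical helper (name invented for this sketch): returns the text
// found inside <body>, skipping <script>, <style> and HTML comments.
function extractVisibleText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();

    // Real-world HTML is often malformed, so keep libxml's parse
    // warnings out of the output and discard them afterwards.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Text nodes below <body> that have no <script> or <style> ancestor.
    // Comments are a different node type, so text() never selects them.
    $nodes = $xpath->query(
        '//body//text()[not(ancestor::script) and not(ancestor::style)]'
    );

    $parts = [];
    foreach ($nodes as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }

    // Assumption: one space between fragments is good enough here.
    return implode(' ', $parts);
}
```

Because the parser builds the tree itself, it does not matter whether the
body tag has attributes or whether the closing body tag is missing. Note
that loadHTML guesses the character encoding when the page does not declare
one, so non-ASCII pages may need extra care.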
Satyam