Re: Extract printable text from web page using preg_match

"Satyam" <Satyam@xxxxxxxxxxxxx> · Tue, 27 Feb 2007 19:51:39 +0100

----- Original Message ----- 
From: "M5" <m5@xxxxxxxxxxxxxxxx>
To: <php-general@xxxxxxxxxxxxx>
Sent: Tuesday, February 27, 2007 6:47 PM
Subject:  Extract printable text from web page using preg_match

I am trying to write a regex function to extract the readable
(visible, screen-rendered) portion of any web page. Specifically, I
only want the text between the <body> tags, excluding any <script> or
<style> tags within the document, also excluding comments. Has anyone
here seen such a regex? Is it possible to do in one expression?

...Rene

Though it might be possible to do this by handling the HTML document as a
plain string, the fact is that lots of HTML is not well formed.  As
suggested, using strpos for the start and end tags might be a faster and
easier solution than a regex, but no browser will complain if it reaches the
end of the document without finding a closing body tag, so you cannot count
on one being there.  The opening body tag might also carry lots of
attributes, such as onLoad, class, style, target .... you name it, you'll
eventually find it, so searching for a literal '<body>' will often miss it.
Unless you know the 'quality' of the documents you plan to parse, I would
suggest you use a DOM parser.  PHP ships with two alternatives, depending on
the version, which might or might not be installed in your setup; otherwise,
you'll have to google for something you can use.
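
To give an idea of the DOM route, here is a rough sketch, assuming PHP 5 with
the bundled DOM extension (DOMDocument and DOMXPath) enabled.  The function
name extract_visible_text is made up for this example, and collapsing runs of
whitespace at the end is my own choice, not something the question asked for.
Treat it as a starting point, not a finished solution.

```php
<?php
// Hypothetical helper (name invented for this example): returns the text
// inside <body>, without <script>, <style> or comment content.
// Assumes the PHP 5 DOM extension is available.
function extract_visible_text($html)
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();

    // Real-world HTML is often malformed; keep libxml's parse warnings
    // out of the output while loading, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Collect script/style elements and comments first, then remove them,
    // so the tree is not modified while it is being walked.
    $xpath = new DOMXPath($doc);
    $unwanted = array();
    foreach ($xpath->query('//script | //style | //comment()') as $node) {
        $unwanted[] = $node;
    }
    foreach ($unwanted as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // The parser builds a body element even when the closing tag is
    // missing or the opening tag carries attributes.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // textContent concatenates all remaining text nodes under <body>.
    return trim(preg_replace('/\s+/', ' ', $body->textContent));
}
```

Called as extract_visible_text(file_get_contents($url)), with $url being
whatever page you are after, this should cope with a body tag full of
attributes and with a missing closing tag, which are the two cases where the
strpos approach breaks down.  Note that loadHTML guesses the character set
from the document itself, so pages that do not declare one may need extra
care.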

Satyam

--
PHP General Mailing List (http://www.php.net/)