Re: Extract printable text from web page using preg_match

----- Original Message ----- From: "M5" <m5@xxxxxxxxxxxxxxxx>
To: <php-general@xxxxxxxxxxxxx>
Sent: Tuesday, February 27, 2007 6:47 PM
Subject:  Extract printable text from web page using preg_match


I am trying to write a regex function to extract the readable
(visible, screen-rendered) portion of any web page. Specifically, I
only want the text between the <body> tags, excluding any <script> or
<style> tags within the document, also excluding comments. Has anyone
here seen such a regex? Is it possible to do in one expression?

...Rene


Though it might be possible to do this by handling the HTML document as a plain string, the fact is that lots of HTML is not well formed. As suggested, using strpos to find the start and end tags might be a faster and easier solution than a regex, but no browser will complain if it reaches the end of the document without finding a closing body tag, so you cannot count on one being there. The opening body tag might also carry lots of attributes, such as onLoad, class, style, target ... you name it, you'll eventually find it, so searching for a literal '<body>' will not work either. Unless you know the 'quality' of the documents you plan to parse, I would suggest you use a DOM parser. PHP ships with two alternatives, depending on the version, which might or might not be installed in your setup; otherwise, you'll have to google for something you can use.
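To make the DOM suggestion a bit more concrete, here is a minimal sketch of how it could look. It assumes PHP 7 or later with the DOM extension enabled (newer than anything discussed in this thread), and the function name extractVisibleText is made up for illustration. It is only one possible approach, not a tested solution for every page: it lets libxml repair the markup, removes script, style and comment nodes below body, and returns what text is left.

```php
<?php
// Hypothetical helper, not from the original thread.
// Assumes PHP 7+ with the DOM (libxml) extension available.
function extractVisibleText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them;
    // real-world HTML is often malformed and libxml will complain.
    $previous = libxml_use_internal_errors(true);
    // Note: without an explicit charset declaration in the markup,
    // loadHTML may not treat the input as UTF-8.
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser builds a body element even when the closing tag is
    // missing or the opening tag carries attributes.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Find script elements, style elements and comments below body.
    $xpath = new DOMXPath($doc);
    $unwanted = $xpath->query('.//script | .//style | .//comment()', $body);

    // Copy the matches into an array first, then detach each one.
    $nodes = [];
    foreach ($unwanted as $node) {
        $nodes[] = $node;
    }
    foreach ($nodes as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Collapse runs of whitespace so the result reads as plain text.
    return trim((string) preg_replace('/\s+/', ' ', $body->textContent));
}

// Usage: a page with attributes on body and no closing body tag.
$page = '<html><head><title>Title</title></head>'
      . '<body onload="init()" class="main"><!-- hidden note -->'
      . '<p>Hello</p><script>var a = 1;</script><style>.x{}</style> world';
echo extractVisibleText($page), "\n";
```

Note that this keeps all text inside body, including text that CSS would hide on screen, so it is an approximation of the "visible" portion at best.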

Satyam


