Re: Re: Extract printable text from web page using preg_match

On 28-Feb-07, at 1:48 AM, Colin Guthrie wrote:

M5 wrote:
No, it's not a very good solution. strip_tags() will leave everything
within <head>, <style> and <script> (in the body or out). Comments are
also included.

I know it's possible to use non-regex strpos()/substr() calls to extract
everything within <body>, but as another poster correctly said, this assumes
a consistent HTML document (which is not the case).

I realize now that such a regex would be rather sophisticated, but I
thought surely it must exist, since scraping the readable content
of a web page cannot be a rare task.
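
For illustration only, here is a rough sketch of the regex route being asked about: remove comments and the <head>, <script> and <style> blocks first, then hand the rest to strip_tags(). The function name extract_visible_text() is hypothetical, and the patterns assume reasonably well-formed markup (matching closing tags, no "-->" inside scripts), so treat it as a starting point rather than a robust scraper.

```php
<?php
// Hypothetical helper, not from the thread: drop non-visible blocks, then strip tags.
function extract_visible_text(string $html): string
{
    // Comments, plus head/script/style elements together with their contents.
    $patterns = [
        '/<!--.*?-->/s',
        '#<head\b[^>]*>.*?</head>#is',
        '#<script\b[^>]*>.*?</script>#is',
        '#<style\b[^>]*>.*?</style>#is',
    ];
    $cleaned = preg_replace($patterns, ' ', $html);
    if ($cleaned === null) {
        // preg_replace() returns null on a regex error; fall back to the raw input.
        $cleaned = $html;
    }

    // Put a space before each tag so adjacent blocks do not run together
    // once the tags are gone ("<p>a</p><p>b</p>" becomes "a b", not "ab").
    $cleaned = str_replace('<', ' <', $cleaned);

    $text = strip_tags($cleaned);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

Malformed pages (an unclosed <script>, for instance) will defeat these patterns, which is the same consistency problem mentioned above.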

I've said it before, but a low-tech solution is to run the "lynx" program with the -dump argument and capture the output back in PHP. I'm assuming you are on Linux or OS X, as I've not heard of anyone using lynx on Windows.

Thanks, that sounds like a good direction. And yes, I'm on OS X.

There are loads of command-line options to control how lynx formats
its output, so you have fine-grained control here.
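
A minimal sketch of the lynx approach from PHP might look like the following. It assumes the lynx binary is installed and on the PATH; the function names build_lynx_command() and fetch_page_text() are made up for this example. The -nolist option is added to suppress the list of links that lynx otherwise appends to a dump.

```php
<?php
// Hypothetical helpers; assumes lynx is installed and reachable on the PATH.
function build_lynx_command(string $url): string
{
    // -dump writes the rendered page to stdout; -nolist omits the trailing link list.
    // escapeshellarg() quotes the URL so it cannot inject extra shell commands.
    return 'lynx -dump -nolist ' . escapeshellarg($url);
}

function fetch_page_text(string $url): ?string
{
    // shell_exec() returns the command's output as a string,
    // or null/false when there is no output or the call fails.
    $output = shell_exec(build_lynx_command($url));
    return is_string($output) ? $output : null;
}
```

Usage would be something like `$text = fetch_page_text('http://www.php.net/');`, checking for null before using the result.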

Col

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

