Scraping Multiple sites

I'm currently stuck on a little problem. I'm using cURL in conjunction
with DOMDocument and XPath to scrape data from a couple of websites.
Please note that this is only for personal and educational purposes.

Right now I have 5 independent scripts (one for each of 5 websites)
that run from my crontab every 12 hours. However, as you may have
guessed, this is a scalability nightmare. If my list of websites to
scrape grows, I have to create another independent script and add
another cron entry for it.

My knowledge of OOP is fairly basic, as I have only just gotten started
with it. However, could anyone suggest a design pattern that would
suit my needs? My solution would be to create an abstract class for
the web crawler and then simply extend it for each website I add.
However, as I said, my experience with OOP is almost non-existent, so
I have no idea how this would scale. I want this 'crawler' to be one
application that runs from a single cron entry, rather than having one
script per website and having to create a new cron entry by hand each
time.

Or does anyone have any experience with this sort of thing and could
maybe offer some advice?

I'm not limited to using PHP either; however, due to hosting
constraints, Python would most likely be my only other alternative.
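
To make the abstract-class idea concrete, here is a rough sketch of
the structure I have in mind, written in Python only because it is
short; the same shape should map onto a PHP abstract class. Everything
named here (SiteScraper, ExampleSiteScraper, run_all, the example URL)
is a placeholder I made up, and the title extraction merely stands in
for the real per-site XPath queries.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser
from urllib.request import urlopen


class SiteScraper(ABC):
    """Base class: shared fetching, site-specific parsing."""

    url: str  # each subclass sets the page it scrapes

    def fetch(self) -> str:
        # Shared download step (the part cURL does in the PHP scripts).
        with urlopen(self.url, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")

    @abstractmethod
    def parse(self, html: str) -> dict:
        """Extract this site's data from the raw HTML."""

    def run(self) -> dict:
        return self.parse(self.fetch())


class _TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


class ExampleSiteScraper(SiteScraper):
    # Hypothetical site; a real subclass would hold that site's own queries.
    url = "http://www.example.com/"

    def parse(self, html: str) -> dict:
        parser = _TitleParser()
        parser.feed(html)
        return {"title": parser.title.strip()}


def run_all(scrapers):
    """Run every scraper; a failure on one site does not stop the others."""
    results, errors = {}, {}
    for scraper in scrapers:
        name = type(scraper).__name__
        try:
            results[name] = scraper.run()
        except Exception as exc:  # record the error, move on to the next site
            errors[name] = str(exc)
    return results, errors
```

The single cron entry would then run one small script that calls
run_all() with one instance per site, e.g.
run_all([ExampleSiteScraper()]), so adding a website means adding one
subclass and one list item rather than a new script and cron line.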

Any help would be appreciated.

Cheers,
Russell

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


