Re: Scraping Multiple sites

chris h <chris404@xxxxxxxxx> · Sun, 3 Oct 2010 00:21:45 -0400

On Sat, Oct 2, 2010 at 9:03 PM, Russell Dias <rus321@xxxxxxxxx> wrote:

> I'm currently stuck on a little problem. I'm using cURL in conjunction
> with DOMDocument and Xpath to scrape data from a couple of websites.
> Please note that this is only for personal and educational purposes.
>
> Right now I have 5 independent scripts (that traverse through 5
> websites) that run via a cron tab every 12 hours. However, as you may
> have guessed this is a scalability nightmare. If my list of websites
> to scrape grows I have to create another independent script and run it
> via cron.
>
> My knowledge of OOP is fairly basic as I have just gotten started with
> it. However, could anyone perhaps suggest a design pattern that would
> suit my needs? My solution would be to create an abstract class for
> the web crawler and then simply extend it per website I add on.
> However, as I said, my experience with OOP is almost non-existent,
> therefore I have no idea how this would scale. I want this 'crawler'
> to be one application which can run via one cron rather than having n
> scripts, one per website, and having to manually create a
> cron each time.
>
> Or does anyone have any experience with this sort of thing and could
> maybe offer some advice?
>
> I'm not limited to using PHP either, however due to hosting
> constraints Python would most likely be my only other alternative.
>
> Any help would be appreciated.
>
> Cheers,
> Russell

Are the sites you are crawling really so different that each one needs its
own chunk of code?  I would try to avoid site-specific code altogether;
otherwise, scaling the application to even a hundred sites means maintaining
a hundred overlapping pieces of functionality, which is a logistical
nightmare.  Unless you simply want to do this for educational reasons...

My suggestion would be to build one application that can crawl all the
sites, with nothing specific to any one of them.  You could fire it from a
single cron job and give it a list of the URLs you want it to hit.  It
crawls one URL, records the findings, moves on to the next, and repeats.
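
A rough sketch of that loop, in Python since you mentioned it as an option
(the same shape would work with cURL and DOMDocument in PHP).  Everything
here is an assumption for illustration: the names PageSummary, fetch and
crawl are made up, and "findings" is taken to mean the page title plus its
links, which may not be what you actually want to record.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class PageSummary(HTMLParser):
    """Collects the <title> text and every <a href> value from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Download a page as text (hypothetical default; swap in your own)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, fetch=fetch):
    """Visit each URL in turn and return one record per URL."""
    records = []
    for url in urls:
        try:
            html = fetch(url)
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            # One unreachable site should not stop the whole run.
            records.append({"url": url, "error": str(exc)})
            continue
        parser = PageSummary()
        parser.feed(html)
        records.append(
            {"url": url, "title": parser.title.strip(), "links": parser.links}
        )
    return records
```

The single cron entry would then run one script that reads the URL list
(from a text file, say) and passes it to crawl().  Adding a site means adding
a line to that list rather than writing another script.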

Chris.