Re: spider

Ok, so the cURL and wget stuff has been mentioned, but I don't think that 
really addresses your question.  You didn't ask what the "best way" to do 
this is; you asked how you would do it in PHP.

Here's what I would consider to be the 'theory' of the exercise:

* Do we obey robots.txt?  If so, get the spec for that and keep it in mind 
while you're building your code (the sketch after this list includes a very 
rough robots.txt reader).

* Get parameters:  starting point, how 'deep' to go, whether to stay within 
that domain/subdomain or leave it, and whether we're retrieving content or 
just structure/info (like permissions).

* Go to the starting point and retrieve the links.  Figure out the best way 
to 'crawl'; maybe check out recursive directory crawling routines.  I made a 
duplicate file scanner in PHP and probably did it the hard way, but basically 
I created an array of directory paths and marked them scanned or not.  In 
this case, if you've already scanned a link, you can skip it.  Keep going 
through your link list until none come up as "not scanned".  So: retrieve 
links from the current page, add links to the scan array (if they match the 
allowable criteria), check the link array to see if anything is unscanned.  
Lather, rinse, repeat.  There's a rough sketch of this after the list.
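
Here's a rough sketch in PHP of the scan-array idea, with a very crude 
robots.txt reader bolted on for the first bullet.  Treat it as a starting 
point only: the URL is a placeholder, the function names are just ones I 
made up, it assumes allow_url_fopen is enabled, and it skips relative links 
and depth limits entirely.

<?php
// Very rough robots.txt reader: collect the Disallow paths.  Ignores
// User-agent sections and wildcards, so it's only a first approximation.
function getDisallowedPaths($baseUrl)
{
    $robots = @file_get_contents(rtrim($baseUrl, '/') . '/robots.txt');
    if ($robots === false) {
        return array();
    }
    $paths = array();
    foreach (preg_split('/\r?\n/', $robots) as $line) {
        if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m)) {
            $paths[] = $m[1];
        }
    }
    return $paths;
}

// Return every href found on a page.
function getLinks($url)
{
    $html = @file_get_contents($url);
    if ($html === false) {
        return array();
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html);          // ignore warnings from sloppy markup
    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;        // a real crawler would make these absolute
        }
    }
    return $links;
}

$startUrl   = 'http://www.example.com/';      // placeholder starting point
$startHost  = parse_url($startUrl, PHP_URL_HOST);
$disallowed = getDisallowedPaths($startUrl);

// URL => scanned yet?  Keep looping until nothing is marked "not scanned".
$toScan = array($startUrl => false);

while (in_array(false, $toScan, true)) {
    foreach (array_keys($toScan, false, true) as $url) {
        $toScan[$url] = true;                 // mark as scanned

        foreach (getLinks($url) as $link) {
            // Skip links we've already seen and links off the starting host
            // (relative links get dropped too in this simplified version).
            if (isset($toScan[$link])) {
                continue;
            }
            if (parse_url($link, PHP_URL_HOST) !== $startHost) {
                continue;
            }
            // Honor robots.txt Disallow paths.
            $path = parse_url($link, PHP_URL_PATH);
            foreach ($disallowed as $d) {
                if ($path !== null && strpos($path, $d) === 0) {
                    continue 2;
                }
            }
            $toScan[$link] = false;           // queue it for scanning
        }
    }
}

print_r(array_keys($toScan));    // every page reached from the starting point
?>

The scan array doubles as the "already visited" check, which is what keeps 
the loop from chasing the same links forever.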

Since you're talking about a remote site, you're limited to whatever 
information the remote web server is willing to give you.  You might get an 
"HTTP Auth Required" (401) response, but there probably won't be a link to 
something that is in a restricted directory in the first place, so that's 
kind of a moot point.
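
If you just want to see what the server says about a particular URL while 
you crawl, the status line is enough.  A minimal example (the URL is a 
placeholder):

<?php
// get_headers() returns the response headers; element 0 is the status line.
// A 401 here is the "HTTP Auth Required" case mentioned above.
$headers = @get_headers('http://www.example.com/private/');
if ($headers !== false) {
    echo $headers[0] . "\n";    // e.g. "HTTP/1.1 401 Authorization Required"
}
?>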

To get directory permissions, you'd really have to go through either FTP or a 
terminal connection (telnet, etc).
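
If you do have FTP access, the permission bits are right there in the raw 
listing.  A quick sketch, assuming the FTP extension is available (host, 
login, and path are placeholders):

<?php
$conn = ftp_connect('ftp.example.com');
if ($conn && ftp_login($conn, 'username', 'password')) {
    $list = ftp_rawlist($conn, '/public_html');
    if (is_array($list)) {
        foreach ($list as $line) {
            // Lines look like: "drwxrwxrwx  2 user group 4096 Mar 21 12:00 uploads"
            echo $line . "\n";
            if (preg_match('/^d.{7}w/', $line)) {
                echo "    ^ directory is world-writable\n";
            }
        }
    }
    ftp_close($conn);
}
?>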

If you're doing it as a coding exercise, have fun.  If not, there are 
already tons of tools that could help you.  Years ago I used a program called 
Teleport Pro (for Windows) that could retrieve a site, create a sitemap, or 
any number of variations.  I did use it once or twice to check for bad links 
on a website (bad = going to invalid pages, but also bad = pointing to 
old/restricted parts of the site without requiring authorization).

That's a really basic site security check.  I know you know you should 
follow good coding practices and run more intensive site security scanning 
tools.

-TG

----- Original Message -----
From: tedd <tedd@xxxxxxxxxxxx>
To: php-general@xxxxxxxxxxxxx
Date: Fri, 21 Mar 2008 13:52:38 -0400
Subject:  spider

> Hi gang:
> 
> How do you spider a remote web site in php?
> 
> I get the general idea, which is to take the root page, strip out the 
> links and repeat the process on those links. But, what's the code? 
> Does anyone have an example they can share or a direction for me to 
> take?
> 
> Also, is there a way to spider through a remote web site gathering 
> directory permissions?
> 
> I know there are applications, such as Site-sucker, that will travel 
> a remote web site looking for anything that it can download and if 
> found, do so. But is there a way to determine what the permissions 
> are for those directories?
> 
> If not, can one attempt to write a file and record the 
> failures/successes (0777 directories)?
> 
> What I am trying to do is to develop a way to test if a web site is 
> secure or not. I'm not trying to develop evil code, but if it can be 
> done then I want to know how.
> 
> Thanks and Cheers,
> 
> tedd



