Ok, so the cURL and wget stuff has been mentioned, but I don't think that really addresses your question. You didn't ask what the "best way" to do this is, you asked how you would do it in PHP.

Here's what I would consider to be the 'theory' of the exercise:

* Do we obey robots.txt? If so, get the specs for that and keep them in mind while you build your code.
* Get parameters: starting point, how 'deep' to go, and whether to stay within that domain/subdomain or whether it's ok to leave. Are we retrieving content, or just structure or info (like permissions)?
* Go to the starting point and retrieve links.

Figure out the best way to 'crawl'. Maybe check out recursive directory crawling routines. I made a duplicate file scanner in PHP and probably did it the hard way, but basically I created an array of directory paths and marked them scanned or not. In this case, if you already scanned a link, you could skip it. Keep going through your link list until none come up as "not scanned".

So: retrieve links from the current page, add links to the scan array (if they match the allowable criteria), check the link array to see if anything is unscanned. Lather, rinse, repeat.

Since you're talking about a remote site, you're limited to whatever info the remote web server is going to give you. You might get an "HTTP Auth Required" (401) response, but there probably won't be a link to something that is in a restricted directory, so that's kind of a moot point. To get directory permissions, you'd really have to go through either FTP or a terminal connection (telnet, etc).

If you're doing it as a coding exercise, have fun. If not, there are already tons of tools that could help you. Years ago I used a program called Teleport Pro (for Windows) that could either retrieve the site or create a sitemap, or any number of variations. I did use it once or twice to check for bad links on a website (bad = going to invalid pages, but also bad = pointing to old/restricted parts of the website without requiring authorization).
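Going back to the scan-array idea for a second, here's a rough sketch of what that loop might look like in PHP. Treat it as an illustration and not a finished spider: the function names (extract_links, resolve_url, crawl) and the $fetch callback are made up for this example, the regex link extraction is crude (a real HTML parser would be more robust), relative paths with "../" are not normalized, and robots.txt is not checked at all.

```php
<?php
// Hypothetical helper: pull href values out of <a> tags with a crude regex.
function extract_links(string $html): array
{
    preg_match_all('/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is', $html, $m);
    return array_map(fn($h) => html_entity_decode(trim($h)), $m[2]);
}

// Hypothetical helper: turn an href into an absolute http(s) URL, or null.
function resolve_url(string $base, string $href): ?string
{
    // Drop any #fragment; it points into the same page.
    $href = preg_replace('/#.*$/', '', $href);
    if ($href === '') {
        return null;
    }
    // Already has a scheme: keep http(s), skip mailto:, javascript:, etc.
    if (preg_match('/^[a-z][a-z0-9+.-]*:/i', $href)) {
        return preg_match('/^https?:/i', $href) ? $href : null;
    }
    $p = parse_url($base);
    if (!isset($p['scheme'], $p['host'])) {
        return null;
    }
    // Protocol-relative link such as //host/path.
    if (strpos($href, '//') === 0) {
        return $p['scheme'] . ':' . $href;
    }
    $root = $p['scheme'] . '://' . $p['host']
          . (isset($p['port']) ? ':' . $p['port'] : '');
    if ($href[0] === '/') {
        return $root . $href;
    }
    // Relative path: resolve against the directory of the base URL.
    $path = $p['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}

// $fetch is any callable that returns the HTML for a URL, or false on failure.
function crawl(string $start, callable $fetch, int $maxDepth = 2): array
{
    $host    = parse_url($start, PHP_URL_HOST);
    $scanned = [$start => false];   // URL => has it been scanned yet?
    $depth   = [$start => 0];       // URL => link distance from the start

    // Keep going until nothing is left marked "not scanned".
    while (($url = array_search(false, $scanned, true)) !== false) {
        $scanned[$url] = true;
        $html = $fetch($url);
        // Pages at the depth limit are fetched, but their links are not followed.
        if ($html === false || $depth[$url] >= $maxDepth) {
            continue;
        }
        foreach (extract_links($html) as $href) {
            $link = resolve_url($url, $href);
            // Allowable criteria in this sketch: same host and not seen before.
            if ($link === null || isset($scanned[$link])
                || parse_url($link, PHP_URL_HOST) !== $host) {
                continue;
            }
            $scanned[$link] = false;
            $depth[$link]   = $depth[$url] + 1;
        }
    }
    return array_keys($scanned);
}
```

To point it at a real site you'd pass something like `fn($u) => @file_get_contents($u)` as $fetch (that assumes allow_url_fopen is on), or wrap cURL in a function. Keeping the fetcher separate also means you can try the loop against a plain array of fake pages before letting it loose on somebody's server.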
Checking for bad links like that is only a really basic site security check, though. I know you know that you should use good coding practices and more intensive site security scanning tools.

-TG

----- Original Message -----
From: tedd <tedd@xxxxxxxxxxxx>
To: php-general@xxxxxxxxxxxxx
Date: Fri, 21 Mar 2008 13:52:38 -0400
Subject: spider

> Hi gang:
>
> How do you spider a remote web site in php?
>
> I get the general idea, which is to take the root page, strip out the
> links and repeat the process on those links. But, what's the code?
> Does anyone have an example they can share or a direction for me to
> take?
>
> Also, is there a way to spider through a remote web site gathering
> directory permissions?
>
> I know there are applications, such as Site-sucker, that will travel
> a remote web site looking for anything that it can download and if
> found, do so. But is there a way to determine what the permissions
> are for those directories?
>
> If not, can one attempt to write a file and record the
> failures/successes (0777 directories)?
>
> What I am trying to do is to develop a way to test if a web site is
> secure or not. I'm not trying to develop evil code, but if it can be
> done then I want to know how.
>
> Thanks and Cheers,
>
> tedd

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php