I'm writing a custom search engine for my site; it seemed easier than
modifying sphyder (what I currently use) to do what I want, especially
since sphyder has a lot of features that aren't personally of use to me.
One of the things I want to do when I index is list external links and
check them.
The idea is to have curl download just the headers, and not the content,
from external links.
This is what I have as part of my class to do that -
function meta($url) {
    $process = curl_init($url);
    curl_setopt($process, CURLOPT_CONNECTTIMEOUT, 15);
    curl_setopt($process, CURLOPT_TIMEOUT, 20);
    curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers);
    curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent);
    curl_setopt($process, CURLOPT_NOBODY, true);  // HEAD request - headers only, no body
    curl_setopt($process, CURLOPT_HEADER, 0);     // don't include the header block in the output
    $fetch = curl_exec($process);
    $return = array();
    $return[] = curl_getinfo($process, CURLINFO_HTTP_CODE);
    // explode() instead of the deprecated split(); drop any "; charset=..." suffix
    $meta = explode(';', curl_getinfo($process, CURLINFO_CONTENT_TYPE));
    $return[] = $meta[0];
    curl_close($process);
    return $return;
}
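
For reference, this is roughly how I'd call it from the indexer (just a
sketch - the $indexer instance and the example URL are placeholders):

// hypothetical usage; meta() returns array(status code, mime type)
list($status, $mime) = $indexer->meta('http://example.com/some/page.html');
if ($status == 301 || $status == 404) {
    // flag the link as moved or gone so I can fix it when I re-index
}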
I am under the impression that
curl_setopt($process, CURLOPT_NOBODY, true);
does what I want - but the curl docs can be confusing.
Will that work to download just the headers needed to get the HTTP
status code and MIME type, without grabbing any content?
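
From what I can tell, CURLOPT_NOBODY makes curl send a HEAD request
instead of a GET, so no body should ever come back. One quick way I
thought of to sanity-check that (a throwaway test, not part of the
class) is to look at how many bytes curl says it downloaded:

$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_NOBODY, true);  // should turn the request into a HEAD
curl_exec($ch);
// if only headers came back, the downloaded size should be 0
echo curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD), "\n";
curl_close($ch);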
And secondly, will making 40 or so connections to the same remote site
just to grab headers and check for moved files (there are two sites I
link to quite a bit, with permission) possibly cause issues with their
server software? It doesn't seem to cause problems for my own server
(Apache on Linux), but that's just me, and I'm not positive curl
actually stopped the download after getting the last HTTP header.
Pages on those sites do move as the taxonomy changes, and the people in
charge don't seem to keep 301 redirects in place when they reorganize,
so I do need to check with some frequency, but I don't want to cause
problems.
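
To be on the safer side, I was thinking I could at least reuse one curl
handle per site (so keep-alive can kick in) and pause between requests,
something along these lines (just a sketch - $urls is assumed to be the
list of links for one site):

$process = curl_init();
curl_setopt($process, CURLOPT_NOBODY, true);
curl_setopt($process, CURLOPT_CONNECTTIMEOUT, 15);
curl_setopt($process, CURLOPT_TIMEOUT, 20);
$status = array();
foreach ($urls as $url) {
    curl_setopt($process, CURLOPT_URL, $url);  // reuse the same handle for every URL
    curl_exec($process);
    $status[$url] = curl_getinfo($process, CURLINFO_HTTP_CODE);
    usleep(500000);  // half a second between requests so I don't hammer their server
}
curl_close($process);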
Thanks for suggestions.