curl spider and being a good citizen

I'm writing a custom search engine for my site. It seemed easier than modifying sphyder (what I currently use) to do what I want, especially since sphyder has a lot of features that are of no use to me.

One of the things I want to do when I index is list external links and check them.

The idea is to have curl download just the headers, not the content, from external links.

This is what I have as part of my class to do that -

function meta($url) {
      $process = curl_init($url);
      curl_setopt($process, CURLOPT_CONNECTTIMEOUT, 15);
      curl_setopt($process, CURLOPT_TIMEOUT, 20);

      curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers);
      curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent);
      curl_setopt($process, CURLOPT_NOBODY, true);
      curl_setopt($process, CURLOPT_HEADER, 0);

      $fetch = curl_exec($process);

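      $return = array();
      // HTTP status code of the response
      $return[] = curl_getinfo($process, CURLINFO_HTTP_CODE);
      // explode() replaces the deprecated split(); the cast guards against a missing Content-Type
      $meta = explode(';', (string) curl_getinfo($process, CURLINFO_CONTENT_TYPE));
      // MIME type without any charset parameter
      $return[] = trim($meta[0]);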

      curl_close($process);
      return $return;
      }

I am under the impression that
curl_setopt($process, CURLOPT_NOBODY, true);

does what I want - but the curl docs can be confusing.

Will that download just the headers needed to get the HTTP status code and MIME type, without fetching the content?
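A minimal sketch of one way to check this, assuming the curl extension is loaded. The PHP manual states that CURLOPT_NOBODY sets the request method to HEAD, and CURLINFO_SIZE_DOWNLOAD reports the number of body bytes received, so a value of 0 there suggests no content was fetched. The helper names mime_type_of() and head_check() and the default user agent string are hypothetical, not part of the class above.

```php
<?php
// Hypothetical helper: reduce a raw Content-Type value to its bare MIME type.
function mime_type_of($contentType) {
    if (!is_string($contentType) || $contentType === '') {
        return null; // curl reports no Content-Type for some responses
    }
    $parts = explode(';', $contentType);
    return strtolower(trim($parts[0]));
}

// Hypothetical helper: request headers only and report what was transferred.
function head_check($url, $userAgent = 'example-link-checker/0.1') {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // headers only (HEAD request)
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // do not echo anything
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
    curl_setopt($ch, CURLOPT_TIMEOUT, 20);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_exec($ch);

    $result = array(
        'status'     => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'mime'       => mime_type_of(curl_getinfo($ch, CURLINFO_CONTENT_TYPE)),
        'body_bytes' => curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD),
    );
    curl_close($ch);
    return $result;
}
```

Calling head_check() on a page of my own site and looking at 'body_bytes' would show whether any content came across. Some servers answer HEAD requests differently from GET, so the status code may not always match what a browser sees.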

Secondly, could making 40 or so connections to the same remote site, just to grab headers and check for moved files, cause issues with their server software? (There are two sites I link to quite a bit, with permission.) It doesn't seem to cause problems on my own server (Apache on Linux), but that's only my setup, and I'm not positive curl stopped the download after receiving the last HTTP header.
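One way to keep the load on a single site low is to group the links by host and pause between requests to the same host. The following is only a sketch of that idea: the function names group_by_host() and check_politely() and the two-second default delay are assumptions, and $checker stands for any callable that checks one URL.

```php
<?php
// Hypothetical helper: bucket URLs by host so each site can be paced separately.
function group_by_host(array $urls) {
    $groups = array();
    foreach ($urls as $url) {
        $host = parse_url($url, PHP_URL_HOST);
        if (!is_string($host) || $host === '') {
            continue; // skip anything without a usable host
        }
        $groups[strtolower($host)][] = $url;
    }
    return $groups;
}

// Hypothetical helper: run $checker on every URL, sleeping between
// consecutive requests to the same host.
function check_politely(array $urls, $checker, $delaySeconds = 2.0) {
    $results = array();
    foreach (group_by_host($urls) as $host => $hostUrls) {
        foreach ($hostUrls as $i => $url) {
            if ($i > 0 && $delaySeconds > 0) {
                usleep((int) round($delaySeconds * 1000000));
            }
            $results[$url] = call_user_func($checker, $url);
        }
    }
    return $results;
}
```

Inside the class, something like check_politely($links, array($this, 'meta')) would run the existing meta() method over every link, one request at a time per host.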

Pages on those sites do move as the taxonomy changes, and the people in charge don't seem to keep 301 redirects in place when they reorganize, so I do need to check fairly often, but I don't want to cause problems.

Thanks for suggestions.

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

