I'm writing a custom search engine for my site; it seemed easier than
modifying sphyder (what I currently use) to do what I want, especially
since sphyder has a lot of features that aren't personally of use to me.
One of the things I want to do when I index is list external links and
check them.
The idea is to have curl download just the headers, and not the content,
from external links.
This is what I have as part of my class to do that -
function meta($url) {
    $process = curl_init($url);
    curl_setopt($process, CURLOPT_CONNECTTIMEOUT, 15);
    curl_setopt($process, CURLOPT_TIMEOUT, 20);
    curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers);
    curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent);
    curl_setopt($process, CURLOPT_NOBODY, true);  // HEAD request - headers only, no body
    curl_setopt($process, CURLOPT_HEADER, 0);     // don't include the header block in the output
    $fetch = curl_exec($process);
    $return = array();
    $return[] = curl_getinfo($process, CURLINFO_HTTP_CODE);
    // explode() instead of the deprecated split(); drop any "; charset=..." suffix
    $meta = explode(';', curl_getinfo($process, CURLINFO_CONTENT_TYPE));
    $return[] = $meta[0];
    curl_close($process);
    return $return;
}
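
For reference, this is roughly how I'd call it from the indexer (just a
sketch - the $indexer instance and the example URL are placeholders):

// hypothetical usage; meta() returns array(status code, mime type)
list($status, $mime) = $indexer->meta('http://example.com/some/page.html');
if ($status == 301 || $status == 404) {
    // flag the link as moved or gone so I can fix it when I re-index
}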
I am under the impression that
curl_setopt($process, CURLOPT_NOBODY, true);
does what I want - but the curl docs can be confusing.
Will that work to download just the headers needed to get the HTTP
status code and MIME type, without grabbing any content?
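
From what I can tell, CURLOPT_NOBODY makes curl send a HEAD request
instead of a GET, so no body should ever come back. One quick way I
thought of to sanity-check that (a throwaway test, not part of the
class) is to look at how many bytes curl says it downloaded:

$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_NOBODY, true);  // should turn the request into a HEAD
curl_exec($ch);
// if only headers came back, the downloaded size should be 0
echo curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD), "\n";
curl_close($ch);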
And secondly, will making 40 or so connections to the same remote site
just to grab headers and check for moved files (there are two sites I
link to quite a bit, with permission) possibly cause issues with their
server software? It doesn't seem to cause problems for my own server
(Apache on Linux), but that's just me, and I'm not positive curl
actually stopped the download after getting the last HTTP header.
Pages on those sites do move as the taxonomy changes, and the people in
charge don't seem to keep 301 redirects in place when they reorganize,
so I do need to check with some frequency, but I don't want to cause
problems.
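
To be on the safer side, I was thinking I could at least reuse one curl
handle per site (so keep-alive can kick in) and pause between requests,
something along these lines (just a sketch - $urls is assumed to be the
list of links for one site):

$process = curl_init();
curl_setopt($process, CURLOPT_NOBODY, true);
curl_setopt($process, CURLOPT_CONNECTTIMEOUT, 15);
curl_setopt($process, CURLOPT_TIMEOUT, 20);
$status = array();
foreach ($urls as $url) {
    curl_setopt($process, CURLOPT_URL, $url);  // reuse the same handle for every URL
    curl_exec($process);
    $status[$url] = curl_getinfo($process, CURLINFO_HTTP_CODE);
    usleep(500000);  // half a second between requests so I don't hammer their server
}
curl_close($process);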
Thanks for suggestions.