Re: Reverse Proxy and Googlebot

Simon Waters wrote:
On Monday 06 October 2008 11:55:41 Amos Jeffries wrote:
Simon Waters wrote:
Seeing issues with Googlebots retrying on large PDF files.

Apache logs a 200 for the HTTP/1.0 requests.

Squid logs an HTTP/1.1 request that looks to have stopped early (3MB out
of 13MB).

This pattern is repeated with slight variation in the amount of data
served to the Googlebots, and after about 14 attempts it gives up and
goes away.

Anyone else seeing same?
Not seeing this, but... do you have correct Expires: and Cache-Control
headers on those .pdf files? And is GoogleBot not obeying them?
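
For comparison, this is roughly the sort of thing I'd expect to see on a
cacheable PDF (the dates, sizes and max-age below are only illustrative,
not your actual values):

    HTTP/1.1 200 OK
    Content-Type: application/pdf
    Content-Length: 13631488
    Last-Modified: Mon, 29 Sep 2008 10:15:00 GMT
    ETag: "3a81-4588d59b"
    Expires: Tue, 07 Oct 2008 11:55:41 GMT
    Cache-Control: public, max-age=86400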

Yes, ETag and Expires headers - I don't think this is Squid-specific, since I saw similar behaviour from Googlebots before there was a reverse proxy involved.

I agree. I just thought it might be their way of self-detecting unchanged content if the headers were missing. But it seems not.


It does have a "Vary: Host" header. I know how it got there, but I'm not 100% sure what effect, if any, it has on caching; I'm hoping everything is ignoring it.

A different copy gets cached for each difference in the headers listed in Vary:. ETag should override that by indicating that two variants are identical.
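
Roughly speaking (hostnames here are just placeholders), with "Vary: Host"
these two requests for the same URL are stored as separate variants of the
object:

    GET /docs/big.pdf HTTP/1.1
    Host: www.example.com
        -> cached as variant 1

    GET /docs/big.pdf HTTP/1.1
    Host: example.com
        -> cached as variant 2 of the same URL

If every request arrives with the same Host header, only one variant ever
gets created, so in practice it should be harmless here.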

Again, that may be relevant in general, but it shouldn't be relevant to this request (since it is all from the same host).

http://groups.google.com/group/Google_Webmaster_Help-Indexing/browse_thread/thread/f8ecc41ac9e5bc11

I just thought that, because there is a Squid reverse proxy in front of the server, I had more information on what was going wrong, and that others here might have seen something similar.

It looks like the Googlebot is timing out and retrying. Quite why it is not getting the file from the cache is unclear at this point, but since I can't control the Googlebot I can't reproduce the problem with more logging. It also doesn't seem to back off at all when it fails, which I think is the real issue here. Google showed some interest last time, but never got back to me.

I got TCP_MISS:FIRST_UP_PARENT logged in Squid for all these requests.
Today, when I checked the headers using wget, I see TCP_REFRESH_HIT:FIRST_UP_PARENT and TCP_HIT:NONE, so Squid seems to be doing something sensible with the file usually; it's just the Googlebots it dislikes.
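
For what it's worth, the check is nothing fancy - roughly this (the URL,
timestamps and sizes below are placeholders, only the shape of the line
matters), then reading the matching entry in Squid's access.log:

    wget -S -O /dev/null http://www.example.com/docs/big.pdf

    access.log:
    1223290000.123    234 127.0.0.1 TCP_HIT/200 13631488 GET http://www.example.com/docs/big.pdf - NONE/- application/pdf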

Would you expect Squid to cache the first 3MB if the HTTP/1.1 request stopped early?

Not separately from the rest of the file. You currently still need quick_abort and the related settings tuned to always fetch the whole object for Squid to cache it.

Hmm, come to think of it, that might actually fix the issue for you. If not, it can be unset after the experiment.
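
A minimal sketch of what I mean for squid.conf (the -1 setting tells Squid
to always finish fetching an object once it has started, no matter how much
is left when the client goes away):

    # keep fetching aborted downloads to completion so they can be cached
    quick_abort_min -1 KB

The related quick_abort_max and quick_abort_pct directives only matter if
you later want to limit this again.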

Amos
--
Please use Squid 2.7.STABLE4 or 3.0.STABLE9
