
Re: Mime.conf


 



Jason Spegal wrote:
On 1/18/2010 8:55 PM, Amos Jeffries wrote:
On Mon, 18 Jan 2010 13:18:20 -0500, Jason Spegal<jspegal@xxxxxxxxxxx>
wrote:
Alrighty. Did some more research and found a solution to my problem
which leads to another issue.

My problem: I was trying to serve a proxy auto-configuration file
(wpad.dat) from an internal webserver (http://wpad/). When a client
downstream of Squid fetched it, the file was served with the mime
type chemical/x-mopac-input. When I went direct to the webserver it
served the correct mime type (which I had forced it to).

Solution: On Gentoo squid is using the /etc/mime.types file to guess the
mime type instead of what the remote webserver is saying the file is. I
Point 1: Squid does not do that. Does not use mime.types at all.

Content-Type headers are passed through unchanged from what is received
unless administratively changed by header_replace.
Taken from access.log

Before changing mime.types

1263657638.249 0 10.10.122.248 TCP_MEM_HIT/200 670 GET http://wpad/wpad.dat - NONE/- chemical/x-mopac-input
1263661679.834 0 10.10.122.239 TCP_MEM_HIT/200 670 GET http://wpad/wpad.dat - NONE/- chemical/x-mopac-input
1263662648.054 9 10.10.122.248 TCP_CLIENT_REFRESH_MISS/200 654 GET http://wpad/wpad.dat - DIRECT/10.10.122.250 chemical/x-mopac-input
1263662742.482 4 10.10.122.248 TCP_CLIENT_REFRESH_MISS/200 654 GET http://wpad/wpad.dat - DIRECT/10.10.122.250 chemical/x-mopac-input
1263662752.973 0 10.10.122.248 TCP_IMS_HIT/304 264 GET http://wpad/wpad.dat - NONE/- chemical/x-mopac-input
1263664740.203 0 10.10.122.248 TCP_MEM_HIT/200 669 GET http://wpad/wpad.dat - NONE/- chemical/x-mopac-input

After changing mime.types

1263834369.649 1 10.10.122.241 TCP_REFRESH_UNMODIFIED/200 647 GET http://wpad/wpad.dat - DIRECT/10.10.122.250 application/x-ns-proxy-autoconfig
1263834539.719 0 10.10.122.241 TCP_MEM_HIT/200 657 GET http://wpad/wpad.dat - NONE/- application/x-ns-proxy-autoconfig
1263834791.576 0 10.10.122.241 TCP_MEM_HIT/200 657 GET http://wpad/wpad.dat - NONE/- application/x-ns-proxy-autoconfig
1263834822.423 0 10.10.122.241 TCP_MEM_HIT/200 657 GET http://wpad/wpad.dat - NONE/- application/x-ns-proxy-autoconfig

This log contains what the web server passed Squid. Not what Squid passed the clients.
Q: Is the WPAD web server on the same box where you are altering mime.types?
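
To make it easier to see which field of an access.log line carries the content type, here is a minimal Python sketch that splits one entry into named fields. It assumes the default native log format with ten whitespace-separated fields; the class name, field names, and the parse_line helper are labels chosen for this illustration, not anything Squid defines.

```python
from typing import NamedTuple


class AccessLogEntry(NamedTuple):
    """One line of a native-format access.log (field names are illustrative)."""
    timestamp: float      # seconds since the epoch, with milliseconds
    elapsed_ms: int       # time spent handling the request
    client: str           # client IP address
    result_code: str      # e.g. TCP_MEM_HIT
    http_status: int      # e.g. 200
    size: int             # bytes delivered to the client
    method: str
    url: str
    user: str             # "-" when no user name is known
    hierarchy: str        # e.g. DIRECT/10.10.122.250 or NONE/-
    content_type: str     # last field: the logged Content-Type


def parse_line(line: str) -> AccessLogEntry:
    """Split a native-format line; assumes exactly ten fields and no spaces in the URL."""
    (ts, elapsed, client, code_status, size,
     method, url, user, hierarchy, ctype) = line.split()
    result_code, status = code_status.split("/")
    return AccessLogEntry(float(ts), int(elapsed), client, result_code,
                          int(status), int(size), method, url, user,
                          hierarchy, ctype)


sample = ("1263662648.054 9 10.10.122.248 TCP_CLIENT_REFRESH_MISS/200 654 "
          "GET http://wpad/wpad.dat - DIRECT/10.10.122.250 chemical/x-mopac-input")
entry = parse_line(sample)
print(entry.result_code, entry.hierarchy, entry.content_type)
```

Entries whose hierarchy field starts with DIRECT were fetched from the origin server at that moment, so they are the ones worth comparing against what the web server itself reports.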


I just double-checked that (ForceType application/x-ns-proxy-autoconfig) in my Apache vhost config is working correctly. Also, Apache's mime.types file is set up correctly for this particular item.
fixed the file, which I also noticed has several other issues, answering
my other issue: why is 95% of my data being caught in the catch-all
refresh_pattern instead of the mime type ones.
Point 2: Squid does not accept mime types in the refresh_pattern
directive.
This explains a few things.
Are you _sure_ that:
  * the PAC file is not cached with old headers from before your changes?
Yes

I can only get Squid to produce the wrong mime type by altering refresh_pattern to the values you have in your config. With that done Squid very consistently insists on producing a HIT with the first mime header received, no matter how they change on the server or what cache controls are passed to Squid by the server.


  * the PAC file is actually being fetched from the web server you are
expecting?
Yes
  * this is an official build of Squid?
Yes, see below.
  * nobody has applied third-party patches to it?
(none of the official Gentoo patches change mime.types.
http://sources.gentoo.org/viewcvs.py/gentoo-x86/net-proxy/squid/files/)

Fairly sure.
What headers does this produce when run on the Squid box?
   squidclient -v -h wpad -p 80 /wpad.dat


I'm posting version and configuration at the bottom of this email. Refresh patterns will be changed after this email is sent. This is a standard gentoo install with the epoll USE flag.

[ebuild R ] net-proxy/squid-3.0.19 USE="caps epoll ldap mysql pam samba sqlite ssl -icap-client (-ipf-transparent) -kerberos -kqueue -logrotate* -nis (-pf-transparent) -postgres -radius -sasl (-selinux) -snmp -zero-penalty-hit" 0 kB

Okay. So no reason whatsoever why the mime type is changing.


(squidclient -v -h wpad -p 80 /wpad.dat) yields

headers: 'GET /wpad.dat HTTP/1.0
Accept: */*

'
HTTP/1.1 404 Not Found
Date: Tue, 19 Jan 2010 03:27:19 GMT
Server: Apache
Content-Length: 265
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /wpad.dat was not found on this server.</p>
<hr>
<address>Apache Server at localhost Port 80</address>
</body></html>


So I used GET instead.

(GET http://wpad/wpad.dat -USed)

GET http://wpad/wpad.dat
User-Agent: lwp-request/5.827 libwww-perl/5.831

GET http://wpad/wpad.dat --> 200 OK
Connection: close
Date: Tue, 19 Jan 2010 03:28:59 GMT
Accept-Ranges: bytes
Age: 412
ETag: "736a9e-119-47d6be3f06d80"
Server: Apache
Content-Length: 281
Content-Type: application/x-ns-proxy-autoconfig
Last-Modified: Mon, 18 Jan 2010 08:10:46 GMT
Client-Date: Tue, 19 Jan 2010 03:28:59 GMT
Client-Peer: 10.10.122.250:80
Client-Response-Num: 1

That reply appears to have gone through Squid. I'm particularly interested in the headers going _into_ Squid.

Try this as well and compare the result to the above set.
  squidclient -v -h wpad -p 80 -j wpad /wpad.dat


Of note for other Gentoo & Debian users, from mime.types:
# This file is part of the app-misc/mime-types package, which is based on debian's
"mime-support".

So my question is now; how do I force squid to use the mime-type
delivered by the remote webserver without killing mime.types and thus
breaking my system in new and unexpected ways?
The official releases of Squid pass content-type headers through
unchanged. Something is broken.
On 1/15/2010 8:22 PM, Amos Jeffries wrote:
Jason Spegal wrote:
Is mime.conf what is used by refresh_pattern when mime types are used
for the regex?
No.

refresh_pattern uses a text regex against the requested URL string.

mime.conf is used by FTP and Gopher directory display to show the
icons.
Amos
Squid Cache: Version 3.0.STABLE19
configure options: '--prefix=/usr' '--build=i686-pc-linux-gnu' '--host=i686-pc-linux-gnu' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--datadir=/usr/share' '--sysconfdir=/etc' '--localstatedir=/var/lib' '--sysconfdir=/etc/squid' '--libexecdir=/usr/libexec/squid' '--localstatedir=/var' '--datadir=/usr/share/squid' '--with-default-user=squid' '--enable-auth=basic,digest,negotiate,ntlm' '--enable-removal-policies=lru,heap' '--enable-digest-auth-helpers=password' '--enable-basic-auth-helpers=DB,PAM,LDAP,SMB,multi-domain-NTLM,getpwnam,NCSA,MSNT' '--enable-external-acl-helpers=ldap_group,wbinfo_group,ip_user,session,unix_group' '--enable-ntlm-auth-helpers=SMB,fakeauth' '--enable-negotiate-auth-helpers=' '--enable-useragent-log' '--enable-cache-digests' '--enable-delay-pools' '--enable-referer-log' '--enable-arp-acl' '--with-large-files' '--with-filedescriptors=8192' '--enable-caps' '--disable-snmp' '--enable-ssl' '--disable-icap-client' '--enable-http-violations' '--with-pthreads' '--with-aio' '--enable-storeio=ufs,diskd,aufs,null' '--enable-linux-netfilter' '--enable-epoll' 'build_alias=i686-pc-linux-gnu' 'host_alias=i686-pc-linux-gnu' 'CC=i686-pc-linux-gnu-gcc' 'CFLAGS=-march=pentium4m -O2 -pipe -fomit-frame-pointer' 'LDFLAGS=-Wl,-O1' 'CXXFLAGS=-march=pentium4m -O2 -pipe -fomit-frame-pointer'


 From squid.conf:

<snip>




Okay, before reading further:

Please don't take any of the following personally. I have no idea who configured the Squid. Or what company policy restraints they were working under. I do know that some policies and external websites do force extreme measures.

I make the following statements with three hats on:
* an Internet citizen who wants websites to load reliably with the right and current content shown
* a webmaster who spends considerable time working to make clients' dynamic websites cacheable and efficient (thus the angst, if it shows too thick)
* a Squid developer who spends considerable time trying to make Squid do things properly according to the HTTP protocol RFC and helping people leverage that for faster networks.





acl dynamic_content urlpath_regex -i \.(asp|aspx|php|pl|xml|rss|kml|cgi|py|pyc) #(\?.*)?$

Hmm, any URL containing a "#" at the end. Weird thing to be looking for.

NP: The '#' sign is never sent in transmitted URLs. It's an internal tag private to the browser. When some data needs to use that sign it is required to always be URL-encoded for transmission.
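
A small Python sketch of that point, using only the standard library's urllib.parse. The example URL is made up; the code shows that the fragment after '#' is split off from the part that would go on the wire, and that a literal '#' in data has to be percent-encoded.

```python
from urllib.parse import quote, urlsplit

# Hypothetical URL as a user might see it in the browser's address bar.
parts = urlsplit("http://example.com/page.php#section-2")

# What would go into the request line: path plus query, never the fragment.
request_target = parts.path + ("?" + parts.query if parts.query else "")

# A literal '#' that is part of the data must be sent percent-encoded.
encoded_hash = quote("#", safe="")

print(parts.fragment)    # the browser-private part
print(request_target)    # what a proxy could actually see
print(encoded_hash)
```

So a regex that requires a literal '#' in the URL path has nothing to match against in normal traffic.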

acl dynamic_content urlpath_regex -i http://audio*pandora.com/*.mp*

That pattern is broken on so many levels I can't even describe them in less than a page of text. Suffice to say...

It only matches things like:
   http://example.com?urlpath=http://audipandoraZcomZm
or
  http://example.com?urlpath=http://audiooopandoraZcomZmpAIUEHB78GWa

Since...

 '*' means the previous _one_ symbol repeated zero or more times.
      example.com/?http://audiooooooopandora.com///////.mppppppp

 '.' means any symbol at all.
     example.com/?http://audipandoraZcomZm
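
The matching behaviour described above can be checked with a short Python sketch. It uses Python's re module as a stand-in for the regex library Squid is built with (an assumption; the two agree on '*' and '.' for a pattern this simple), and re.IGNORECASE plays the role of the -i flag. The test paths are invented for illustration.

```python
import re

# The pattern exactly as written in the ACL, compiled case-insensitively.
BROKEN = re.compile(r"http://audio*pandora.com/*.mp*", re.IGNORECASE)

# urlpath_regex is applied to the path part of the URL, so these strings
# are paths, not whole URLs.
paths = [
    "/?urlpath=http://audipandoraZcomZm",                # matches: 'o*' and 'p*' may be empty
    "/?http://audiooooooopandora.com///////.mppppppp",   # matches: repeats of 'o', '/', 'p'
    "/access/track.mp3",                                 # no match: path has no 'http://audi...'
]

results = {path: BROKEN.search(path) is not None for path in paths}
for path, matched in results.items():
    print(f"{matched!s:5}  {path}")
```

The last path is the kind of thing the ACL was presumably meant to catch, and it is the one that fails, because the scheme and host name are not part of the URL path.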



acl dynamic_content urlpath_regex -i cgi-bin
cache deny dynamic_content

Well, let's say that once upon a time, whole decades ago in another century, that was recommended by the developers. Since 2.7 and 3.0 came out it is not.

Of course, with the things refresh_pattern is doing, I'd hate to be a customer who gets anything from this proxy's cache.

cache allow all
refresh_pattern -i kh*.google.com/? 43200 80% 259200 ignore-no-cache ignore-private ignore-no-store ignore-auth override-expire override-lastmod ignore-reload
refresh_pattern -i virtualearth.net/? 43200 80% 259200 ignore-no-cache ignore-private ignore-no-store ignore-auth override-expire override-lastmod ignore-reload

Meh. Well, yes, some websites do force radical measures due to their design.

refresh_pattern application/* 43200 80% 259200 ignore-no-cache ignore-private ignore-no-store ignore-auth
refresh_pattern audio/* 43200 80% 259200 ignore-no-cache ignore-private ignore-no-store ignore-auth

I've never seen a website that uses application/ and audio/ in their folder paths. But if your users ever visit one, the pages will be stored for 6 months.

That _may_ catch some java WAR websites which expose the ~/application/name/pages.html path bits. But I would think most of those are hiding behind apache and doing path re-writing.

refresh_pattern images/* 10080 16% 259200 ignore-no-cache ignore-private ignore-no-store ignore-auth override-expire override-lastmod

Any website which uses the standard technique of placing common images into a shared folder:
For example:
  http://example.com/images/spacer.gif

NP: The irony here is that _these_ images are almost guaranteed to have correct long-term cacheability information attached by the originating web server.

refresh_pattern text/* 0 16% 259200 refresh-ims
refresh_pattern video/* 43200 80% 259200 ignore-no-cache ignore-private ignore-no-store ignore-auth

All URLs containing a folder called video/ or text/.
For example:
  http://example.com/video/index.html
  http://example.com/plaintext/index.html
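
A rough illustration of why the mime-type-looking patterns behave this way, sketched in Python rather than Squid itself. It assumes refresh_pattern rules are tried in the order listed with the first regex match against the URL winning, and it uses Python's re module as a stand-in for Squid's regex library (they agree for patterns this simple). The RULES list and the first_match helper are made up for this illustration; the patterns are copied from the quoted squid.conf.

```python
import re

# (pattern, case_insensitive) in the order the rules appear in squid.conf.
RULES = [
    ("kh*.google.com/?", True),
    ("virtualearth.net/?", True),
    ("application/*", False),
    ("audio/*", False),
    ("images/*", False),
    ("text/*", False),
    ("video/*", False),
    (".", False),           # catch-all: any URL with at least one character
]


def first_match(url: str) -> str:
    """Return the first pattern whose regex is found anywhere in the URL."""
    for pattern, case_insensitive in RULES:
        flags = re.IGNORECASE if case_insensitive else 0
        if re.search(pattern, url, flags):
            return pattern
    return ""


for url in (
    "http://example.com/images/spacer.gif",
    "http://example.com/plaintext/index.html",
    "http://example.com/video/index.html",
    "http://wpad/wpad.dat",
):
    print(f"{first_match(url):15} {url}")
```

Note that "text/*" means "text" followed by zero or more slashes, so it matches "plaintext" in a path. The wpad.dat URL contains none of the type-like strings and falls through to the catch-all rule, whatever Content-Type the server sends with it.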

refresh_pattern . 0 80% 259200 ignore-no-cache ignore-private ignore-no-store ignore-auth

So... _everything_ that is not already stored for 6 months ... gets stored for 6 months unless clients explicitly send flush requests with Ctrl+Reload.

Regardless of what the original website is designed for!!!

Be it a captcha security image, someone's bank account details, or a picture of their kitten.

And you are doing this on a transparent proxy.... Pretty much a textbook example of information leak via man-in-the-middle attack.


reply_header_access Pragma deny all
reply_header_access Cache-Control deny all

?? force browsers and downstream caches to think they can store anything and everything?
Careful. This is generally not a good idea.


The effect _overall_ is that most dynamic content passes straight through the proxy and gets cached however the client browser wants to cache it (because you stripped the expiry and privacy information). The rest of the content will be stored in your Squid for very long periods and clients who request new updated data will be sent the old version and told it has not changed.

There will be some overlap in websites which generate static content at shorter intervals (ie facebook, and mailing list archives) from which your clients never seem to get the new versions in a timely manner. Only the rather broken ones which serve static content through very inefficient dynamic re-processors will look right all the time.


deny_info about:blank blocked_sites

oooh nasty. You get a lot of phone calls about the Internet being "down" with no explanation?


Amos
--
Please be using
  Current Stable Squid 2.7.STABLE7 or 3.0.STABLE21
  Current Beta Squid 3.1.0.15
