Jason Spegal wrote:
On 1/18/2010 8:55 PM, Amos Jeffries wrote:
On Mon, 18 Jan 2010 13:18:20 -0500, Jason Spegal<jspegal@xxxxxxxxxxx>
wrote:
Alrighty. I did some more research and found a solution to my problem,
which leads to another issue.
My problem: I was trying to serve a proxy auto-configuration file
(wpad.dat) from an internal webserver (http://wpad/). When a client
downstream fetched it after Squid picked it up, the file was served with
the mime type chemical/x-mopac-input. When I went directly to the
webserver it served the correct mime type (which I had forced it to).
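For context, wpad.dat is a PAC (proxy auto-configuration) file: a small piece of JavaScript that browsers fetch and evaluate to choose a proxy, and some browsers refuse to run it unless it is served as application/x-ns-proxy-autoconfig. A minimal sketch (the proxy address and internal domain below are placeholders, not values from this thread):

```javascript
// Minimal PAC file sketch. The browser calls FindProxyForURL() for
// every request. "proxy.example.com:3128" and ".internal.lan" are
// illustrative placeholders only.
function FindProxyForURL(url, host) {
  // Keep internal traffic direct; send everything else via the proxy,
  // falling back to a direct connection if the proxy is unreachable.
  if (host === "wpad" || host.indexOf(".internal.lan") !== -1) {
    return "DIRECT";
  }
  return "PROXY proxy.example.com:3128; DIRECT";
}
```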
Solution: On Gentoo, Squid is using the /etc/mime.types file to guess
the mime type instead of using what the remote webserver says the file
is. I
Point 1: Squid does not do that. It does not use mime.types at all.
Content-Type headers are passed through unchanged from what is received,
unless administratively changed by header_replace.
Taken from access.log
Before changing mime.types
1263657638.249 0 10.10.122.248 TCP_MEM_HIT/200 670 GET
http://wpad/wpad.dat - NONE/- chemical/x-mopac-input
1263661679.834 0 10.10.122.239 TCP_MEM_HIT/200 670 GET
http://wpad/wpad.dat - NONE/- chemical/x-mopac-input
1263662648.054 9 10.10.122.248 TCP_CLIENT_REFRESH_MISS/200 654 GET
http://wpad/wpad.dat - DIRECT/10.10.122.250 chemical/x-mopac-input
1263662742.482 4 10.10.122.248 TCP_CLIENT_REFRESH_MISS/200 654 GET
http://wpad/wpad.dat - DIRECT/10.10.122.250 chemical/x-mopac-input
1263662752.973 0 10.10.122.248 TCP_IMS_HIT/304 264 GET
http://wpad/wpad.dat - NONE/- chemical/x-mopac-input
1263664740.203 0 10.10.122.248 TCP_MEM_HIT/200 669 GET
http://wpad/wpad.dat - NONE/- chemical/x-mopac-input
After changing mime.types
1263834369.649 1 10.10.122.241 TCP_REFRESH_UNMODIFIED/200 647 GET
http://wpad/wpad.dat - DIRECT/10.10.122.250
application/x-ns-proxy-autoconfig
1263834539.719 0 10.10.122.241 TCP_MEM_HIT/200 657 GET
http://wpad/wpad.dat - NONE/- application/x-ns-proxy-autoconfig
1263834791.576 0 10.10.122.241 TCP_MEM_HIT/200 657 GET
http://wpad/wpad.dat - NONE/- application/x-ns-proxy-autoconfig
1263834822.423 0 10.10.122.241 TCP_MEM_HIT/200 657 GET
http://wpad/wpad.dat - NONE/- application/x-ns-proxy-autoconfig
This log contains what the web server passed to Squid, not what Squid
passed to the clients.
Q: Is the WPAD web server on the same box where you are altering mime.types?
I just double-checked that (ForceType application/x-ns-proxy-autoconfig)
in my apache vhost config is working correctly. Also, apache's
mime.types file is set up correctly for this particular item.
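For reference, the sort of vhost stanza being described here would look roughly like this (a sketch only; the paths and names are illustrative, not copied from the actual config):

```apache
# Apache vhost serving http://wpad/ -- ForceType overrides whatever
# mod_mime would otherwise guess from its own mime.types file.
<VirtualHost *:80>
    ServerName wpad
    DocumentRoot /var/www/wpad
    <Files "wpad.dat">
        ForceType application/x-ns-proxy-autoconfig
    </Files>
</VirtualHost>
```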
fixed the file, which I also noticed has several other issues, answering
my other issue: 95% of my data is being caught by the catch-all
refresh_pattern instead of the mime-type ones.
Point 2: Squid does not accept mime types in the refresh_pattern
directive.
This explains a few things.
Are you _sure_ that:
* the PAC file is not cached with old headers from before your changes?
Yes
I can only get Squid to produce the wrong mime type by altering
refresh_pattern to the values you have in your config. With that done,
Squid very consistently insists on producing a HIT with the first mime
header received, no matter how the headers change on the server or what
cache controls the server passes to Squid.
* the PAC file is actually being fetched from the web server you are
expecting?
Yes
* this is an official build of Squid?
Yes, see below.
* nobody has applied third-party patches to it?
(none of the official Gentoo patches change mime.types.
http://sources.gentoo.org/viewcvs.py/gentoo-x86/net-proxy/squid/files/)
Fairly sure.
What headers does this produce when run on the Squid box?
squidclient -v -h wpad -p 80 /wpad.dat
I'm posting version and configuration at the bottom of this email.
Refresh patterns will be changed after this email is sent. This is a
standard gentoo install with the epoll USE flag.
[ebuild R ] net-proxy/squid-3.0.19 USE="caps epoll ldap mysql pam
samba sqlite ssl -icap-client (-ipf-transparent) -kerberos -kqueue
-logrotate* -nis (-pf-transparent) -postgres -radius -sasl (-selinux)
-snmp -zero-penalty-hit" 0 kB
Okay. So no reason whatsoever why the mime type is changing.
(squidclient -v -h wpad -p 80 /wpad.dat) yields
headers: 'GET /wpad.dat HTTP/1.0
Accept: */*
'
HTTP/1.1 404 Not Found
Date: Tue, 19 Jan 2010 03:27:19 GMT
Server: Apache
Content-Length: 265
Connection: close
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /wpad.dat was not found on this server.</p>
<hr>
<address>Apache Server at localhost Port 80</address>
</body></html>
So I used GET instead.
(GET http://wpad/wpad.dat -USed)
GET http://wpad/wpad.dat
User-Agent: lwp-request/5.827 libwww-perl/5.831
GET http://wpad/wpad.dat --> 200 OK
Connection: close
Date: Tue, 19 Jan 2010 03:28:59 GMT
Accept-Ranges: bytes
Age: 412
ETag: "736a9e-119-47d6be3f06d80"
Server: Apache
Content-Length: 281
Content-Type: application/x-ns-proxy-autoconfig
Last-Modified: Mon, 18 Jan 2010 08:10:46 GMT
Client-Date: Tue, 19 Jan 2010 03:28:59 GMT
Client-Peer: 10.10.122.250:80
Client-Response-Num: 1
That reply appears to have gone through Squid. I'm particularly
interested in the headers going _into_ Squid.
Try this as well and compare to the set above.
squidclient -v -h wpad -p 80 -j wpad /wpad.dat
Of note for other Gentoo & Debian users. From mime.types:
# This file is part of the app-misc/mime-types package, which is based
on debian's "mime-support".
So my question now is: how do I force Squid to use the mime type
delivered by the remote webserver, without killing mime.types and thus
breaking my system in new and unexpected ways?
The official releases of Squid pass content-type headers through
unchanged. Something is broken.
On 1/15/2010 8:22 PM, Amos Jeffries wrote:
Jason Spegal wrote:
Is mime.conf what is used by refresh_pattern when mime types are used
for the regex?
No.
refresh_pattern uses a text regex against the requested URL string.
mime.conf is used by FTP and Gopher directory display to show the
icons.
Amos
Squid Cache: Version 3.0.STABLE19
configure options: '--prefix=/usr' '--build=i686-pc-linux-gnu'
'--host=i686-pc-linux-gnu' '--mandir=/usr/share/man'
'--infodir=/usr/share/info' '--datadir=/usr/share' '--sysconfdir=/etc'
'--localstatedir=/var/lib' '--sysconfdir=/etc/squid'
'--libexecdir=/usr/libexec/squid' '--localstatedir=/var'
'--datadir=/usr/share/squid' '--with-default-user=squid'
'--enable-auth=basic,digest,negotiate,ntlm'
'--enable-removal-policies=lru,heap'
'--enable-digest-auth-helpers=password'
'--enable-basic-auth-helpers=DB,PAM,LDAP,SMB,multi-domain-NTLM,getpwnam,NCSA,MSNT'
'--enable-external-acl-helpers=ldap_group,wbinfo_group,ip_user,session,unix_group'
'--enable-ntlm-auth-helpers=SMB,fakeauth'
'--enable-negotiate-auth-helpers=' '--enable-useragent-log'
'--enable-cache-digests' '--enable-delay-pools' '--enable-referer-log'
'--enable-arp-acl' '--with-large-files' '--with-filedescriptors=8192'
'--enable-caps' '--disable-snmp' '--enable-ssl' '--disable-icap-client'
'--enable-http-violations' '--with-pthreads' '--with-aio'
'--enable-storeio=ufs,diskd,aufs,null' '--enable-linux-netfilter'
'--enable-epoll' 'build_alias=i686-pc-linux-gnu'
'host_alias=i686-pc-linux-gnu' 'CC=i686-pc-linux-gnu-gcc'
'CFLAGS=-march=pentium4m -O2 -pipe -fomit-frame-pointer'
'LDFLAGS=-Wl,-O1' 'CXXFLAGS=-march=pentium4m -O2 -pipe
-fomit-frame-pointer'
From squid.conf:
<snip>
Okay, before reading further:
Please don't take any of the following personally. I have no idea who
configured this Squid, or what company-policy constraints they were
working under. I do know that some policies and external websites do
force extreme measures.
I make the following statements with three hats on:
* an Internet citizen who wants websites to load reliably with the
right and current content shown
* a webmaster who spends considerable time working to make clients'
dynamic websites cacheable and efficient (hence the angst, if it shows
through too thickly)
* a squid developer who spends considerable time trying to make Squid
do things properly according to the HTTP protocol RFC and helping people
leverage that for faster networks.
acl dynamic_content urlpath_regex -i
\.(asp|aspx|php|pl|xml|rss|kml|cgi|py|pyc) #(\?.*)?$
Hmm, any URL ending with a "#". A weird thing to be looking for.
NP: The '#' sign is never sent in transmitted URLs. It is an internal
tag private to the browser. When data needs to use that sign, it must
always be URL-encoded for transmission.
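A quick illustration of that point (Python; the URL is made up):

```python
from urllib.parse import urlsplit

# The fragment ("#section2") is split off by the browser and never
# appears in the request a proxy sees; only path + query go on the wire.
parts = urlsplit("http://example.com/page.php?id=1#section2")
print(parts.fragment)                  # browser-side only
print(parts.path + "?" + parts.query)  # what is actually transmitted
```

So a urlpath_regex looking for a literal '#' can never match a normal request.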
acl dynamic_content urlpath_regex -i http://audio*pandora.com/*.mp*
That pattern is broken on so many levels I can't even describe them in
less than a page of text. Suffice to say...
It only matches things like:
http://example.com?urlpath=http://audipandoraZcomZm
or
http://example.com?urlpath=http://audiooopandoraZcomZmpAIUEHB78GWa
Since...
'*' means the previous _one_ symbol repeated zero or more times.
example.com/?http://audiooooooopandora.com///////.mppppppp
'.' means any symbol at all.
example.com/?http://audipandoraZcomZm
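Treating the pattern literally as a case-insensitive regex makes the problem concrete. A quick check in Python (re.search is a reasonable stand-in for Squid's regex matching here, and the "real" pandora URL below is invented for illustration):

```python
import re

# The pattern exactly as written in the squid.conf acl line.
pat = re.compile(r"http://audio*pandora.com/*.mp*", re.IGNORECASE)

# It matches the nonsense strings above...
assert pat.search("example.com/?urlpath=http://audipandoraZcomZm")
assert pat.search("example.com/?http://audiooooooopandora.com///////.mppppppp")

# ...but not a plausible real audio URL, because "o*" only repeats the
# letter "o" and cannot bridge "audio-sv5." across to "pandora".
assert pat.search("http://audio-sv5.pandora.com/stream.mp3") is None
```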
acl dynamic_content urlpath_regex -i cgi-bin
cache deny dynamic_content
Well, let's say that once upon a time, whole decades ago in another
century, that was recommended by the developers. Since 2.7 and 3.0 came
out, it is not.
Of course, with the things refresh_pattern is doing here, I'd hate to be
a customer who gets anything from this proxy's cache.
cache allow all
refresh_pattern -i kh*.google.com/? 43200 80% 259200 ignore-no-cache
ignore-private ignore-no-store ignore-auth override-expire
override-lastmod ignore-reload
refresh_pattern -i virtualearth.net/? 43200 80% 259200 ignore-no-cache
ignore-private ignore-no-store ignore-auth override-expire
override-lastmod ignore-reload
Meh. Well, yes, some websites do force radical measures due to their design.
refresh_pattern application/* 43200 80% 259200 ignore-no-cache
ignore-private ignore-no-store ignore-auth
refresh_pattern audio/* 43200 80% 259200 ignore-no-cache ignore-private
ignore-no-store ignore-auth
I've never seen a website that uses application/ or audio/ in its folder
paths. But if your users ever visit one, the pages will be stored for 6
months.
That _may_ catch some Java WAR websites which expose the
~/application/name/pages.html path bits. But I would think most of those
are hiding behind Apache and doing path re-writing.
refresh_pattern images/* 10080 16% 259200 ignore-no-cache ignore-private
ignore-no-store ignore-auth override-expire override-lastmod
Any website which uses the standard technique of placing common images
into a shared folder:
For example:
http://example.com/images/spacer.gif
NP: The irony here is that _these_ images are almost guaranteed to have
correct long-term cacheability information attached by the originating
web server.
refresh_pattern text/* 0 16% 259200 refresh-ims
refresh_pattern video/* 43200 80% 259200 ignore-no-cache ignore-private
ignore-no-store ignore-auth
All URLs containing a folder called video/ or text/.
For example:
http://example.com/video/index.html
http://example.com/plaintext/index.html
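The same experiment in Python makes the folder-name matching concrete (URLs invented):

```python
import re

# These refresh_pattern regexes match substrings of the URL, never the
# Content-Type header of the reply.
assert re.search(r"video/*", "http://example.com/video/index.html")
assert re.search(r"text/*", "http://example.com/plaintext/index.html")
assert re.search(r"application/*", "http://example.com/application/form.html")

# A reply whose Content-Type really is video/mp4, but whose URL lacks
# the substring "video", is never touched by the pattern.
assert re.search(r"video/*", "http://example.com/movie.mp4") is None
```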
refresh_pattern . 0 80% 259200 ignore-no-cache ignore-private
ignore-no-store ignore-auth
So... _everything_ that is not already stored for 6 months... gets
stored for 6 months, unless clients explicitly send flush requests with
Ctrl+Reload.
Regardless of what the original website is designed for!!!
Be it a captcha security image, someone's bank account details, or a
picture of their kitten.
And you are doing this on a transparent proxy... pretty much a textbook
example of an information leak via a man-in-the-middle attack.
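For comparison, the conservative refresh_pattern set shipped in the stock squid.conf looks approximately like this (reproduced from memory of the 2.7/3.0-era defaults, so double-check against your distribution's copy):

```
refresh_pattern ^ftp:              1440   20%   10080
refresh_pattern ^gopher:           1440    0%    1440
refresh_pattern -i (/cgi-bin/|\?)     0    0%       0
refresh_pattern .                     0   20%    4320
```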
reply_header_access Pragma deny all
reply_header_access Cache-Control deny all
?? Force browsers and downstream caches to think they can store anything
and everything?
Careful. This is generally not a good idea.
The effect _overall_ is that most dynamic content passes straight
through the proxy and gets cached however the client browser wants to
cache it (because you stripped the expiry and privacy information). The
rest of the content will be stored in your Squid for very long periods
and clients who request new updated data will be sent the old version
and told it has not changed.
There will be some overlap with websites which generate static content
at shorter intervals (e.g. Facebook, and mailing-list archives), from
which your clients never seem to get the new versions in a timely
manner. Only the rather broken ones which serve static content through
very inefficient dynamic re-processors will look right all the time.
deny_info about:blank blocked_sites
oooh, nasty. Do you get a lot of phone calls about the Internet being
"down" with no explanation?
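A less confusing alternative is to serve Squid's standard error template, so blocked users at least see an explanation (blocked_sites is the ACL name from the config above; ERR_ACCESS_DENIED is one of the stock error pages):

```
# Show the standard "Access Denied" page instead of a blank document
# when the blocked_sites ACL fires.
deny_info ERR_ACCESS_DENIED blocked_sites
```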
Amos
--
Please be using
Current Stable Squid 2.7.STABLE7 or 3.0.STABLE21
Current Beta Squid 3.1.0.15