On 14/06/2012 8:53 p.m., Jack Bates wrote:
On 18/05/12 05:55 AM, Eliezer Croitoru wrote:
On 18/05/2012 10:33, Jack Bates wrote:
Are there any resources in Squid core or in the Squid community to help
cache duplicate files? Squid is very useful for building content
distribution networks, but how does Squid handle duplicate files from
content distribution networks when it is used as a forward proxy?
This is important to us because many download sites present users with a simple download button that doesn't always send them to the same mirror. Some users are redirected to mirrors that are already cached while other users are redirected to mirrors that aren't. We use a caching proxy in a rural village here in Rwanda to improve internet access, but users often can't predict whether a download will take seconds or hours, which is frustrating.
How does Squid handle files distributed from mirrors? Do you know of any resources concerning forward proxies and download mirrors?
Squid 2.7 has the store_url_rewrite feature, which does what you need. SourceForge is one nice example of CDN-based file downloads served from mirrors.
You can also always use the cache_peer option to chain to a main Squid running a more up-to-date version, and use the older version only for what you need, such as a specific domain.
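For reference, a minimal sketch of such a rewrite helper, assuming Squid 2.7's storeurl_rewrite_program helper interface (one URL per input line, one reply per line) and a SourceForge-style mirror hostname pattern. The ".squid.internal" canonical host is an illustrative convention, not something Squid mandates:

```python
#!/usr/bin/env python
# Sketch of a Squid 2.7 storeurl_rewrite_program helper.
# The mirror pattern and canonical host below are assumptions for
# illustration; adapt them to the mirrors your users actually hit.
import re
import sys

# Collapse any SourceForge mirror host (e.g. surfnet.dl.sourceforge.net)
# into one canonical store URL so all mirrors share a single cache entry.
MIRROR = re.compile(r'^http://[a-z0-9.-]+\.dl\.sourceforge\.net/')
CANONICAL = 'http://dl.sourceforge.net.squid.internal/'

def store_url(url):
    """Return the canonical store URL, or the URL unchanged."""
    return MIRROR.sub(CANONICAL, url)

def main():
    for line in sys.stdin:
        # Squid passes "URL <other tokens>"; the URL is the first field.
        url = line.split()[0]
        sys.stdout.write(store_url(url) + '\n')
        sys.stdout.flush()  # reply must be unbuffered, one line per request

if __name__ == '__main__':
    main()
```

Hooked up with something like "storeurl_rewrite_program /usr/local/bin/store-url.py" in squid.conf (path hypothetical), every mirror of the same file would then hit the same cache entry.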
Thanks very much for pointing out the store_url_rewrite option, Eliezer. Does it require the proxy administrator to manually configure the list of download mirrors?
Does anyone in the Squid community have thoughts on exploiting
Metalink [1] to address caching duplicate files from content
distribution networks?
The approach I am pursuing is to exploit RFC 6249, Metalink/HTTP:
Mirrors and Hashes. Given a response with a "Location: ..." header and
at least one "Link: <...>; rel=duplicate" header, the proxy looks up
the URLs in these headers in the cache. If the "Location: ..." URL
isn't already cached but a "Link: <...>; rel=duplicate" URL is, then
the proxy rewrites the "Location: ..." header with the cached URL.
This should redirect clients to a mirror that is already cached.
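The rewrite step described above can be sketched roughly as follows. Here cached() is a hypothetical stand-in for the proxy's cache lookup, headers map each name to a list of values, and for simplicity each Link header is assumed to carry a single link (a real implementation would also split comma-separated Link values per RFC 5988):

```python
import re

# Matches the URL inside a 'Link: <url>; rel=duplicate' header value.
DUP = re.compile(r'<([^>]+)>\s*;\s*rel\s*=\s*"?duplicate"?')

def pick_location(headers, cached):
    """Given redirect response headers (name -> list of values), return
    the Location to send: the original URL if it is already cached,
    otherwise the first rel=duplicate Link URL that is cached."""
    location = headers.get('Location', [None])[0]
    if location is None or cached(location):
        return location
    for link in headers.get('Link', []):
        m = DUP.search(link)
        if m and cached(m.group(1)):
            return m.group(1)  # rewrite the redirect to the cached mirror
    return location  # nothing cached; leave the redirect alone
```

For example, if mirror1 returns a redirect whose rel=duplicate link points at a mirror already in the cache, the client is sent to the cached mirror instead.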
Thoughts?
Well, since our very own Henrik Nordstrom is one of the authors, I'd say there have been thoughts about it in the Squid community :-)
Another idea is to exploit RFC 3230, Instance Digests in HTTP. Given a response with a "Location: ..." header and a "Digest: ..." header, if the "Location: ..." URL isn't already cached then the proxy checks the cache for content with a matching digest, and rewrites the "Location: ..." header with the cached URL if found.
I am working on a proof-of-concept plugin for Apache Traffic Server as part of the Google Summer of Code. The code is up on GitHub [2].
If this is a reasonable approach, would it be difficult to build
something similar for Squid?
Please contact Alex Rousskov at measurement-factory.com; he was organising a project to develop Digest handling and de-duplication a while back.
Amos