We understand that that is the problem. We are trying to address it by using a
REST-like API, so that each URI identifies a specific chunk of a file. Of
course, we want to make each chunk small enough to be cacheable but big enough
not to hurt performance.

Our current question is the following: Peer 1 sends an HTTP request to Peer 2
with "www.tracker.com/someuniqueidentifierforchunkoffile" as the URI. Would
Squid or other web caches try to contact www.tracker.com instead of Peer 2, or
would they forward the request onward to Peer 2? We have (what I think is) a
great solution for the latter case, but our solution for the former is not as
"elegant".

Thank you again for all your input, I really appreciate it.

Mark Schall
Michigan State University
CSE Graduate Student

On Fri, Oct 9, 2009 at 1:26 AM, Amos Jeffries <squid3@xxxxxxxxxxxxx> wrote:
> CC'ing squid-dev so the other developers can get a look and comment
>
> Mark Schall wrote:
>>
>> Thank you for the information.
>>
>> One more question:
>>
>> We're looking at researching whether it is possible to cache P2P data in
>> an HTTP cache (purely research). We have assumed that if we were to send
>> an HTTP request to an IP address (a different peer, 1.2.3.4) with a URI
>> in the header that does not correlate with that IP address, the web cache
>> would store the response based on the URI in the header. That way, if we
>> sent a request to a different peer (5.6.7.8) with the same URI in the
>> header, we would get back the cached data.
>>
>> I know this is a big assumption, and we would change our approach if it
>> is not true, but it seems logical for caches to work this way. Do you
>> know if Squid works like this?
>
> I've given this a small amount of thought over the last few years, and
> bounced the idea off Adrian after hours at a conference last year.
>
> We came to the conclusion that it would be a very difficult thing to do in
> Squid as the code currently stands.
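A minimal sketch of the request pattern discussed above, with placeholder host
names and a placeholder chunk identifier (this is illustrative only, not a
working P2P client): Peer 1 connects to Peer 2 by IP address but names the
chunk with an absolute, tracker-based URI, the same request form a client uses
when talking to a proxy, so an intermediate cache could key on the URI rather
than on whichever peer happened to serve it.

```python
# Sketch (placeholder names throughout): Peer 1 fetches a chunk from Peer 2
# by IP, but puts an absolute tracker-based URI on the request line so that
# any HTTP cache in between could key the response on the URI, not the peer.
import http.client

CHUNK_URI = "http://www.tracker.com/someuniqueidentifierforchunkoffile"

def fetch_chunk(peer_addr, chunk_uri, port=80, timeout=5):
    """Connect to the peer's IP address, but send the URI in absolute form,
    as one would when talking to a proxy."""
    conn = http.client.HTTPConnection(peer_addr, port, timeout=timeout)
    try:
        # The intent is that a cache derives its key from chunk_uri, so the
        # same chunk fetched later from a different peer hits the cache.
        conn.request("GET", chunk_uri, headers={"Host": "www.tracker.com"})
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()

# e.g. fetch_chunk("5.6.7.8", CHUNK_URI) would ask peer 5.6.7.8 for the
# chunk named by CHUNK_URI.
```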
>
> It is theoretically possible, and relatively easy, to add a P2P port and
> an engine to handle the arriving requests, and to cache the objects like
> any others. The URI can be provided as you say, or even created as needed
> directly out of the P2P metadata in the .torrent case.
>
> The major blocking problem is that Squid's cache storage does not yet
> support partial ranges of objects. This is a big problem for HTTP and
> becomes a critical issue if P2P downloads are added. It means,
> essentially, that the segments of P2P files cannot be fetched in parallel
> from multiple sources, breaking the best benefits P2P would bring to
> Squid. The P2P files could still be fetched linearly by Squid, however.
>
> A lesser major issue is the sheer size of P2P objects and traffic. Caches
> are already filled with a lot of content from HTTP alone; adding P2P
> requests would place a major burden on storage space. This is more an
> obstacle for the admin, however.
>
> Beyond that there are a lot of small pieces of work to make Squid capable
> of contacting P2P servers and peers, intercepting seed file requests, etc.
>
>
>>
>> We also think it would be possible for the cache to take the HTTP
>> header, check whether the URI is in the cache, and if not, forward the
>> request to the domain named in the header.
>
> This is how proxies already work at a fundamental level.
>
>
>>
>> Thank you again
>>
>> Mark Schall
>> Michigan State University
>> CSE Graduate Student
>>
>>
>>
>> On Wed, Oct 7, 2009 at 11:17 PM, Amos Jeffries <squid3@xxxxxxxxxxxxx>
>> wrote:
>>>
>>> On Wed, 7 Oct 2009 11:24:29 -0400, Mark Schall <schallm2@xxxxxxx> wrote:
>>>>
>>>> Hi,
>>>>
>>>> My name is Mark Schall. I am a Master's student at Michigan State
>>>> University. I am working with a group trying to work with HTTP caches.
>>>> We were wondering: in general, do HTTP caches work by caching data
>>>> based on the IP address or by the URI of the HTTP
>>>
>>> In general? I won't dare to guess.
>>> Too many ways to do it, and too many different software caches using
>>> those ways.
>>>
>>> Squid in particular stores objects by hash. Older versions used a hash
>>> of the URL. Newer 2.x versions use a hash of the URL plus some Vary:
>>> headers and other data.
>>>
>>>> request. It seems that using IP addresses would be the most secure
>>>> means of caching, but the URI seems logical for multiple-server
>>>> websites.
>>>
>>> I assume by 'secure' you mean 'secure against data leaks'. There is
>>> nothing inherently secure about caching in the first place. The cache
>>> admin always has access to the cached data in intermediary traffic.
>>> What security there is in caching is built on trust between the cache
>>> admin and the website admin. The website admin trusts that the cache
>>> admin will obey the CC (Cache-Control) headers. The cache admin trusts
>>> that the website admin will set the headers correctly (e.g. "private")
>>> to protect sensitive information, and also to indicate how often
>>> objects get replaced, etc.
>
> Amos
> --
> Please be using
>   Current Stable Squid 2.7.STABLE7 or 3.0.STABLE19
>   Current Beta Squid 3.1.0.14
>
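The URL-keyed storage Amos describes is the property the research idea relies
on: if the store key is derived from the URI (plus any Vary-selected headers)
and never from the server's IP address, then the same URI fetched from two
different peers collapses to one cache entry. A minimal sketch of that keying
scheme, not Squid's actual code (the function name and hashing details here
are illustrative assumptions):

```python
# Sketch (not Squid's actual implementation): a cache keyed by a hash of
# the request URI plus any Vary-selected request headers. The peer's IP
# address never enters the key, so the same URI fetched from different
# peers maps to the same cache entry.
import hashlib

def store_key(method, uri, vary_headers=None):
    """Derive a store key from the request, ignoring the server address."""
    h = hashlib.md5()
    h.update(method.upper().encode())
    h.update(uri.encode())
    # Newer Squid 2.x also mixes in headers named by the response's Vary
    # header; sorted here so header order does not change the key.
    for name, value in sorted((vary_headers or {}).items()):
        h.update(name.lower().encode())
        h.update(value.encode())
    return h.hexdigest()

# The same URI yields the same key no matter which peer served it:
key_via_peer_a = store_key("GET", "http://www.tracker.com/chunk42")
key_via_peer_b = store_key("GET", "http://www.tracker.com/chunk42")
assert key_via_peer_a == key_via_peer_b
```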