On Tue, Mar 11, 2014 at 9:43 PM, Alex Rousskov
<rousskov@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
> On 03/11/2014 01:18 PM, Nikolai Gorchilov wrote:
>> On Tue, Mar 11, 2014 at 6:10 PM, Alex Rousskov wrote:
>>> On 03/11/2014 08:05 AM, Omid Kosari wrote:
>>>> Is it possible for Squid to automatically find every similar
>>>> object based on something like an MD5 of the objects and serve
>>>> them to clients without needing a custom DB?
>
>>> No, because clients do not tell Squid what checksum they are
>>> looking for.
>
>>> It is possible to avoid caching duplicate content, but that only
>>> allows you to handle cache hits more efficiently. It does not help
>>> with cache misses (when the URL requested by the client has not
>>> been seen before).
>
>> Actually, two commercial vendors - PeerApp and ThunderCache - claim
>> their products don't use URLs to identify the objects, so they
>> don't have to maintain a StoreID-like de-duplication database
>> manually.
>>
>> Any ideas how they do it?
>
> Most likely they do not, and you are simply being misled by their
> marketing claims. In general, it is not possible to ignore the
> request URL and still produce the right response (think about it!).

I also suspected it was just marketing, but wanted to check whether I
was missing something :)

> They probably do not store duplicate cache objects, but, as
> discussed above, that is far from the "automatic StoreID"
> functionality that the original poster is asking about.
>
> In other words, there are at least two de-duplication layers:
>
> * The higher-level one is based on URLs and essentially requires
> manual URL mapping. It helps turn cache misses into hits.
>
> * The lower-level one is based on checksums and can be automated. It
> helps spend less cache space to serve cache hits. Some commercial
> products have implemented this lower-level optimization.

I was thinking about this second option some time back. It does not
seem very complicated, and I see clear benefits if it were
implemented in Squid: we would get the best of both worlds.

Combining lower-level checksum-based deduplication with some form of
feedback mechanism (logging, a helper, etc.) would let either humans
or heuristic algorithms create and update StoreID patterns. Rough
sketches of both pieces are below.

Best,
Niki
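P.S. Here is a minimal sketch of the lower-level, checksum-based
layer: each response body is stored once under its digest, and every
URL becomes a small index entry pointing at a digest. This is a toy
illustration in Python, not Squid code; the names and directory
layout (store, fetch, STORE_DIR) are assumptions of mine:

    import hashlib
    import os

    STORE_DIR = './dedup-store'   # stand-in for a cache_dir
    INDEX = {}                    # URL -> digest (stand-in for store metadata)

    if not os.path.isdir(STORE_DIR):
        os.makedirs(STORE_DIR)

    def store(url, body):
        """Cache a response body, writing it to disk only once."""
        digest = hashlib.md5(body).hexdigest()
        path = os.path.join(STORE_DIR, digest)
        if not os.path.exists(path):      # first copy of this content
            with open(path, 'wb') as f:
                f.write(body)
        INDEX[url] = digest               # a duplicate costs one index entry

    def fetch(url):
        """Return the cached body for a URL, or None on a miss."""
        digest = INDEX.get(url)
        if digest is None:
            return None                   # unseen URL: still a cache miss
        with open(os.path.join(STORE_DIR, digest), 'rb') as f:
            return f.read()

Note how this matches your point exactly: duplicates stop costing
disk space, but an unseen URL is still a miss, no matter how many
copies of its content we already hold.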
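The feedback part could be as simple as logging a "digest URL" pair
for every stored object and grouping the log offline. Digests
reachable through more than one URL are precisely the StoreID
candidates a human or a heuristic would want to see. Again only a
sketch, and the log format is an assumption:

    import sys
    from collections import defaultdict

    def suggest_candidates(log_lines):
        """Group URLs that share a body digest."""
        by_digest = defaultdict(set)
        for line in log_lines:
            parts = line.split(None, 1)   # "digest URL" per line (assumed)
            if len(parts) != 2:
                continue
            digest, url = parts
            by_digest[digest].add(url.strip())
        # More than one distinct URL per digest means duplicate
        # content hiding behind different URLs.
        return [urls for urls in by_digest.values() if len(urls) > 1]

    if __name__ == '__main__':
        for group in suggest_candidates(sys.stdin):
            print('candidate StoreID group:')
            for url in sorted(group):
                print('  ' + url)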
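Once such a group reveals a stable pattern, the last step is the
usual manual one: a StoreID helper that maps all variants onto one
internal key. A minimal helper for a made-up CDN (the URL pattern is
purely hypothetical), using the non-concurrent helper protocol as I
understand it - one URL per input line, one OK/ERR reply per line:

    #!/usr/bin/env python
    import re
    import sys

    # Hypothetical: http://cacheNN.cdn.example.com/video/12345?token=...
    # where only "12345" identifies the object.
    PATTERN = re.compile(r'^http://cache\d+\.cdn\.example\.com/video/(\d+)')

    for line in sys.stdin:
        parts = line.split()
        m = PATTERN.match(parts[0]) if parts else None
        if m:
            # every mirror/token variant maps to the same store key
            sys.stdout.write(
                'OK store-id=http://video.cdn.example.com.squid.internal/%s\n'
                % m.group(1))
        else:
            sys.stdout.write('ERR\n')
        sys.stdout.flush()   # Squid waits for one reply per request

with the matching squid.conf lines being something like:

    store_id_program /usr/local/bin/storeid_helper.py
    store_id_children 10 startup=2 idle=1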