On Tue, Mar 11, 2014 at 9:43 PM, Alex Rousskov
<rousskov@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
> On 03/11/2014 01:18 PM, Nikolai Gorchilov wrote:
>> On Tue, Mar 11, 2014 at 6:10 PM, Alex Rousskov wrote:
>>> On 03/11/2014 08:05 AM, Omid Kosari wrote:
>>>> Is it possible for Squid to automatically find every similar
>>>> object based on something like an MD5 of the objects and serve
>>>> them to clients without needing a custom DB?
>
>>> No, because clients do not tell Squid what checksum they are
>>> looking for.
>
>>> It is possible to avoid caching duplicate content, but that only
>>> allows you to handle cache hits more efficiently. It does not help
>>> with cache misses (when the URL requested by the client has not
>>> been seen before).
>
>> Actually, two commercial vendors - PeerApp and ThunderCache - claim
>> their products don't use URLs to identify the objects, so they
>> don't have to maintain a StoreID-like de-duplication database
>> manually.
>>
>> Any ideas how they do it?
>
> Most likely they do not, and you are simply being misled by their
> marketing claims. In general, it is not possible to ignore the
> request URL and still produce the right response (think about it!).

I also suspected it was just marketing, but wanted to check whether I
was missing something :)

> They probably do not store duplicate cache objects, but, as
> discussed above, that is far from the "automatic StoreID"
> functionality that the original poster is asking about.
>
> In other words, there are at least two de-duplication layers:
>
> * The higher-level one is based on URLs and essentially requires
> manual URL mapping. It helps turn cache misses into hits.
>
> * The lower-level one is based on checksums and can be automated. It
> helps spend less cache space to serve cache hits. Some commercial
> products have implemented this lower-level optimization.

I was thinking about this second option some time back. It does not
seem very complicated, and I see clear benefits if it were
implemented in Squid: we would get the best of both worlds.

Combining lower-level checksum-based deduplication with some form of
feedback mechanism (logging, a helper, etc.) would let either humans
or heuristic algorithms create and update StoreID patterns. Rough
sketches of both pieces are below.

Best,
Niki
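P.S. Here is a minimal sketch of the lower-level, checksum-based
layer: each response body is stored once under its digest, and every
URL becomes a small index entry pointing at a digest. This is a toy
illustration in Python, not Squid code; the names and directory
layout (store, fetch, STORE_DIR) are assumptions of mine:

    import hashlib
    import os

    STORE_DIR = './dedup-store'   # stand-in for a cache_dir
    INDEX = {}                    # URL -> digest (stand-in for store metadata)

    if not os.path.isdir(STORE_DIR):
        os.makedirs(STORE_DIR)

    def store(url, body):
        """Cache a response body, writing it to disk only once."""
        digest = hashlib.md5(body).hexdigest()
        path = os.path.join(STORE_DIR, digest)
        if not os.path.exists(path):      # first copy of this content
            with open(path, 'wb') as f:
                f.write(body)
        INDEX[url] = digest               # a duplicate costs one index entry

    def fetch(url):
        """Return the cached body for a URL, or None on a miss."""
        digest = INDEX.get(url)
        if digest is None:
            return None                   # unseen URL: still a cache miss
        with open(os.path.join(STORE_DIR, digest), 'rb') as f:
            return f.read()

Note how this matches your point exactly: duplicates stop costing
disk space, but an unseen URL is still a miss, no matter how many
copies of its content we already hold.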
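The feedback part could be as simple as logging a "digest URL" pair
for every stored object and grouping the log offline. Digests
reachable through more than one URL are precisely the StoreID
candidates a human or a heuristic would want to see. Again only a
sketch, and the log format is an assumption:

    import sys
    from collections import defaultdict

    def suggest_candidates(log_lines):
        """Group URLs that share a body digest."""
        by_digest = defaultdict(set)
        for line in log_lines:
            parts = line.split(None, 1)   # "digest URL" per line (assumed)
            if len(parts) != 2:
                continue
            digest, url = parts
            by_digest[digest].add(url.strip())
        # More than one distinct URL per digest means duplicate
        # content hiding behind different URLs.
        return [urls for urls in by_digest.values() if len(urls) > 1]

    if __name__ == '__main__':
        for group in suggest_candidates(sys.stdin):
            print('candidate StoreID group:')
            for url in sorted(group):
                print('  ' + url)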
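Once such a group reveals a stable pattern, the last step is the
usual manual one: a StoreID helper that maps all variants onto one
internal key. A minimal helper for a made-up CDN (the URL pattern is
purely hypothetical), using the non-concurrent helper protocol as I
understand it - one URL per input line, one OK/ERR reply per line:

    #!/usr/bin/env python
    import re
    import sys

    # Hypothetical: http://cacheNN.cdn.example.com/video/12345?token=...
    # where only "12345" identifies the object.
    PATTERN = re.compile(r'^http://cache\d+\.cdn\.example\.com/video/(\d+)')

    for line in sys.stdin:
        parts = line.split()
        m = PATTERN.match(parts[0]) if parts else None
        if m:
            # every mirror/token variant maps to the same store key
            sys.stdout.write(
                'OK store-id=http://video.cdn.example.com.squid.internal/%s\n'
                % m.group(1))
        else:
            sys.stdout.write('ERR\n')
        sys.stdout.flush()   # Squid waits for one reply per request

with the matching squid.conf lines being something like:

    store_id_program /usr/local/bin/storeid_helper.py
    store_id_children 10 startup=2 idle=1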