> Hi,
>
> I'm wondering about ways to measure the optimum size for a cache, in
> terms of the "value" you gain from each GB of cache space. If you've
> got a 400GB cache and 99% of your hits come from the first 350GB,
> there's probably no point looking for a larger cache. If only 80% come
> from the first 350GB, then a bigger cache might well be useful.

Squid suffers from a bit of an anachronism in the way it stores objects.
The classic UFS systems essentially use round-robin and a hash to
determine the storage location for each object separately. This works
wonders for ensuring no clashes, but not so well for retrieval
optimization.

Adrian Chadd has done a lot of study and some work in this area,
particularly for Squid-2.6/2.7. His NYCBSDCon paper is a good read on
how disk storage relates to Squid:
http://www.squid-cache.org/~adrian/talks/20081007%20-%20NYCBSDCON%20-%20Disk%20IO.pdf

> I realise there are rules of thumb for cache size; it would be
> interesting to be able to analyse a particular Squid installation.

Feel free. We would be interested in any improvements you can come up
with.

> Squid obviously removes objects from its cache based on the chosen
> cache_replacement_policy. It appears from the comments in squid.conf
> that in the case of the LRU policy, this is implemented as a list,
> presumably a queue of pointers to objects in the cache. Objects which
> come to the head of the queue are presumably next for removal. I guess
> if an object in the cache gets used it goes back to the tail of the
> queue. I suppose this process must involve linearly traversing the
> queue to find the object and remove it, which is presumably why
> heap-based policies are available.

IIRC there is a doubly-linked list with a tail pointer for LRU.

> I wonder if it would be feasible to calculate a "cache rank", which
> indicates the position an object was at within the queue at the time
> of the hit. So, perhaps 0 means at the tail of the queue, 1 means at
> the head. If this could be reported alongside each hit in the
> access.log, one could draw stats on the amount of hits served by each
> portion of the queue and therefore determine the value of expanding or
> contracting your cache.
>
> In the case of simple LRU, if the queue must be traversed to find each
> element and requeue it (perhaps this isn't the case?), I suppose one
> could count the position in the queue and divide by the total length.

Yes, that hits the same big problem in LRU as displaying all objects in
the cache (>1 million objects is not an uncommon cache size) and as
regex purges: it requires a full traversal of the list.

> With a heap, things are more complex. I guess you could give an
> indication of the depth in the heap, but there would be so many
> objects on the lowest levels that I don't suppose this would be a
> great guide. Is there some better value available, such as the key
> used in the heap maybe?

There is the fileno or a hashed value rather than the URL. You still
have the same traversal issues, though.

> Or perhaps the whole idea is flawed somehow?
>
> Comments, criticisms, explanations, rebukes all welcome.
> Gavin

If you want to investigate, I'll gently nudge you towards Squid-3,
where the rest of the development is going on and where improvements
have the best chance of survival. For further discussion you may want
to bring this up on squid-dev, where the developers hang out.

Amos
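
PS. To make the "cache rank" idea concrete, here is a rough standalone
sketch in C++ (my own illustration, not Squid source; the Entry and
LruList names are invented for the example). It models the LRU list as
the doubly-linked list with a tail pointer mentioned above. Note that
with such a list the requeue-on-hit needs no traversal at all; the O(n)
walk only appears when computing the proposed rank: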
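
#include <cstddef>
#include <iostream>
#include <string>

struct Entry {
    std::string key;          // stand-in for a cached object
    Entry *prev = nullptr;
    Entry *next = nullptr;
};

struct LruList {
    Entry *head = nullptr;    // LRU end: next candidate for removal
    Entry *tail = nullptr;    // MRU end: most recently used
    std::size_t count = 0;

    // Unlink an entry in O(1) -- no search needed, because each entry
    // carries its own prev/next pointers. This is the point of the
    // doubly-linked list with tail pointer.
    void unlink(Entry *e) {
        if (e->prev) e->prev->next = e->next; else head = e->next;
        if (e->next) e->next->prev = e->prev; else tail = e->prev;
        e->prev = e->next = nullptr;
        --count;
    }

    // Append at the MRU end.
    void pushTail(Entry *e) {
        e->prev = tail;
        e->next = nullptr;
        if (tail) tail->next = e; else head = e;
        tail = e;
        ++count;
    }

    // On a hit, move the entry back to the MRU end: two O(1) steps.
    void touch(Entry *e) { unlink(e); pushTail(e); }

    // The proposed "cache rank": 0.0 at the tail (most recently used),
    // 1.0 at the head (next for removal). This walk is the expensive
    // part: O(n) per hit over a list that may hold >1 million entries.
    double rank(const Entry *e) const {
        std::size_t pos = 0;
        for (const Entry *p = tail; p && p != e; p = p->prev)
            ++pos;
        return count > 1 ? double(pos) / double(count - 1) : 0.0;
    }
};

int main() {
    LruList lru;
    Entry a{"a"}, b{"b"}, c{"c"};
    lru.pushTail(&a);
    lru.pushTail(&b);
    lru.pushTail(&c);     // order, head -> tail: a b c

    lru.touch(&a);        // a becomes MRU; order is now: b c a
    std::cout << "rank(a) = " << lru.rank(&a) << "\n";   // 0 (MRU)
    std::cout << "rank(b) = " << lru.rank(&b) << "\n";   // 1 (next out)
}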
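
That rank() walk is exactly the traversal problem mentioned earlier: on
a cache of a million-plus objects it is far too expensive to run on
every hit, so any real implementation would probably have to sample
only a small fraction of hits rather than rank each one.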