On 2/07/2012 6:13 p.m., Mustafa Raji wrote:
--- On Mon, 7/2/12, Amos Jeffries <squid3@xxxxxxxxxxxxx> wrote:
From: Amos Jeffries <squid3@xxxxxxxxxxxxx>
Subject: Re: a miss threshold for certain times of specified webpages
To: "Mustafa Raji" <mustafa.raji@xxxxxxxxx>
Cc: squid-users@xxxxxxxxxxxxxxx
Date: Monday, July 2, 2012, 12:29 AM
On 02.07.2012 10:24, Mustafa Raji wrote:
--- On Sun, 7/1/12, Amos Jeffries wrote:
From: Amos Jeffries
On 1/07/2012 1:04 a.m., Mustafa Raji wrote:
hello
is there an option that limits the number of accesses to a webpage before it can be considered cacheable, so that only then does Squid cache the webpage?
example:
some option like a (miss threshold) = 30
so the user requests the page 30 times and those requests for the objects are treated as miss requests; after the user's requests reach this threshold (30), squid can consider the webpage's objects cacheable and begin to cache them
Uhm, why are you even considering this? What benefit can you gain by wasting bandwidth and server CPU time?
HTTP servers send out Cache-Control details specifying whether and for how long each object can be cached. Replacing these controls (which are often carefully chosen by the webmaster) with arbitrary other algorithms like the one you suggest is where all the trouble people have with proxies comes from.
Amos
thanks Amos for your reply
what about an option that considers the first 60 HTTP requests for the google webpage as misses, and after those 60 requests allows the google webpage to be cached? is there any option in squid to do this, of course without any time limitation
No, because HTTP is a stateless protocol where each request MUST be considered in isolation from every other request. Squid can handle tens of thousands of URLs per second, each URL being up to 64KB long with many possible characters at each byte position. Keeping counters for every unique URL received by Squid over an unlimited time period would be as bad as, or worse than, simply caching in accordance with HTTP design requirements.
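As a purely illustrative back-of-envelope sketch (the request rate, unique-URL fraction and sizes below are assumptions, not measurements from any real proxy), such counters grow without bound:

  # Back-of-envelope estimate only: memory cost of a naive, never-expiring
  # per-URL miss counter. All numbers here are illustrative assumptions.
  requests_per_second = 10_000   # "tens of thousands of URLs per second"
  unique_fraction = 0.5          # assume half of the requests are for new URLs
  avg_url_bytes = 100            # average URL length, far below the 64KB maximum
  entry_overhead_bytes = 50      # rough hash-table/bookkeeping cost per entry

  new_entries_per_day = requests_per_second * unique_fraction * 86_400
  bytes_per_day = new_entries_per_day * (avg_url_bytes + entry_overhead_bytes)
  print(f"new counter entries per day: {new_entries_per_day:,.0f}")
  print(f"extra memory per day: {bytes_per_day / 2**30:.0f} GiB")  # ~60 GiB, every day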
Which is why I asked: why do you think this is a good idea? What are you getting out of it? What possible use would outweigh all the wasted resources?
NP: the google webpage (any of them, including the front page) changes dynamically, with different displays depending on user browser headers, Cookies and Geo-IP based information. Storing when not told to is a *bad* idea. Discarding when told storage is possible is a waste of bandwidth.
Amos
thanks Amos for your helpful support
really i just need it for a test: a method to calculate how much to increase the squid box hardware (disk space) to get a high hit ratio, and to find how much hardware is worth adding for the hit ratio it gains. i hope i was clear in my explanation
Oh. Hit ratio is not something you can test like that. Attempts at
partial-caching will actively *reduce* it.
simple example:
if i add a 500 gigabyte hard disk and can reach a 20% hit ratio, it is worth adding this hardware
if i add a 500 gigabyte hard disk and can only reach a 2% hit ratio, it is not worth adding this hardware
of course in this community there is a good method to calculate that; please can you show me how to do that if you have time, or just give me a link to a webpage explaining how to do that
Hit ratio is the ratio of cacheable to non-cacheable content in your HTTP traffic flow. Imagine the cache storage size as a "window" of traffic over which this HIT ratio is accumulated. There are some complex feedbacks, in that each 1% of HIT increases the actual traffic window size by 1%, and things like that.
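To put rough numbers on that window idea (the 500GB cache and 20% HIT ratio are simply the figures from your example, not predictions):

  # Illustrative only: the "window" feedback described above, using the rough
  # rule of thumb that each 1% of HIT stretches the traffic window by ~1%.
  cache_size_gb = 500        # proposed disk cache, from the example above
  hit_ratio_percent = 20     # hoped-for HIT ratio, also from the example

  window_gb = cache_size_gb * (1 + hit_ratio_percent / 100.0)
  print(f"a {cache_size_gb} GB cache spans roughly {window_gb:.0f} GB of client traffic")
  # -> ~600 GB: HIT responses are served from disk, so the same storage
  #    covers a larger stretch of the users' traffic.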
So as you can see, the best way to calculate HIT ratio is to take a measure of your users' HTTP traffic (twice as large as the proposed 500GB cache size, in case you are lucky enough to get ~50% hit ratio) and see how many of the requests were repeats for the same URL. That count, as a percentage of the total requests, is a rough upper limit on the HIT ratio for your users. You can guess that some of those were non-cacheable, but long-term, due to that increasing-window effect, your request HIT ratio will trend around that number.
NOTE: unfortunately I'm not aware of any tools that make this
calculation easy. If you have an existing Squid, you can do it from the
historic access.log. Otherwise you are left with tricky TCP dumps etc.
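For the access.log case, something along these lines is enough (a minimal sketch assuming the default native log format, where the URL is the seventh whitespace-separated field; adjust the index for a custom logformat):

  # Rough estimate of an upper-limit HIT ratio from an existing access.log,
  # by counting repeat requests for the same URL.
  import sys
  from collections import Counter

  url_counts = Counter()
  total = 0
  with open(sys.argv[1]) as log:
      for line in log:
          fields = line.split()
          if len(fields) < 7:
              continue                # skip malformed or truncated lines
          url_counts[fields[6]] += 1  # URL field in the default native format
          total += 1

  repeats = total - len(url_counts)   # every request beyond the first per URL
  if total:
      print(f"{total} requests, {len(url_counts)} unique URLs")
      print(f"rough upper-limit HIT ratio: {100.0 * repeats / total:.1f}%")

Run it as e.g. "python3 repeat_ratio.py access.log" (the script name is just a placeholder). It ignores Cache-Control, Cookies and the like, so the real HIT ratio will be somewhat lower than what it reports.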
But effectively, the larger your cache storage size, the more HIT traffic you can achieve. Nobody I know of is proxying real user traffic and getting less than a 5% HIT ratio without something (like a small or missing disk cache) limiting the traffic that can be HIT on. Current Squid are getting 10%-20% ratios out of the box for ISPs that set up a cache and leave it without any special attention. As a special case, mobile networks with tuning are achieving up to 55% at one ISP.
... and we are constantly doing things to improve cacheability, from
adjusting Squid cache control algorithms to advocating cache-friendly
practices by web designers.
Amos