On Sat, Aug 11, 2007, Michel Santos wrote:

> > * don't write everything cachable to disk! only write stuff that has
> >   a good chance of being read again;
>
> there is a "good chance" of being hit by a car when sleeping in the
> middle of a highway, just as there is a chance of not being hit at all
> :) :)
> well, that was my knowledge about chances, but there are not so many
> options here: either you are a hell of a foreseer, or you create an
> algorithm, kind of inverting the usage of the actual or other cache
> policies, applying them before caching the objects instead of
> controlling the replacement and aging

No, you run two separate LRUs, glued to each other. One LRU is for new
objects that are coming in; the other LRU is for objects which have been
accessed more than once.

The interesting trace to do on a production cache (remember, I don't have
access to production ISP caches and haven't for quite a while) is to
calculate the chance of seeing a subsequent request HIT after a certain
period of time. You want to know what % of the objects being requested
are never going to be seen again, where "never" can be some time length
(say, a day; you can vary it in your experiment.)

The idea here is that a large % of your requests are once-off and won't
be seen again; what you want to do is keep those in memory in case they
are seen again, and not write them to disk. You only want to write stuff
to disk that has a higher chance of being requested later. It may also
make your memory cache more efficient, as you're not so worried about
pushing hot objects out of the way to make room for transient bursts of
new, never-seen-again objects. As an example, ZFS does this for its
memory cache management, to prevent a "find /" type workload killing the
hot page cache set.

So you're not predicting the future. :) This is all mostly conjecture on
my part, but the papers from the early 2000s list these techniques as
giving noticeable returns.
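To make the two-LRU idea concrete, here's a rough sketch in Python (my
own made-up names and sizes, nothing to do with the actual Squid code):
only objects that come back for a second request get promoted out of the
"probation" LRU and become candidates for a disk write; the once-off
objects just age out of memory.

    # Rough sketch only: a segmented LRU with a probation segment for
    # objects seen once and a protected segment for objects seen again.
    # Class name and segment sizes are made up for illustration.
    from collections import OrderedDict

    class SegmentedLRU:
        def __init__(self, probation_size, protected_size):
            self.probation = OrderedDict()   # seen once, memory only
            self.protected = OrderedDict()   # seen more than once
            self.probation_size = probation_size
            self.protected_size = protected_size

        def access(self, key, value=None):
            """Record a request; return True if the object is worth
            writing to disk (it has now been seen more than once)."""
            if key in self.protected:
                self.protected.move_to_end(key)      # refresh recency
                return True
            if key in self.probation:
                # Second hit: promote into the protected segment.
                self.protected[key] = self.probation.pop(key)
                if len(self.protected) > self.protected_size:
                    # Demote the coldest protected object back to
                    # probation instead of dropping it outright.
                    old_key, old_val = self.protected.popitem(last=False)
                    self.probation[old_key] = old_val
                    if len(self.probation) > self.probation_size:
                        self.probation.popitem(last=False)
                return True
            # First time we've seen this object: memory only for now.
            self.probation[key] = value
            if len(self.probation) > self.probation_size:
                self.probation.popitem(last=False)   # drop a once-off
            return False

Run a trace through access() and count how often it returns False for an
object that never shows up again: that's roughly the fraction of disk
writes this scheme would save over writing everything cachable to disk.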
> interesting project because caching is not so hot anymore, bandwidth is
> cheap in comparison to 10 years ago and the big thing today is P2P, so
> it is probably hard to find a sponsor with good money. The most wanted
> features are proxying and acls, not the cache, so I guess even if there
> are geeks like us who simply like the challenge of getting a bit more
> out of it, most people do not know what this is about and do not feel
> nor see the difference between ufs and coss or whatever. To be
> realistic, I understand that nobody cares about diskd, as nobody really
> cares about coss, because it would only be for you or for me and some
> others; and so Henrik works on aufs because he likes it, but in the end
> it is also only for him and some others. And this sum of "some" does
> not have money to put into coss/aufs/diskd. And probably it is not
> worth it: when the principal users have an 8Mb/s ADSL for 40 bucks, why
> should they spend money on squid's fs development?

A few reasons:

* I want to do P2P caching; who wants to pony up the money for open
  source P2P caching, and why haven't any of the universities done it
  yet?

* Bandwidth is still not free. If Squid can save you 30% of your HTTP
  traffic and your HTTP traffic is (say) 50% of 100mbit, that's 30% of
  50mbit, so around 15mbit. That 15mbit might cost you $500 a month in
  America, sure, but over a year? Commodity hardware -can- and -will-
  effectively cache 100mbit for under a couple of thousand dollars with
  the right software, and if done right the administration will be low
  to nonexistent. You'd pay for the cache inside 6 months without having
  to try; if the lifespan is longer than 12 months then it's free money.

* Now take the above to where it's $300 a megabit (Australia), or even
  more in developing nations ..

* .. or, how about offices which want to provide access control and
  filtering of their 10-100mbit internet link ..

* .. etc.

There are plenty of examples where web caching is still important - and
I'm only touching on forward caching; let's not even talk about how
important reverse caching is in content delivery these days! - and
there's still a lot of room for Squid to grow. Doubling the request rate
and tripling the disk throughput for small objects is entirely within
reason, even on just one CPU. Don't get me started on the possibilities
of media/P2P caching on something large like a Sun Thumper, with 20-plus
SATA disks in a box.

Would you like Squid to handle 100mbit+ of HTTP traffic on a desktop PC
with a couple of SATA disks? Would you like Squid to handle 500-800mbit
of HTTP traffic on a ~$5k server with some SAS disks? This stuff is
possible on today's hardware. We know how to do it; it's just a question
of writing the right software.

Adrian