The disk access method has a huge influence on throughput. Plenty of papers have been written over the last ten years on disk access patterns, and a number of them are specific to web caching. It's well known that using the unix filesystem in a one-file-per-object fashion is generally inefficient - there's more than one disk operation per create/write, open/read and unlink. If a web cache is populated with a 'normal' web cache distribution then you'll find the majority of cache objects (~95% in my live caches at work) are under 64k in size. Many (~50% I think, I'd have to go through my notes) are under 32k.

So it boils down to a few things:

* arranging the disk writes in a way that cuts back on the amount of seeking during writes
* arranging the disk writes in a way that cuts back on the amount of seeking during later reads
* handling replacement policies efficiently - e.g. you don't want high levels of fragmentation building up over time, as that can hurt your ability to batch disk writes
* disk throughput is a function of how you lay out the writes and how you queue the reads - and disks are smokingly fast when you're able to do big reads and writes with minimal seeking

Now, the Squid UFS method of laying things out is inefficient because:

* read/write/unlink operations involve more than one disk IO in some cases
* modern UNIX filesystems have a habit of synchronously journalling metadata, which also slows things down unless you're careful (BSD softupdates UFS doesn't, as a specific counter-example)
* there's no way to optimise disk read/write patterns by influencing the on-disk layout - for example, UNIX filesystems tend to 'group' files in a directory close together on disk (the same cylinder group, in the case of BSD FFS), but Squid doesn't put files from the same site - or even files fetched by the same client at a given time - in the same directory
* a site's objects are split up between disks, which can hurt the scheduling of reads for hits (if you look at a web page with 40 objects spread across 5 disks, that one client issues disk requests to all /five/ disks, rather than the more optimal approach of storing those objects sequentially on one disk and reading them all at once)
* it all boils down to too much disk seeking!

Now, just as a random data point: I'm able to pull 3 megabytes a second of random-read hits (~200 hits a second) from a single COSS disk. The disk isn't running anywhere near capacity even with this inefficient read pattern, and it's a SATA disk with no tagged queueing.

The main problem with COSS (besides the bugs :) is that the write rate is a function of both the data you're storing from server replies (cachable data) and the hits, which result in objects being relocated. A higher request rate means a higher write rate (storing fetched objects on disk), and a higher hit rate also means a higher write rate (storing both fetched and relocated objects on disk).

This one-disk system smokes a similar AUFS/DISKD setup on XFS and ext3. No, I don't have exact figures - I'm doing this for fun rather than as a graduate/honours project with a paper in mind - but even Duane's COSS polygraph results from a few years ago show COSS is quite noticeably faster than AUFS/DISKD. The papers I read from 1998-2002 were talking about pulling random reads off disk at a rate of ~500 objects a second (each under 64k in size). That's per disk. In 1998. :)
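To make the batching idea concrete, here's a rough sketch - not Squid's actual COSS code, just the shape of the idea, with made-up names like stripe_append() and STRIPE_SIZE - of packing small objects into one large in-memory stripe and pushing the whole thing to disk with a single sequential write:

#include <string.h>        /* memcpy */
#include <sys/types.h>     /* off_t */
#include <unistd.h>        /* pwrite */

#define STRIPE_SIZE (1 << 20)   /* a 1MB stripe holds dozens of <64k objects */

struct stripe {
    int    fd;              /* one big pre-allocated cache file or partition */
    off_t  disk_off;        /* where the current stripe will land on disk */
    size_t used;            /* bytes filled so far */
    char   buf[STRIPE_SIZE];
};

/* Append one object into the in-memory stripe; return its eventual
 * on-disk offset (to remember in the in-core index), or -1 if it
 * doesn't fit and the caller should flush first. */
static off_t
stripe_append(struct stripe *s, const void *obj, size_t len)
{
    off_t where;

    if (s->used + len > STRIPE_SIZE)
        return -1;
    memcpy(s->buf + s->used, obj, len);
    where = s->disk_off + (off_t) s->used;
    s->used += len;
    return where;
}

/* One large sequential write replaces hundreds of small scattered
 * create/write/close operations - this is where the seeks go away. */
static int
stripe_flush(struct stripe *s)
{
    if (pwrite(s->fd, s->buf, s->used, s->disk_off) != (ssize_t) s->used)
        return -1;
    s->disk_off += (off_t) s->used;   /* the log just moves forward */
    s->used = 0;
    return 0;
}

The point is that the per-object cost collapses to a memcpy(), and the disk only ever sees big sequential writes.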
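The hit path - again just a sketch under the same made-up structures, not the real code - shows where the extra write traffic comes from: a hit is served with one contiguous pread(), but the object is also copied back into the stripe currently being assembled (relocated), so every hit turns into a future write as well as a read:

#include <string.h>        /* memcpy */
#include <sys/types.h>     /* off_t, ssize_t */
#include <unistd.h>        /* pread */

struct obj_index {          /* in-core index entry for one cached object */
    off_t  offset;          /* where the object currently lives on disk */
    size_t length;
};

/* Serve a hit with one contiguous pread() - no open/read/close/unlink
 * dance - then relocate the object into the stripe being assembled
 * (the caller has already checked the stripe has room, as in the
 * previous sketch) so hot objects stay near the head of the log. */
static ssize_t
read_and_relocate(int fd, struct obj_index *e, void *buf,
                  char *stripe_buf, size_t *stripe_used, off_t stripe_disk_off)
{
    ssize_t n = pread(fd, buf, e->length, e->offset);

    if (n != (ssize_t) e->length)
        return -1;
    memcpy(stripe_buf + *stripe_used, buf, e->length);
    e->offset = stripe_disk_off + (off_t) *stripe_used;  /* its new home */
    *stripe_used += e->length;      /* written out on the next flush */
    return n;
}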
So, it's getting done in my spare time. And it'll turn a Squid server into something comparable to the commercial caches from 2001 :) (i.e. ~2400 req/sec on the polygraph workloads, with whatever hit rate was offered being closely matched.) I can only imagine what they're able to achieve today with such tightly-optimised codebases.

Adrian

On Tue, Jul 11, 2006, H wrote:
> Hi
>
> I am not so sure the particular data access method is what makes the
> difference. Most real cases are bound by disk or other hardware limitations.
> Even though they are often discussed, IDE/ATA disks do not come close to SCSI
> disk throughput in multi-user environments. Standard PCs often have exactly
> the 2-5MB/s limit Rick mentions, and you can do what you want - there is
> nothing more to get. I believe that squid, when it reaches that limit, simply
> stops caching and goes direct, which means the cache server ends up running
> uselessly at the edge, not caching.
>
> With good hardware, not necessarily server motherboards, you can get the
> 30MB/s you mention, but I am not sure how much of that 30MB/s is cache data -
> do you get 5% or less from disk?
>
> We have some high-bandwidth networks where we use squid on the main server as
> a non-caching server, with several parents where the cache-to-disk work is
> done. The main server seems to be bound only by the OS packets-per-second
> limit (no disk access) and we get up to 90MB/s through it. The parent caches
> are queried by content type or object size. Of course the connection between
> these servers is gigabit full duplex. This way we get up to 20% less bandwidth
> utilization. A while ago we got up to 40%, but since emule and other p2p
> applications became popular, things are not so good anymore.
>
> What we use are FreeBSD 6.1-STABLE servers with squid14 as a transparent
> proxy, on AMD64 dual-Opterons for the main servers and AMD64 X2 machines for
> the parent caches, all with SCSI-320 and lots of good memory - 16GB and up on
> the main server and 4GB on the parents. The best experience and performance
> on standard hardware I got with an Epox motherboard and an AMD X2 4400 or
> 4800. I run more than one squid process on each SMP server.
>
> Hans