On 5/08/2013 11:14 p.m., babajaga wrote:
Sorry, Amos, not to waste too much time here for an off-topic issue, but
interesting matter anyways:
Okay. I am running out of time, and the info I'm basing all this on
is slightly old - so shall we finish up? Measurements and testing are
kind of required to go further and demonstrate anything.
Disclaimer: some of what I "know" and say below may be complete FUD with
modern disks. I have not done any testing since 10-20GB was a widely
available storage device size and SSD layers on drives had not even been
invented. Shop-talk with people doing testing more recently though tells
me that the basics are probably still completely valid even if the
tricks added to solve problems are changing rapidly.
The key take-away should be that Squid's disk I/O pattern for small
objects blows most of those new tricks into uselessness.
I ACK your remarks regarding disk controller activity. But, AFAIK, squid
does NOT directly access the disk controller for raw disk I/O, the FS is
always in-between instead. And that means that a (lot of) buffering
can occur before real disk I/O is done.
This depends on two factors:
1) there is RAM available for the buffering required.
-> The higher the traffic load the less memory is available to the
system for this.
2) The OS has a chance of advance buffering.
-> Objects up to 64KB (often 4KB or 8KB) can be completely loaded
into Squid I/O buffers in a single read(), and there is no way for the
OS to identify which of the surrounding sectors/blocks are related
objects to the one just loaded (if it guesses and gets it wrong things
go even worse than not guessing at all).
-> Also, remember AUFS is preferred for large (over-32KB) objects -
the ones which will require multiple read()'s - and Rock best for small
(under-32KB) objects. This OS buffering prediction is a significant part
of the reason why.
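In squid.conf terms that split looks something like this (the paths
and sizes are only illustrative; the 32KB boundary is the one
described above):

  # small objects to Rock, large ones to AUFS (example values only)
  cache_dir rock /cache/rock 8192 max-size=32768
  cache_dir aufs /cache/aufs 32768 16 256 min-size=32769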
Which might even lead to spuriously high response times when, all of
a sudden, the OS decides to really flush large disk buffers to disk.
Note that this will result in a bursty disk I/O traffic pattern, with
waves of alternating high and low disk access speeds. The aim for high
performance is to flatten the low-speed troughs out as much as
possible, raising them up to make a constant peak rate of I/O.
In a good file system (or disk controller, downstream), request
reordering should happen to allow elevator-style head movements, or
merging of file accesses referencing the same disk blocks.
Exactly. And this is where Squid being partially *network* I/O event
driven comes into play, affecting the disk I/O pattern. Squid is
managing N concurrent connections, each of which is potentially
servicing a distinct *unique* client file fetch (well, mostly; when
collapsed forwarding is ready for Squid-3 it will be unique). Every
I/O loop Squid cycles through all N in order and schedules a
cross-sectional slice for any that need a disk read/write. So each I/O
cycle Squid delivers at most one read (HIT/MISS sent to client) and
one write (MISS received from server) for any given file, with up to N
possibly vastly separate files on disk being accessed.
The logic doing that elevator calculation is therefore *not* faced
with a single set of file operations in one area, but with a
cross-sectional read/write over potentially the entire disk. At most
it can reorder those into an elevator up/down cross-section over the
disk. But passing those completion events back to Squid triggers
another I/O cycle for Squid over the network sockets, and thus another
sweep over the entire disk space. Worst-case (and best-case) the
spindle heads are sweeping the platter from end to end reading
everything needed 1-cycle:1-sweep.
That is with _one_ cache_dir sitting on the spindle.
Now if you pay close attention to the elevator sweep, there is a lot
of time spent seeking between areas of the disk and not so much doing
I/O. To optimize around this effect and allow even more concurrent
file reads, Squid load-balances where it places files between
cache_dirs. AFAIK the theory is that one head can be seeking while
another is doing its I/O, for an overall effect of a steadier flow of
bytes back to Squid after the FS software abstraction layer, raising
those troughs again to a smooth flow. Although, that said, "theory" is
not practice.
Place both cache_dir on the one disk and the FS logic will of course
reorder and interleave the I/O for each cache_dir such that the disk
behaviour is a single sweep, as for one cache_dir. BUT, as a result,
the seek lag and the bursty nature of the read() bytes returned are
fully exposed to Squid - by the very mechanisms supposedly minimizing
them. In turn this is reflected in the network I/O, as bytes are
relayed directly there by Squid and TCP gets a bursty peak/trough
pattern.
Additionally, and probably more importantly, that reordering of 2
cache_dir on one disk spindle down to the behaviour of 1 cache_dir
caps the I/O limit for *both* of those cache_dir at the disk's I/O
threshold (after optimization), whereas having them on separate
spindles would allow each to have that full capacity, effectively
doubling the total disk I/O threshold (after optimization).
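As a configuration sketch (assuming /disk1 and /disk2 are mount
points on two separate physical spindles - that separation being the
whole point; sizes are only examples):

  # one cache_dir per physical disk, not two on the same disk
  cache_dir aufs /disk1/squid 32768 16 256
  cache_dir aufs /disk2/squid 32768 16 256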
Why we say Rock can share a disk with UFS/AUFS/diskd is that the I/O
block size being requested is larger, so there are fewer disk sweep
movements even if many files/blocks are being loaded concurrently. So
loading a few hundred objects in one Rock block of I/O, most of which
will then get memory-HIT speeds, is just as efficient as loading _one_
more file out of the UFS/AUFS/diskd cache_dir.
And all this should happen after Squid's activities are completed, but before
the real disk driver/controller starts its work.
BTW, I did some private measurements, not of response times with
various types of cache_dirs, but of response times/disk throughput
with various FS and options thereof. And found that a "crippled" ext4
works best for me. Default journaling etc. in ext4 has a definite hit
on disk I/O. Giving up some safety features has a drastic positive
influence. Should be valid for all types of cache_dirs, though.
I hazard a guess that if you go through them those "some" will all be
features which involve doing some form of additional read/write to the
disk for each chunk of written bytes. Yes?
Things such as file access timestamping, journal recording, checksum
writing, checksum validation post-write, dedup block registration, RAID
protection, etc.
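A couple of those map onto ext4 mount options roughly like this (an
illustrative guess at the kind of "crippling" described; the device,
mount point and exact option set are assumptions, and dropping
barriers trades crash safety for speed):

  # /etc/fstab - hypothetical cache filesystem entry
  # noatime,nodiratime : no access-time writes on every read
  # data=writeback     : journal metadata only, not file data
  # nobarrier          : skip write barriers (unsafe on power loss)
  # commit=60          : flush the journal less often (default 5s)
  /dev/sdb1 /cache1 ext4 noatime,nodiratime,data=writeback,nobarrier,commit=60 0 0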
The logic behind that guess:
As mentioned above, the I/O presented by Squid will already be sliced
across the network I/O streams and just needs reordering into the
"elevator sweep" of quite a large number of base operations. Adding a
second sweep to perform all the follow-up operations, OR causing the
elevator to jump slightly forward/back to do them mid-sweep (I *hope*
no disks do this anymore), will only harm the presented I/O sweep and
delay the point at which its completion can be notified to Squid.
Worst-case this halves the I/O limit the disk can provide to the FS
layer, let alone Squid. I imagine that worst-case is rare, but some
"drastic" amount of difference is fully expected.
Amos