Dear everyone, first of all apologies for asking such a general question on this fairly focused and productive mailing list... any references to other places or previous posts are welcome :-)

Recently I've been getting requests for help with optimizing the "storage back end" on Linux-based servers run by various people I come in contact with, and I'm starting to see a pattern in those requests: typically the box in question runs some web-based application, but essentially the traffic consists of transferring big files back and forth. Someone uploads a file, and a number of other people later download it. So it must be pretty similar to a busy master distribution site of some Linux distro, or even www.kernel.org :-) A possible difference is that the capacities served in my case are accessed more or less evenly and amount to somewhere between a few TB and tens of TB per machine. The proportion of writes in the total traffic is 20% or less, sometimes much less (under 1%). The key point is that many (up to several hundred) server threads are reading (and writing) the files in parallel, in relatively tiny snippets.

I've read before that the load presented by such multiple parallel sessions is "bad" and difficult to handle. Yet I believe that the sequential nature of the individual files does suggest some basic ideas for optimization:

1) Try to get the big individual files allocated in large contiguous chunks on the disk, i.e. prevent fragmentation. That way an individual file can be read "sequentially" = at the maximum transfer rate in MBps. Clearly this boils down to the choice of FS type and FS tweaking, and possibly some application-level optimizations, if such can be implemented (grow the files in big chunks - a minimal sketch of this follows right after the problem summary below).

2) Try to optimize the granularity of reads for maximum MBps throughput. Given the behavior of common disk drives, an optimum sequential transfer size is about 4 MB. That should correspond to the FS allocation unit size (whatever the precise name - cluster, block group etc.) and to the RAID stripe size, if the block device really is a striped RAID device. Next, the runtime OS behavior (read-ahead size) should be set to this value, at least in theory. And for optimum performance, chunk boundaries at the various layers/levels should be aligned. This approach, based on heavy read-ahead, requires lots of free RAM in the VM, but given the typical per-thread data flow vs. the number of threads, I guess it is essentially viable.

3) Try to optimize writes for a bigger transaction size. In Linux this takes some tweaking of the VM dirty ratios and the (deadline) IO scheduler timeouts, but ultimately it's perhaps the only bit that works somewhat well. Unfortunately, given the relatively small proportion of writes, this optimization has only a small effect on the overall volume of traffic. It may come in useful if you use RAID levels with calculated parity (typically RAID 5 or 6), which reduce the IOps available from the spindles when writing, roughly by the number of spindles in a stripe set...

Problems: the FS on-disk allocation granularity and the available RAID stripe sizes are typically much smaller than 4 MB. Especially in HW RAID controllers, the maximum stripe size is typically limited to maybe 128 kB, which means a waste of valuable IOps if you use the RAID to store large sequential files.
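To make point 1) a bit more concrete before moving on to the workarounds: below is a minimal sketch of what "grow the files in big chunks" could look like on the application side (the 1 GB size and the file name are made up for illustration). The idea is simply to reserve the target file's blocks up front with posix_fallocate64(), which allocates extents without writing any data and so gives the filesystem the best chance to keep the file contiguous.

/* Sketch only: preallocate an upload target in one big request so the
 * filesystem can hand out large contiguous extents before any data is
 * streamed in.  The size and path are purely illustrative. */
#define _LARGEFILE64_SOURCE
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = (argc > 1) ? argv[1] : "upload.bin";
	const off64_t final_size = 1024LL * 1024 * 1024;	/* say, 1 GB */
	int fd, err;

	fd = open(path, O_CREAT | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Allocates blocks only, writes no data; note it returns an errno
	 * value directly instead of setting errno. */
	err = posix_fallocate64(fd, 0, final_size);
	if (err) {
		fprintf(stderr, "posix_fallocate64: %s\n", strerror(err));
		return 1;
	}

	/* A real application would now stream the upload into fd (ideally
	 * in multi-megabyte writes) and, if the upload turns out shorter
	 * than expected, ftruncate() the file back to its actual length. */
	close(fd);
	return 0;
}

Whether this actually yields contiguous extents of course still depends on the filesystem's allocator and on how fragmented the free space already is.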
A simple solution to the stripe-size limit, at least for testing purposes, is to use the native Linux software MD RAID (level 0), as this RAID implementation accepts big chunk sizes without a problem (I've just tested 4 MB), and to stripe together several stand-alone mirror volumes presented by a hardware RAID. It can be seen as a waste of the HW RAID's acceleration unit, but the resulting hybrid RAID 10 works very well for a basic demonstration of the other issues. There are no outright configuration bottlenecks in such a setup: the bottom-level mirrors don't have a fixed stripe size, and RAID 10 doesn't suffer from the RAID 5/6 "parity poison".

It is especially the intended read-ahead optimization that seems troublesome / ineffective. I've tried with XFS, which is really the only FS eligible for volume sizes over 16 TB, and which is also said to be well optimized for sequential data, aiming at contiguous allocation and all that - everybody's using XFS in such applications. I don't understand all the tweakable knobs of mkfs.xfs, certainly not well enough to match the 4 MB RAID chunk size somewhere in the internal structure of XFS.

Another problem is that there seems to be a single tweakable read-ahead knob in Linux 2.6, accessible in several ways:

  /sys/block/<dev>/queue/read_ahead_kb
  /sbin/blockdev --setra
  /sbin/blockdev --setfra

When speaking about read-ahead optimization, about reading big contiguous chunks of data, I intrinsically mean per-file read-ahead at the "filesystem payload level". And the key trouble seems to be that the Linux read-ahead takes place at the block device level. You ask for some piece of data, and your request gets rounded up to 128 kB at the block level (and aligned to integer blocks of that size, it would seem) - 128 kB being the default size. As a result, interestingly, if you finish off all the aforementioned optimizations by setting this read-ahead to 4096 kB, you don't get higher throughput. You do get increased MBps at the block device level (as seen in iostat), but your throughput at the level of open files actually drops, and the number of threads "blocked in iowait" grows.

My explanation for this "read-ahead misbehavior" is that the data is not entirely contiguous on the disk - that the filesystem metadata introduces some "out of sequence" reads, which result in reading ahead 4 MB chunks of metadata and irrelevant disk space (= useless junk). Or, possibly, that the file allocation on disk is not all that contiguous. Essentially, the huge read-ahead somehow overlaps much too often into irrelevant space. I've even tried to counter this effect "in vitro" by preparing a filesystem "tiled" with 1 GB files that I created in sequence (one by one) by calling posix_fallocate64() on a freshly opened file descriptor... but the reading threads then made the read-ahead misbehave in precisely that way.

It would be excellent if the read-ahead could happen at the "filesystem payload level" and map somehow optimally onto block-level traffic patterns. Yet the --setfra (and --getfra) parameters of the "blockdev" util in Linux 2.6 seem to map to the block-level read-ahead size. Is there any other tweakable knob that I'm missing? Based on the manpages of the madvise() and fadvise() functions, I'd say that the level of read-ahead corresponding to MADV_SEQUENTIAL and FADV_SEQUENTIAL is still orders of magnitude below the desired figure. Besides, those are syscalls that need to be called by the application on an open file handle - it may or may not be easy to implement them selectively enough in your app.
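Just to illustrate that last point - if the application can be touched at all, a per-file variant of the 4 MB read-ahead could look roughly like the sketch below (the 4 MB window, 256 kB read size and file name are made up). Instead of relying on the block-level read-ahead, the reading thread asks for the next window of its own file with posix_fadvise(POSIX_FADV_WILLNEED); that prefetch goes through the file's own block mapping, so it only pulls in pages of that file and cannot stray into metadata or unrelated disk space.

/* Sketch only: per-file read-ahead done by the reading thread itself. */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64	/* so off_t copes with files over 2 GB */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define RA_WINDOW (4L * 1024 * 1024)	/* per-file read-ahead window: 4 MB */
#define READ_SIZE (256 * 1024)		/* whatever one server thread reads at once */

int main(int argc, char **argv)
{
	const char *path = (argc > 1) ? argv[1] : "bigfile.bin";
	char *buf;
	off_t pos = 0;		/* current read position */
	off_t prefetched = 0;	/* end of the region already asked for */
	ssize_t n;

	int fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Hint that this file will be read sequentially... */
	posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

	buf = malloc(READ_SIZE);

	for (;;) {
		/* ...and explicitly prefetch roughly one window ahead of the
		 * reader.  WILLNEED starts a non-blocking read of that range
		 * of *this file* into the page cache and returns at once. */
		if (pos + RA_WINDOW > prefetched) {
			posix_fadvise(fd, prefetched, RA_WINDOW, POSIX_FADV_WILLNEED);
			prefetched += RA_WINDOW;
		}

		n = pread(fd, buf, READ_SIZE, pos);
		if (n <= 0)
			break;
		pos += n;
		/* ... hand the buffer to the client socket here ... */
	}

	free(buf);
	close(fd);
	return 0;
}

The Linux-specific readahead(2) syscall covers similar ground. Of course none of this makes the FADV_SEQUENTIAL window itself any bigger, and it still has to be wired into every application that matters.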
Consider a monster like Apache with PHP and APR bucket brigades at the back end... who's got the wits to implement the "advise" syscalls in those tarballs? It would help if you could set this per mountpoint, via some sysfs variable or an ioctl() call from some small stand-alone util.

I've dumped all the stuff I could come up with on a web page:

  http://www.fccps.cz/download/adv/frr/hdd/hdd.html

It contains some primitive load generators that I'm using.

Do you have any further tips on that? I'd love to hope that I'm just missing something simple... Any ideas are welcome :-)

Frank Rysanek
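P.S. For applications that cannot realistically be patched (the Apache + PHP case above), it occurs to me that an LD_PRELOAD shim issuing the advice on the application's behalf might be a workable, if ugly, stopgap. A rough, untested sketch follows - it only wraps open64(); a real shim would have to cover plain open(), openat(), fopen() and friends as well, and the file name fadv_shim.c is of course arbitrary.

/* fadv_shim.c - untested sketch: issue POSIX_FADV_SEQUENTIAL on every
 * file an unmodified application opens read-only.
 * Build and use roughly like:
 *   gcc -shared -fPIC -o fadv_shim.so fadv_shim.c -ldl
 *   LD_PRELOAD=/path/to/fadv_shim.so httpd ...
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <sys/types.h>

int open64(const char *path, int flags, ...)
{
	static int (*real_open64)(const char *, int, ...);
	mode_t mode = 0;
	int fd;

	if (!real_open64)
		real_open64 = (int (*)(const char *, int, ...))
				dlsym(RTLD_NEXT, "open64");

	if (flags & O_CREAT) {
		va_list ap;
		va_start(ap, flags);
		mode = va_arg(ap, mode_t);
		va_end(ap);
	}

	fd = real_open64(path, flags, mode);

	/* advise only on files opened read-only - i.e. the downloads */
	if (fd >= 0 && (flags & O_ACCMODE) == O_RDONLY)
		posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

	return fd;
}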