Dear everyone, first of all apologies for asking such a general question on this fairly focused and productive mailing list... any references to other places or previous posts are welcome :-)

Recently I've been getting requests for help with optimizing the "storage back end" on Linux-based servers run by various people I come in contact with, and I'm starting to see a pattern in those requests: typically the box in question runs some web-based application, but essentially the traffic consists of transferring big files back and forth. Someone uploads a file, and a number of other people later download it. So it must be pretty similar to a busy master distribution site of some Linux distro, or even www.kernel.org :-) A possible difference is that the capacities served in my case are accessed more or less evenly and amount to somewhere between a few TB and tens of TB per machine. The proportion of writes in the total traffic is 20% or less, sometimes much less (under 1%). The key point is that many (up to several hundred) server threads are reading (and writing) the files in parallel, in relatively tiny snippets.

I've read before that the load presented by such multiple parallel sessions is "bad" and difficult to handle. Yet I believe that the sequential nature of the individual files does suggest some basic ideas for optimization:

1) Try to get the big individual files allocated in large contiguous chunks on the disk, i.e. prevent fragmentation. That way an individual file can be read "sequentially" = at the maximum transfer rate in MBps. Clearly this boils down to the choice of FS type and FS tweaking, and possibly some application-level optimizations, if such can be implemented (grow the files in big chunks - a minimal sketch of this follows right after the problem summary below).

2) Try to optimize the granularity of reads for maximum MBps throughput. Given the behavior of common disk drives, an optimum sequential transfer size is about 4 MB. That should correspond to the FS allocation unit size (whatever the precise name - cluster, block group etc.) and to the RAID stripe size, if the block device really is a striped RAID device. Next, the runtime OS behavior (read-ahead size) should be set to this value, at least in theory. And for optimum performance, chunk boundaries at the various layers/levels should be aligned. This approach, based on heavy read-ahead, requires lots of free RAM in the VM, but given the typical per-thread data flow vs. the number of threads, I guess it is essentially viable.

3) Try to optimize writes for a bigger transaction size. In Linux this takes some tweaking of the VM dirty ratios and the (deadline) IO scheduler timeouts, but ultimately it's perhaps the only bit that works somewhat well. Unfortunately, given the relatively small proportion of writes, this optimization has only a small effect on the overall volume of traffic. It may come in useful if you use RAID levels with calculated parity (typically RAID 5 or 6), which reduce the IOps available from the spindles when writing, roughly by the number of spindles in a stripe set...

Problems: the FS on-disk allocation granularity and the available RAID stripe sizes are typically much smaller than 4 MB. Especially in HW RAID controllers, the maximum stripe size is typically limited to maybe 128 kB, which means a waste of valuable IOps if you use the RAID to store large sequential files.
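To make point 1) a bit more concrete before moving on to the workarounds: below is a minimal sketch of what "grow the files in big chunks" could look like on the application side (the 1 GB size and the file name are made up for illustration). The idea is simply to reserve the target file's blocks up front with posix_fallocate64(), which allocates extents without writing any data and so gives the filesystem the best chance to keep the file contiguous.

/* Sketch only: preallocate an upload target in one big request so the
 * filesystem can hand out large contiguous extents before any data is
 * streamed in.  The size and path are purely illustrative. */
#define _LARGEFILE64_SOURCE
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = (argc > 1) ? argv[1] : "upload.bin";
	const off64_t final_size = 1024LL * 1024 * 1024;	/* say, 1 GB */
	int fd, err;

	fd = open(path, O_CREAT | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Allocates blocks only, writes no data; note it returns an errno
	 * value directly instead of setting errno. */
	err = posix_fallocate64(fd, 0, final_size);
	if (err) {
		fprintf(stderr, "posix_fallocate64: %s\n", strerror(err));
		return 1;
	}

	/* A real application would now stream the upload into fd (ideally
	 * in multi-megabyte writes) and, if the upload turns out shorter
	 * than expected, ftruncate() the file back to its actual length. */
	close(fd);
	return 0;
}

Whether this actually yields contiguous extents of course still depends on the filesystem's allocator and on how fragmented the free space already is.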
A simple solution to the stripe-size limit, at least for testing purposes, is to use the native Linux software MD RAID (level 0), as this RAID implementation accepts big chunk sizes without a problem (I've just tested 4 MB), and to stripe together several stand-alone mirror volumes presented by a hardware RAID. It can be seen as a waste of the HW RAID's acceleration unit, but the resulting hybrid RAID 10 works very well for a basic demonstration of the other issues. There are no outright configuration bottlenecks in such a setup: the bottom-level mirrors don't have a fixed stripe size, and RAID 10 doesn't suffer from the RAID 5/6 "parity poison".

It is especially the intended read-ahead optimization that seems troublesome / ineffective. I've tried with XFS, which is really the only FS eligible for volume sizes over 16 TB, and which is also said to be well optimized for sequential data, aiming at contiguous allocation and all that - everybody's using XFS in such applications. I don't understand all the tweakable knobs of mkfs.xfs, certainly not well enough to match the 4 MB RAID chunk size somewhere in the internal structure of XFS.

Another problem is that there seems to be a single tweakable read-ahead knob in Linux 2.6, accessible in several ways:

  /sys/block/<dev>/queue/read_ahead_kb
  /sbin/blockdev --setra
  /sbin/blockdev --setfra

When speaking about read-ahead optimization, about reading big contiguous chunks of data, I intrinsically mean per-file read-ahead at the "filesystem payload level". And the key trouble seems to be that the Linux read-ahead takes place at the block device level. You ask for some piece of data, and your request gets rounded up to 128 kB at the block level (and aligned to integer blocks of that size, it would seem) - 128 kB being the default size. As a result, interestingly, if you finish off all the aforementioned optimizations by setting this read-ahead to 4096 kB, you don't get higher throughput. You do get increased MBps at the block device level (as seen in iostat), but your throughput at the level of open files actually drops, and the number of threads "blocked in iowait" grows.

My explanation for this "read-ahead misbehavior" is that the data is not entirely contiguous on the disk - that the filesystem metadata introduces some "out of sequence" reads, which result in reading ahead 4 MB chunks of metadata and irrelevant disk space (= useless junk). Or, possibly, that the file allocation on disk is not all that contiguous. Essentially, the huge read-ahead somehow overlaps much too often into irrelevant space. I've even tried to counter this effect "in vitro" by preparing a filesystem "tiled" with 1 GB files that I created in sequence (one by one) by calling posix_fallocate64() on a freshly opened file descriptor... but the reading threads then made the read-ahead misbehave in precisely that way.

It would be excellent if the read-ahead could happen at the "filesystem payload level" and map somehow optimally onto block-level traffic patterns. Yet the --setfra (and --getfra) parameters of the "blockdev" util in Linux 2.6 seem to map to the block-level read-ahead size. Is there any other tweakable knob that I'm missing? Based on the manpages of the madvise() and fadvise() functions, I'd say that the level of read-ahead corresponding to MADV_SEQUENTIAL and FADV_SEQUENTIAL is still orders of magnitude below the desired figure. Besides, those are syscalls that need to be called by the application on an open file handle - it may or may not be easy to implement them selectively enough in your app.
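Just to illustrate that last point - if the application can be touched at all, a per-file variant of the 4 MB read-ahead could look roughly like the sketch below (the 4 MB window, 256 kB read size and file name are made up). Instead of relying on the block-level read-ahead, the reading thread asks for the next window of its own file with posix_fadvise(POSIX_FADV_WILLNEED); that prefetch goes through the file's own block mapping, so it only pulls in pages of that file and cannot stray into metadata or unrelated disk space.

/* Sketch only: per-file read-ahead done by the reading thread itself. */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64	/* so off_t copes with files over 2 GB */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define RA_WINDOW (4L * 1024 * 1024)	/* per-file read-ahead window: 4 MB */
#define READ_SIZE (256 * 1024)		/* whatever one server thread reads at once */

int main(int argc, char **argv)
{
	const char *path = (argc > 1) ? argv[1] : "bigfile.bin";
	char *buf;
	off_t pos = 0;		/* current read position */
	off_t prefetched = 0;	/* end of the region already asked for */
	ssize_t n;

	int fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Hint that this file will be read sequentially... */
	posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

	buf = malloc(READ_SIZE);

	for (;;) {
		/* ...and explicitly prefetch roughly one window ahead of the
		 * reader.  WILLNEED starts a non-blocking read of that range
		 * of *this file* into the page cache and returns at once. */
		if (pos + RA_WINDOW > prefetched) {
			posix_fadvise(fd, prefetched, RA_WINDOW, POSIX_FADV_WILLNEED);
			prefetched += RA_WINDOW;
		}

		n = pread(fd, buf, READ_SIZE, pos);
		if (n <= 0)
			break;
		pos += n;
		/* ... hand the buffer to the client socket here ... */
	}

	free(buf);
	close(fd);
	return 0;
}

The Linux-specific readahead(2) syscall covers similar ground. Of course none of this makes the FADV_SEQUENTIAL window itself any bigger, and it still has to be wired into every application that matters.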
Consider a monster like Apache with PHP and APR bucket brigades at the back end... who's got the wits to implement the "advise" syscalls in those tarballs? It would help if you could set this per mountpoint, via some sysfs variable or an ioctl() call from some small stand-alone util.

I've dumped all the stuff I could come up with on a web page:

  http://www.fccps.cz/download/adv/frr/hdd/hdd.html

It contains some primitive load generators that I'm using.

Do you have any further tips on that? I'd love to hope that I'm just missing something simple... Any ideas are welcome :-)

Frank Rysanek
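P.S. For applications that cannot realistically be patched (the Apache + PHP case above), it occurs to me that an LD_PRELOAD shim issuing the advice on the application's behalf might be a workable, if ugly, stopgap. A rough, untested sketch follows - it only wraps open64(); a real shim would have to cover plain open(), openat(), fopen() and friends as well, and the file name fadv_shim.c is of course arbitrary.

/* fadv_shim.c - untested sketch: issue POSIX_FADV_SEQUENTIAL on every
 * file an unmodified application opens read-only.
 * Build and use roughly like:
 *   gcc -shared -fPIC -o fadv_shim.so fadv_shim.c -ldl
 *   LD_PRELOAD=/path/to/fadv_shim.so httpd ...
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <sys/types.h>

int open64(const char *path, int flags, ...)
{
	static int (*real_open64)(const char *, int, ...);
	mode_t mode = 0;
	int fd;

	if (!real_open64)
		real_open64 = (int (*)(const char *, int, ...))
				dlsym(RTLD_NEXT, "open64");

	if (flags & O_CREAT) {
		va_list ap;
		va_start(ap, flags);
		mode = va_arg(ap, mode_t);
		va_end(ap);
	}

	fd = real_open64(path, flags, mode);

	/* advise only on files opened read-only - i.e. the downloads */
	if (fd >= 0 && (flags & O_ACCMODE) == O_RDONLY)
		posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

	return fd;
}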