[ ... READ10 setup on fast system, poor read rates ... ]

> I was greatly perplexed to find that I could get 250MB/s
> clean when writing, [ ... ]

It seems that this is because the intelligent host adapter would
signal writes as completed immediately, thus removing the
scheduling from the Linux block IO system, and not because the
Linux MD and block IO subsystems handle writes better than reads:
I tried the same on another system with a slow, dumb host adapter
and 4 drives, and write performance was not good there.

> Highly perplexing, and then I remembered some people reporting
> that they had to do 'blockdev --setra 65536' to get good
> streaming performance in similar circumstances, and indeed this
> applied to my case too (when set on the '/dev/md0' device).

[ ... ]

> I noticed that interrupts/s seemed exactly inversely
> proportional to read-ahead, with lots of interrupts/s for a
> small read-ahead, and few with a large read-ahead.

> * Most revealingly, when I used values of read-ahead which were
>   powers of 10, the number of blocks/s reported by 'vmstat 1'
>   was also a multiple of that power of 10.

More precisely, it seems that the throughput is an exact multiple
of the read-ahead and of the interrupts per second. For example,
on a single hard disk, reading it 32KiB at a time with a
read-ahead of 1000 512B sectors:

  soft# blockdev --getra /dev/hda; vmstat 1
  1000
  procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
   r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
   0  1      0   6304 468292   5652    0    0  1300    66  121   39  0  1 96  3  0
   0  1      0   6328 468320   5668    0    0 41000     0  422  720  0 19  0 81  0
   1  1      0   6264 468348   5688    0    0 41500     0  429  731  0 14  0 86  0
   1  1      0   6076 468556   5656    0    0 41500     0  427  731  0 15  0 85  0
   1  1      0   6012 468584   5660    0    0 41500     0  428  730  0 19  0 81  0
   0  1      0   6460 468112   5660    0    0 41016     0  433  730  0 16  0 84  0
   0  0      0   6420 468112   5696    0    0    20     0  114   23  0  0 100  0  0
   0  0      0   6420 468112   5696    0    0     0     0  104    9  0  0 100  0  0
   0  0      0   6420 468112   5696    0    0     0     0  103   11  0  0 100  0  0
   0  0      0   6420 468112   5696    0    0     0     0  104    9  0  0 100  0  0

The 'bi' column is in 1KiB blocks, not 512B sectors, and in the
steady state it is an exact multiple of 500 (the read-ahead of
1000 512B sectors is 500KiB). If one looks at the number of
''idle'' interrupts (around 100-110/s) it seems as if there are
410-415 IO interrupts per second and on each exactly 1000 512B
sectors are read. Amazing coincidence!

> * Quite astonishingly, the Linux block device subsystem does
>   not do mailboxing/queueing of IO requests, but turns the
>   read-ahead for the device into a blocking factor, and always
>   issues read requests to the device driver for a strip of N
>   blocks, where N is the read-ahead, then waits for completion
>   of each strip before issuing the next request.

Indeed, and I have noticed that the number of interrupts/s per
MiB of read rate varies considerably as one changes the
read-ahead size, so I started suspecting some really dumb logic.
But it does not seem to affect physical block devices that much,
as despite it they still seem able to issue back-to-back requests
to the host adapter, even if excessively quantized; it seems to
affect MD a lot, though.
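Incidentally, the kind of measurement behind the 'vmstat'
transcript above is easy to script; the following is only a rough
sketch, and the device name, read size and read-ahead values are
arbitrary examples rather than what I actually ran:

  # Sketch: sweep the read-ahead of one block device and record the
  # streaming read rate and the interrupt rate seen by 'vmstat'.
  DEV=/dev/md0
  for RA in 256 1024 4096 16384 65536; do
      blockdev --setra "$RA" "$DEV"
      sync; echo 3 > /proc/sys/vm/drop_caches   # start with a cold page cache
      vmstat 1 15 > "vmstat-ra-$RA.txt" &       # watch the 'bi' and 'in' columns
      dd if="$DEV" of=/dev/null bs=32k count=32768 2> "dd-ra-$RA.txt"
      wait
      echo "== read-ahead $RA sectors =="; tail -1 "dd-ra-$RA.txt"
  done

The interesting numbers are the steady-state 'bi' and 'in' columns
in each 'vmstat' log and the MB/s figure that 'dd' prints at the
end of each run.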
Another interesting detail is that while real disk devices have
queue parameters, MD devices don't:

  # ls /sys/block/{md0,hda}/queue
  ls: cannot access /sys/block/md0/queue: No such file or directory

  /sys/block/hda/queue:
  iosched  max_hw_sectors_kb  max_sectors_kb  nr_requests  read_ahead_kb  scheduler

One can still set 'blockdev --setra' on an MD device (which should
be the same as setting 'queue/read_ahead_kb' to half that value,
as the former is in 512B sectors and the latter in KiB; a quick
way to check this is sketched at the end of this message), and
that does have an effect; but then the read-ahead settings on all
the block devices in the MD array are ignored.

Another interesting detail is that in my usual setup I get 50MB/s
with a read-ahead of 64, 250MB/s with 128, and 50MB/s with 256,
which strongly suggests some "resonance" at work between
quantization factors.

So I started digging around for the obvious: some scheme for
quantizing/batching requests in the block IO subsystem, and indeed
there is one, as one can see in the details of the ''plugging''
logic:

  http://www.gelato.unsw.edu.au/lxr/source/block/ll_rw_blk.c#L1605

  «* generic_unplug_device - fire a request queue
   * @q: The &struct request_queue in question
   *
   * Description:
   *   Linux uses plugging to build bigger requests queues before letting
   *   the device have at them. If a queue is plugged, the I/O scheduler
   *   is still adding and merging requests on the queue. Once the queue
   *   gets unplugged, the request_fn defined for the queue is invoked and
   *   transfers started.»

and one can read further confessions here:

  http://www.linuxsymposium.org/proceedings/reprints/Reprint-Axboe-OLS2004.pdf

  «For the longest time, the Linux block layer has used a technique
  dubbed plugging to increase IO throughput. In its simplicity,
  plugging works sort of like the plug in your tub drain—when IO is
  queued on an initially empty queue, the queue is plugged. Only
  when someone asks for the completion of some of the queued IO is
  the plug yanked out, and io is allowed to drain from the queue.
  So instead of submitting the first immediately to the driver, the
  block layer allows a small buildup of requests. There’s nothing
  wrong with the principle of plugging, and it has been shown to
  work well for a number of workloads.»

BTW, this statement is naive advocacy for a gross impropriety: it
is solely the business of the device-specific part of the IO stack
(e.g. the host adapter driver) how logical IO requests are turned
into physical IO requests, and whether coalescing or even
splitting them is a good idea; there is amazingly little
justification for rearranging a stream of logical IO requests at
the logical IO level. Conversely, and quite properly, elevators
(another form of request stream restructuring) apply to physical
devices, not to partitions or MD devices, and one can have
different elevators on different devices (even if having them
differ across the slave devices of an MD array is in most cases
very dubious).

  «2.6 also contains some additional logic to unplug a given queue
  once it reaches the point where waiting longer doesn’t make much
  sense. So where 2.4 will always wait for an explicit unplug, 2.6
  can trigger an unplug when one of two conditions are met:

  1. The number of queued requests reach a certain limit,
     q->unplug_thresh. This is device tweakable and defaults
     to 4.»
It not only defaults to 4, it *is* 4, as it is never changed from
the default:

  $ pwd
  /usr/local/src/linux-2.6.23
  $ find * -name '*.[ch]' | xargs egrep unplug_thresh
  block/elevator.c:        if (nrq >= q->unplug_thresh)
  block/ll_rw_blk.c:       q->unplug_thresh = 4;           /* hmm */
  include/linux/blkdev.h:  int unplug_thresh;      /* After this many requests */

But more ominously, there is some (allegedly rarely triggered)
timeout on unplugging a plugged queue:

  «2. When the queue has been idle for q->unplug_delay. Also
     device tweakable, and defaults to 3 milliseconds.

  The idea is that once a certain number of requests have
  accumulated in the queue, it doesn’t make much sense to continue
  waiting for more—there is already an adequate number available
  to keep the disk happy. The time limit is really a last resort,
  and should rarely trigger in real life. Observations on various
  work loads have verified this. More than a handful or two timer
  unplugs per minute usually indicates a kernel bug.»

So I had a look at how the MD subsystem handles unplugging,
because of a terrible suspicion that it does two-level unplugging,
and guess what:

  http://www.gelato.unsw.edu.au/lxr/source/drivers/md/raid10.c#L599

  static void raid10_unplug(struct request_queue *q)
  {
          mddev_t *mddev = q->queuedata;

          unplug_slaves(q->queuedata);
          md_wakeup_thread(mddev->thread);
  }

Can some MD developer justify the lines above? Can some MD
developer also explain why MD should engage in double-level
request queueing/unplugging at both the MD and slave levels? Can
some MD developer then give some very good reason why the MD layer
should be subject to plugging *at all*?

This, before I spend a bit of time doing some 'blktrace' work to
see how unplugging "helps" MD, and perhaps setting 'unplug_thresh'
globally to 1 "just for fun" :-).
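As for that promised 'blktrace' experiment, something along these
lines ought to be enough as a starting point. It is only a sketch:
it assumes debugfs is mounted and that 'blktrace' can attach to
the MD device as well as to a slave, and the device names are just
examples:

  # Sketch: trace the MD device and one slave for 30s while a
  # streaming read runs, then count plug/unplug events per device.
  # In 'blkparse' output the action column shows P for a plug, U for
  # an unplug, and T (or UT) for an unplug forced by the timer.
  mount -t debugfs none /sys/kernel/debug 2>/dev/null
  blktrace -w 30 -d /dev/md0 -d /dev/sda &
  dd if=/dev/md0 of=/dev/null bs=32k count=32768
  wait

  for d in md0 sda; do
      echo "== $d =="
      blkparse -i "$d" \
        | awk '$6 ~ /^(P|U|UT|T)$/ { n[$6]++ } END { for (a in n) print a, n[a] }'
  done

A large number of timer unplugs on the MD device would be rather
telling, given the paper's claim that more than a handful or two
timer unplugs per minute usually indicates a kernel bug.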
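Finally, going back to the read-ahead knobs mentioned near the
top: the 2:1 relation between 'blockdev --setra' (512B sectors)
and 'queue/read_ahead_kb' (KiB) is easy to check on a device that
does have a 'queue/' directory. Again this is just a sketch, with
'/dev/hda' and '/dev/md0' as the example devices:

  # '--setra'/'--getra' use 512B sectors, 'read_ahead_kb' is in KiB,
  # so the sysfs value should be half of what '--getra' reports.
  blockdev --getra /dev/hda
  cat /sys/block/hda/queue/read_ahead_kb    # expect half of the above

  blockdev --setra 65536 /dev/hda           # 65536 sectors = 32768 KiB
  cat /sys/block/hda/queue/read_ahead_kb    # should now read 32768

  # The MD device accepts '--setra' too, but exposes no 'queue/' at all:
  blockdev --setra 65536 /dev/md0
  blockdev --getra /dev/md0
  ls /sys/block/md0/queue                   # fails on this kernel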