On 11/08/2009 01:42 AM, Beolach wrote:
> On Sat, Nov 7, 2009 at 11:35, Doug Ledford <dledford@xxxxxxxxxx> wrote:
>> On 11/04/2009 01:40 PM, Leslie Rhorer wrote:
>>> I would recommend a larger chunk size.  I'm using 256K, and even
>>> 512K or 1024K probably would not be excessive.
>>
>> OK, I've got some data that I'm not quite ready to send out yet, but
>> it maps out the relationship between max_sectors_kb (the largest
>> request size a disk can process, which varies based upon the scsi
>> host adapter in question, but for SATA adapters is capped at and
>> defaults to 512KB max per request) and chunk size for a raid0 array
>> across 4 disks or 5 disks (I could run other array sizes too, and
>> that's part of what I'm waiting on before sending the data out).  The
>> point here being that a raid0 array will show more of the md/lower
>> layer block device interactions, whereas raid5/6 would muddy the
>> waters with other stuff.  The results of the tests I ran were pretty
>> conclusive that the sweet spot for chunk size is when chunk size ==
>> max_sectors_kb, and since SATA is the predominant thing today and it
>> defaults to 512K, that gives a 512K chunk as the sweet spot.  Given
>> that the chunk size is generally about optimizing block device
>> operations at the command/queue level, it should transfer directly to
>> raid5/6 as well.
>
> This only really applies for large sequential io loads, right?  I seem
> to recall smaller chunk sizes being more effective for smaller random
> io loads.

Actually, no.  Small chunk sizes don't help with truly random I/O.
Assuming the I/O is truly random, your layout doesn't really matter:
no matter how you lay things out, you're still going to get random I/O
to each drive.  The only real reason to use small chunk sizes in the
past (and this reason is no longer true today) was to stream I/O across
all the platters simultaneously on even modest-size sequential I/O, in
order to get the speed of all drives combined as your maximum I/O
speed.  That has always hurt random I/O performance, and it still does.
But back in the day, when disks only did 5 to 10MB/s of throughput and
the computer could do hundreds, it made a certain amount of sense.  Now
we can do hundreds per disk, and it doesn't.

Since the sequential performance of even a single disk is probably good
enough in most cases, it's far preferable to optimize your array for
random I/O, which means optimizing for seeks.  In any given array, your
maximum number of operations is equal to the maximum number of seeks
that can be performed (with small random I/O you generally don't
saturate the bandwidth).  So every time a command spans a chunk
boundary from one disk to the next, that single command consumes one of
the possible seeks on both disks.  To optimize for seeks, you need to
reduce the number of seeks per command as much as possible, and that
means at least attempting to keep each read/write, whether random or
sequential, on a single disk for as long as possible.  This gives the
highest probability that any given command will complete while
accessing only a single disk, and if a command completes while touching
only a single disk, all the other disks in the array are free to
complete other commands at the same time.  So in an array that's
optimal for random I/O, you want each disk handling a complete command
at a time, so that your disks are effectively running in parallel.
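To put a rough number on how often a command ends up spanning two
drives, here's a quick back-of-the-envelope sketch (my illustration,
not measured data; the 64K request size and the list of chunk sizes are
just assumptions) that estimates the fraction of randomly placed
requests that cross a chunk boundary and therefore consume a seek on
two disks instead of one:

#!/usr/bin/env python3
# Estimate: what fraction of randomly placed requests of a given size
# cross a chunk boundary and so tie up two drives instead of one?
import random

KIB = 1024
REQUEST_SIZE = 64 * KIB                        # assumed small random I/O size
CHUNK_SIZES_KIB = [64, 128, 256, 512, 1024]    # chunk sizes to compare
TRIALS = 200_000

for chunk_kib in CHUNK_SIZES_KIB:
    chunk = chunk_kib * KIB
    crossings = 0
    for _ in range(TRIALS):
        start = random.randrange(0, chunk, 512)   # sector-aligned offset
        if start + REQUEST_SIZE > chunk:          # spills onto the next drive
            crossings += 1
    print(f"chunk {chunk_kib:5}K: {100 * crossings / TRIALS:5.1f}% "
          f"of requests touch two drives")

With a 64K chunk essentially every 64K request lands on two drives,
while at 512K it's roughly one in eight, which is the parallelism win
described above.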
When commands regularly span chunks onto other disks, you gain speed on
a specific command at the expense of consuming multiple seeks.  Past
testing has shown that this effect will produce increased performance
under random I/O even with chunk sizes going up to 4096k.  However, we
reached a point of diminishing returns somewhere around 256k: it seems
that as soon as you reach a chunk size equal (or roughly equal) to the
max command size for a drive, going higher doesn't buy you much (less
than 1% improvement for huge jumps in chunk size).  We didn't fully
analyse it, but I would venture to guess that the maximum command size
is large enough that the read-ahead code in the filesystem can grab up
to one command's worth of data in a short enough period for the disk to
treat it as sequential, but then takes long enough before grabbing the
next sequential chunk that intervening reads/writes insert a seek
between the two, making those two sequential operations effectively
random with respect to each other (speaking of one large sequential I/O
in the middle of a bunch of random I/O).
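If you want to see what max command size your own drives report,
something along these lines (a rough sketch; it assumes Linux sysfs
paths and uses example device names, and the mdadm invocation at the
end is only illustrative) reads max_sectors_kb for each member and
matches the chunk size to the most restrictive one:

#!/usr/bin/env python3
# Read the block layer's max request size for each member drive and
# suggest a matching md chunk size.  Device names below are examples;
# adjust for your own system.
from pathlib import Path

def max_sectors_kb(dev):
    # Largest request size (in KiB) the kernel will issue to this device.
    return int(Path(f"/sys/block/{dev}/queue/max_sectors_kb").read_text())

members = ["sda", "sdb", "sdc", "sdd"]          # example member drives
sizes = {dev: max_sectors_kb(dev) for dev in members}
chunk = min(sizes.values())                     # match the most restrictive
for dev, kb in sizes.items():
    print(f"{dev}: max_sectors_kb = {kb}")
print(f"suggested chunk size: {chunk}K "
      f"(e.g. mdadm --create --chunk={chunk} ...)")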
--
Doug Ledford <dledford@xxxxxxxxxx>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband