On Mon, 19 Dec 2011 16:56:16 -0700 Chris Worley <worleys@xxxxxxxxx> wrote:

> On Mon, Dec 19, 2011 at 4:24 PM, NeilBrown <neilb@xxxxxxx> wrote:
> > On Mon, 19 Dec 2011 15:43:13 -0700 Chris Worley <worleys@xxxxxxxxx> wrote:
> >
> >> It doesn't really matter what chunk sizes I set, but, for example, I
> >> create three RAID5s of 5 drives each with a chunk size of 32K, and
> >> create a RAID0 comprised of the three RAID5s with a chunk size of
> >> 64K:
> >>
> >> md0 : active raid0 md27[2] md26[1] md25[0]
> >>       1885098048 blocks super 1.2 64k chunks
> >>
> >> If I write to one of the RAID5s, using:
> >>
> >> # dd of=/dev/md27 if=/dev/zero bs=1024k oflag=direct
> >>
> >> ... then "iostat -dmx 2" shows the drives being written to in 32K
> >> chunks (avgrq-sz=64), as you'd expect.
> >>
> >> But writing to the RAID0 that's striping the RAID5s shows
> >> everything being written in 4KB chunks (iostat shows avgrq-sz=8) to
> >> the RAID0 as well as to the RAID5s.
> >
> > When writing to a RAID5 it *always* submits requests to the lower
> > layers in PAGE-sized units.  This makes it much easier to keep parity
> > and data aligned.
> >
> > The queue on the underlying device should sort the requests and group
> > them together, and your evidence suggests that it does.
> >
> > When writing to the RAID5 through a RAID0 it will only see 64K at a
> > time, but that shouldn't make any difference to its behaviour and
> > shouldn't change the way the requests finally get to the device.
> >
> > So I have no idea why you see a difference.
> >
> > I suspect lots of block-layer tracing, lots of staring at code, and
> > lots of head scratching would be needed to understand what is really
> > going on.
>
> Note that "max_segments" for the raid0 = 1, and max_segment_size =
> 4096, which tells Linux that the md can only take a single 4KB page
> per IO request.

Ah, of course.  RAID5 sets a merge_bvec_fn so that there is some chance
that read requests can bypass the cache.
As RAID0 doesn't honour the merge_bvec_fn (maybe it should), it sets the
max request size to 1 page.  RAID10 sets a merge_bvec_fn too, so RAID0
will be sending it requests in 1-page pieces.

> The scheduler shouldn't be involved in the transaction between the
> RAID0 and RAID5, as neither uses the scheduler, so it shouldn't merge
> there, but it also shouldn't be fragmenting.
>
> Not having the RAID0 send the larger chunks to the RAID5s may cause
> more fragmentation than the drive's scheduler will be able to
> re-merge.

How hard can it be to merge a few (thousand) requests??? :-)

NeilBrown
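[As a quick sanity check on the numbers in this thread: avgrq-sz in
"iostat -x" output is reported in 512-byte sectors, so the two observed
values correspond exactly to the 32K RAID5 chunk and the one-page limit
RAID0 advertises.  A minimal sketch; the sysfs paths in the comments
assume the md0 device name from the report above:]

```shell
# avgrq-sz in "iostat -dmx" is reported in 512-byte sectors.
echo $((64 * 512))   # 32768 bytes = 32K, matching the RAID5 chunk size
echo $((8 * 512))    # 4096 bytes = one page, the limit RAID0 advertises

# The advertised limits can be read back from sysfs, e.g.:
#   cat /sys/block/md0/queue/max_segments       # 1
#   cat /sys/block/md0/queue/max_segment_size   # 4096
```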