On Mon, 19 Dec 2011 16:56:16 -0700 Chris Worley <worleys@xxxxxxxxx> wrote:

> On Mon, Dec 19, 2011 at 4:24 PM, NeilBrown <neilb@xxxxxxx> wrote:
> > On Mon, 19 Dec 2011 15:43:13 -0700 Chris Worley <worleys@xxxxxxxxx> wrote:
> >
> >> It doesn't really matter what chunk sizes I set, but, for example, I
> >> create three RAID5s of 5 drives each with a chunk size of 32K, and
> >> create a RAID0 comprised of the three RAID5s with a chunk size of
> >> 64K:
> >>
> >> md0 : active raid0 md27[2] md26[1] md25[0]
> >>       1885098048 blocks super 1.2 64k chunks
> >>
> >> If I write to one of the RAID5s, using:
> >>
> >> # dd of=/dev/md27 if=/dev/zero bs=1024k oflag=direct
> >>
> >> ... then "iostat -dmx 2" shows the drives being written to in 32K
> >> chunks (avgrq-sz=64), as you'd expect.
> >>
> >> But writing to the RAID0 that's striping the RAID5s shows
> >> everything being written in 4KB chunks (iostat shows avgrq-sz=8) to
> >> the RAID0 as well as to the RAID5s.
> >
> > When writing to a RAID5 it *always* submits requests to the lower
> > layers in PAGE-sized units.  This makes it much easier to keep parity
> > and data aligned.
> >
> > The queue on the underlying device should sort the requests and group
> > them together, and your evidence suggests that it does.
> >
> > When writing to the RAID5 through a RAID0 it will only see 64K at a
> > time, but that shouldn't make any difference to its behaviour and
> > shouldn't change the way the requests finally get to the device.
> >
> > So I have no idea why you see a difference.
> >
> > I suspect lots of block-layer tracing, lots of staring at code, and
> > lots of head scratching would be needed to understand what is really
> > going on.
>
> Note that "max_segments" for the raid0 = 1, and max_segment_size =
> 4096, which tells Linux that the md can only take a single 4KB page
> per IO request.

Ah, of course.  RAID5 sets a merge_bvec_fn so that there is some chance
that read requests can bypass the cache.
As RAID0 doesn't honour the merge_bvec_fn (maybe it should), it sets the
max request size to 1 page.  RAID10 sets a merge_bvec_fn too, so RAID0
will be sending it requests in 1-page pieces.

> The scheduler shouldn't be involved in the transaction between the
> RAID0 and RAID5, as neither uses the scheduler, so it shouldn't merge
> there, but it also shouldn't be fragmenting.
>
> Not having the RAID0 send the larger chunks to the RAID5s may cause
> more fragmentation than the drive's scheduler will be able to
> re-merge.

How hard can it be to merge a few (thousand) requests??? :-)

NeilBrown
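[As a quick sanity check on the numbers in this thread: avgrq-sz in
"iostat -x" output is reported in 512-byte sectors, so the two observed
values correspond exactly to the 32K RAID5 chunk and the one-page limit
RAID0 advertises.  A minimal sketch; the sysfs paths in the comments
assume the md0 device name from the report above:]

```shell
# avgrq-sz in "iostat -dmx" is reported in 512-byte sectors.
echo $((64 * 512))   # 32768 bytes = 32K, matching the RAID5 chunk size
echo $((8 * 512))    # 4096 bytes = one page, the limit RAID0 advertises

# The advertised limits can be read back from sysfs, e.g.:
#   cat /sys/block/md0/queue/max_segments       # 1
#   cat /sys/block/md0/queue/max_segment_size   # 4096
```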