Re: Full stripe write in RAID6

On Mon, 18 Aug 2014 21:25:25 +0530 "Mandar Joshi"
<mandar.joshi@xxxxxxxxxxxxxx> wrote:

> Thanks Neil for the reply...
> Comments inline...
> 
> -----Original Message-----
> From: NeilBrown [mailto:neilb@xxxxxxx] 
> Sent: Wednesday, August 06, 2014 12:17 PM
> To: Mandar Joshi
> Cc: linux-raid@xxxxxxxxxxxxxxx
> Subject: Re: Full stripe write in RAID6
> 
> On Tue, 5 Aug 2014 21:55:46 +0530 "Mandar Joshi"
> <mandar.joshi@xxxxxxxxxxxxxx> wrote:
> 
> > Hi,
> >                 If I am writing entire stripe then whether RAID6 md 
> > driver need to read any of the blocks from underlying device?
> >                 
> >                 I have created RAID6 device with default (512K) chunk 
> > size with total 6 RAID devices. cat 
> > /sys/block/md127/queue/optimal_io_size =
> > 2097152 I believe this is full stripe (512K * 4 data disks). 
> > If I write 2MB data, I am expected to dirty entire stripe hence what I 
> > believe I need not require to read either any of the data block or 
> > parity blocks. Thus avoiding RAID6 penalties. Whether md/raid driver 
> > supports full stripe writes by avoiding RAID 6 penalties?
> > 
> > I also expected 6 disks will receive 512K writes each. (4 data disk + 
> > 2 parity disks).
> 
> Your expectation is correct in theory, but it doesn't always quite work like that in practice.
> The write request will arrive at the raid6 driver in smaller chunks and it doesn't always decide correctly whether it should wait for more writes to arrive, or if it should start reading now.
> 
> It would certainly be good to "fix" the scheduling in raid5/raid6, but no one has worked out how yet.
> 
> NeilBrown
> 
> [Mandar] Tuning sysfs/.../md/stripe_cache_size=32768 significantly lowered pre-reads, as discussed above. Since the larger cache does not force stripes to be handled for completion right away, stripe handling gets time to dirty the next full stripes entirely, thus avoiding pre-reads. Still, some of the stripes were not so lucky.
> Further setting sysfs/.../md/preread_bypass_threshold equal to stripe_cache_size, i.e. 32768, almost eliminated pre-reads in my case.
> Neil mentioned that "the raid6 driver gets write requests in smaller chunks."
> Also, correct me if my understanding below is wrong.
> Is it because the md/raid driver does not have its own IO scheduler that can merge requests? Can we not have an IO scheduler for md?

md/raid5 does have a scheduler and does merge requests.
It is quite unlike the IO scheduler for a SCSI (or similar) device because
its goal is different.  The scheduler merges requests into a stripe
rather than into a sequence, because that is what benefits raid5.
raid5 sends single-page requests down to the underlying driver and expects it
to merge them into multi-page requests if it would benefit from that.

The problem is that the raid5 scheduler isn't very clever and gets it wrong
sometimes.
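
If you want to watch the scheduler at work, a rough sketch (assuming the
same md127 layout described below) is to check the stripe cache around a
supposedly full-stripe write and see whether any reads were issued to the
member disks:

# cat /sys/block/md127/md/stripe_cache_size
# dd if=/dev/zero of=/dev/md127 bs=2M count=1 oflag=direct
# cat /sys/block/md127/md/stripe_cache_active
# iostat

Any non-zero Blk_read on the member disks during the write indicates that
some stripes fell back to a read-modify-write or reconstruct-write instead
of a pure full-stripe write.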

> When I do any buffered write request on md/raid6, I always get multiple 4K requests. I think, in the absence of an IO scheduler, this is because buffered IO writes (from the page cache) will always be in one-page units?

Yes.
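
(If you want to confirm that, one way - assuming blktrace is available -
is to trace requests as they arrive at the md device:

# blktrace -d /dev/md127 -o - | blkparse -i -

For buffered writes, the queued write events show up as 8-sector, i.e.
4K, requests.)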

> For this reason, was the md/raid6 driver designed so that its internal stripe handling treats a stripe as 4K * noOfDisks?

Because that is easier.


> Why does the design not use internal stripe = chunk_size * noOfDisks?

That would either be very complex, or would require all IO to be in full
chunks, which is not ideal for small random IO.
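
To put rough numbers on it: with a 512K chunk and 6 devices, a chunk-sized
internal stripe would be 512K * 6 = 3MB, instead of the 4K * 6 = 24K that a
stripe_head covers today.  A single 4K random write would then have to
occupy (and potentially pre-read) a 3MB unit, and the stripe cache would
hold far fewer entries for the same memory.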

> I think it would help file systems that can do submit_bio with larger sizes(?)
> Is there any config-setting or patch to improve on in this case?

No - apart from the config settings you have already found.
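
For reference, those settings are the two sysfs knobs discussed above (the
values are the ones you used; tune to taste).  Note that, at least in
current kernels, preread_bypass_threshold cannot be raised above
stripe_cache_size, so set the cache size first:

# echo 32768 > /sys/block/md127/md/stripe_cache_size
# echo 32768 > /sys/block/md127/md/preread_bypass_threshold

Keep the memory cost in mind: the stripe cache pins roughly
stripe_cache_size * 4K * number-of-devices, which for 32768 entries on a
6-device array is about 768MB.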


NeilBrown


> In the case of direct IO, pages are accumulated and then handed to md/raid6, so md/raid6 can receive requests larger than 4K.
> But even here, with direct IO I could not get a write request larger than the chunk size. Any specific reason?
> 
> 
> 
> > 
> > If I do IO directly on the block device /dev/md127, I do observe reads
> > happening on the md device and on the underlying raid devices as well.
> > 
> > #mdstat o/p:
> > md127 : active raid6 sdah1[5] sdai1[4] sdaj1[3] sdcg1[2] sdch1[1] 
> > sdci1[0]
> >       41926656 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] 
> > [UUUUUU]
> > 
> > 
> > 
> > # time (dd if=/dev/zero of=/dev/md127 bs=2M count=1 && sync)
> > 
> > # iostat::
> > Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > sdaj1            19.80         1.60       205.20          8       1026
> > sdai1            18.20         0.00       205.20          0       1026
> > sdah1            33.60        11.20       344.40         56       1722
> > sdcg1            20.20         0.00       205.20          0       1026
> > sdci1            31.00         3.20       344.40         16       1722
> > sdch1            34.00       120.00       205.20        600       1026
> > md127           119.20       134.40       819.20        672       4096
> > 
> > 
> > So, to avoid any cache effect (?), I am using a raw device to perform the IO.
> > Then, for a single-stripe write, I observe no reads happening.
> > At the same time I also see a few disks getting more writes than
> > expected. I did not understand why.
> > 
> > # raw -qa
> > /dev/raw/raw1:  bound to major 9, minor 127
> > 
> > #time (dd if=/dev/zero of=/dev/raw/raw1 bs=2M count=1 && sync)
> > 
> > # iostat shows:
> > Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > sdaj1             7.00         0.00       205.20          0       1026
> > sdai1             6.20         0.00       205.20          0       1026
> > sdah1             9.80         0.00       246.80          0       1234
> > sdcg1             6.80         0.00       205.20          0       1026
> > sdci1             9.60         0.00       246.80          0       1234
> > sdch1             6.80         0.00       205.20          0       1026
> > md127             0.80         0.00       819.20          0       4096
> > 
> > I assume that if I perform writes in multiples of “optimal_io_size” I
> > would be doing full stripe writes, thus avoiding reads. But
> > unfortunately, with two 2M writes, I do see reads happening on some of
> > these drives. The same happens for count=4 or 6 (equal to the number of
> > data disks or total disks).
> > # time (dd if=/dev/zero of=/dev/raw/raw1 bs=2M count=2 && sync)
> > 
> > 
> > Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > sdaj1            13.40       204.80       410.00       1024       2050
> > sdai1            11.20         0.00       410.00          0       2050
> > sdah1            15.80         0.00       464.40          0       2322
> > sdcg1            13.20       204.80       410.00       1024       2050
> > sdci1            16.60         0.00       464.40          0       2322
> > sdch1            12.40       192.00       410.00        960       2050
> > md127             1.60         0.00      1638.40          0       8192
> > 
> > 
> > I read about “/sys/block/md127/md/preread_bypass_threshold”.
> > I tried setting this to 0, as suggested somewhere, but it did not help.
> > 
> > I believe the RAID6 penalties will exist for random writes, but for
> > sequential writes, will they still exist in some other form in the
> > Linux md/raid driver?
> > My aim is to maximize the RAID6 write IO rate with sequential writes,
> > without the RAID6 penalties.
> > 
> > Correct me wherever my assumptions are wrong. Let me know if any other
> > configuration parameter (for the block device or md device) is required
> > to achieve this.
> > 
> > --
> > Mandar Joshi
> 


