Full stripe write in RAID6

Hi,
If I am writing an entire stripe, does the RAID6 md driver need to read
any blocks from the underlying devices?

I have created a RAID6 device with the default (512K) chunk size across a
total of 6 devices. cat /sys/block/md127/queue/optimal_io_size reports
2097152, which I believe is the full stripe width (512K * 4 data disks).
If I write 2MB of data, I expect to dirty the entire stripe, so I should
not need to read any of the data or parity blocks, thus avoiding the
RAID6 write penalty. Does the md/raid driver support full-stripe writes
that avoid the RAID6 penalty?
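
For reference, this is how I am confirming the geometry (assuming mdadm
and the usual md sysfs attributes are available):

# mdadm --detail /dev/md127 | grep -E 'Level|Raid Devices|Chunk Size'
# cat /sys/block/md127/md/chunk_size
# cat /sys/block/md127/queue/optimal_io_size

chunk_size reports 524288 bytes, and 524288 * 4 data disks = 2097152,
matching optimal_io_size.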

I also expected each of the 6 disks to receive a 512K write (4 data
disks + 2 parity disks).

If I do I/O directly on the block device /dev/md127, I observe reads
happening on the md device as well as on the underlying member devices.

# /proc/mdstat output:
md127 : active raid6 sdah1[5] sdai1[4] sdaj1[3] sdcg1[2] sdch1[1] sdci1[0]
      41926656 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6]
[UUUUUU]



# time (dd if=/dev/zero of=/dev/md127 bs=2M count=1 && sync)

# iostat:
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdaj1            19.80         1.60       205.20          8       1026
sdai1            18.20         0.00       205.20          0       1026
sdah1            33.60        11.20       344.40         56       1722
sdcg1            20.20         0.00       205.20          0       1026
sdci1            31.00         3.20       344.40         16       1722
sdch1            34.00       120.00       205.20        600       1026
md127           119.20       134.40       819.20        672       4096
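
One possible cause of these reads is the page cache: a buffered write to
/dev/md127 goes through the page cache, and writeback may submit it to
the md driver in sub-stripe pieces, each of which would trigger a
read-modify-write. As a cross-check (assuming this dd supports
oflag=direct), the same write can bypass the cache:

# time (dd if=/dev/zero of=/dev/md127 bs=2M count=1 oflag=direct && sync)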


To avoid any cache effects, I used a raw device (which also bypasses the
page cache) to perform the I/O. For a single stripe write I then observe
no reads. At the same time I see a few disks getting more writes than
expected, and I do not understand why.

# raw -qa
/dev/raw/raw1:  bound to major 9, minor 127

# time (dd if=/dev/zero of=/dev/raw/raw1 bs=2M count=1 && sync)

# iostat shows:
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdaj1             7.00         0.00       205.20          0       1026
sdai1             6.20         0.00       205.20          0       1026
sdah1             9.80         0.00       246.80          0       1234
sdcg1             6.80         0.00       205.20          0       1026
sdci1             9.60         0.00       246.80          0       1234
sdch1             6.80         0.00       205.20          0       1026
md127             0.80         0.00       819.20          0       4096
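
Regarding the extra writes on sdah1 and sdci1 (1234 vs. 1026 blocks): my
guess (only an assumption) is md metadata updates such as a write-intent
bitmap or superblock write. One way to check whether a bitmap is present:

# mdadm --detail /dev/md127 | grep -i bitmap
# cat /proc/mdstat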

I assumed that if I perform writes in multiples of optimal_io_size I
would be doing full-stripe writes and thus avoid reads. But
unfortunately, with two 2M writes I do see reads happening on some of
the drives. The same happens for count=4 or count=6 (equal to the number
of data disks or the total number of disks).
# time (dd if=/dev/zero of=/dev/raw/raw1 bs=2M count=2 && sync)


Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdaj1            13.40       204.80       410.00       1024       2050
sdai1            11.20         0.00       410.00          0       2050
sdah1            15.80         0.00       464.40          0       2322
sdcg1            13.20       204.80       410.00       1024       2050
sdci1            16.60         0.00       464.40          0       2322
sdch1            12.40       192.00       410.00        960       2050
md127             1.60         0.00      1638.40          0       8192
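
From what I have read, raid456 assembles writes in a stripe cache, and
if later requests arrive before earlier stripes are completely filled,
the driver may start read-modify-write on partially filled stripes.
Enlarging the stripe cache is a commonly suggested mitigation (whether
it helps here is an assumption to be tested):

# cat /sys/block/md127/md/stripe_cache_size
# echo 8192 > /sys/block/md127/md/stripe_cache_size

The cache is counted in stripe entries (one page per member device per
entry), so 8192 entries on 6 devices with 4K pages is about 192 MB of
memory.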


I also read about /sys/block/md127/md/preread_bypass_threshold and tried
setting it to 0, as suggested elsewhere, but it did not help.

I believe the RAID6 penalty applies to random writes, but for sequential
writes, does it still exist in some other form in the Linux md/raid
driver? My aim is to maximize the RAID6 write rate for sequential writes
without incurring the RAID6 penalty.

Please correct me wherever my assumptions are wrong, and let me know if
any other configuration parameter (for the block device or the md
device) is needed to achieve this.

--
Mandar Joshi