Re: Issue with growing RAID10

Oh, and since '-p f4' works so well, it really seems like there is a
bug in the 'near' layout code. We are going to see if we can find
anything in the code. I could see mechanical drives getting an
advantage from 'far', but on SSDs the layout should make little
difference.
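
For anyone wanting to reproduce this, a 4-copy far-layout array like
the one below can be created roughly like this (sketch only; the loop
device names are taken from the stats below, exact options may differ):

# mdadm --create /dev/md15 --level=10 --raid-devices=4 -p f4 \
    /dev/loop21 /dev/loop22 /dev/loop23 /dev/loop24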

RAID10 f4
# fio -rw=read --size=5G --name=mdadm_test
...
Disk stats (read/write):
   md15: ios=45212/5, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=14064/13, aggrmerge=0/0, aggrticks=290590/893,
aggrin_queue=291481, aggrutil=98.95%
 loop23: ios=15328/13, merge=0/0, ticks=337884/928, in_queue=338816, util=98.95%
 loop21: ios=15329/13, merge=0/0, ticks=314396/984, in_queue=315372, util=98.75%
 loop24: ios=12800/13, merge=0/0, ticks=270368/904, in_queue=271268, util=98.59%
 loop22: ios=12800/13, merge=0/0, ticks=239712/756, in_queue=240468, util=98.51%

# fio -rw=randread --size=5G --name=mdadm_test
...
Disk stats (read/write):
   md15: ios=1305867/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=327680/0, aggrmerge=0/0, aggrticks=21163/0,
aggrin_queue=21146, aggrutil=23.32%
 loop23: ios=327680/0, merge=0/0, ticks=21512/0, in_queue=21496, util=23.32%
 loop21: ios=327680/0, merge=0/0, ticks=20716/0, in_queue=20692, util=22.44%
 loop24: ios=327680/0, merge=0/0, ticks=21500/0, in_queue=21488, util=23.31%
 loop22: ios=327680/0, merge=0/0, ticks=20924/0, in_queue=20908, util=22.68%
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Nov 2, 2016 at 3:27 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> hmmm....
>
> RAID1
> root@rleblanc-pc:~/junk# fio -rw=read --size=1G --numjobs=4
> --name=mdadm_test --group_reporting
> mdadm_test: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
> ...
> fio-2.10
> Starting 4 processes
> mdadm_test: Laying out IO file(s) (1 file(s) / 1024MB)
> mdadm_test: Laying out IO file(s) (1 file(s) / 1024MB)
> mdadm_test: Laying out IO file(s) (1 file(s) / 1024MB)
> Jobs: 1 (f=1): [R(1),_(3)] [88.9% done] [423.8MB/0KB/0KB /s] [108K/0/0
> iops] [eta 00m:01s]
> mdadm_test: (groupid=0, jobs=4): err= 0: pid=20564: Wed Nov  2 15:15:40 2016
>  read : io=4096.0MB, bw=567642KB/s, iops=141910, runt=  7389msec
>    clat (usec): min=0, max=22233, avg=23.02, stdev=288.38
>     lat (usec): min=0, max=22233, avg=23.12, stdev=288.38
>    clat percentiles (usec):
>     |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    1],
>     | 30.00th=[    1], 40.00th=[    1], 50.00th=[    1], 60.00th=[    2],
>     | 70.00th=[    2], 80.00th=[    2], 90.00th=[    2], 95.00th=[    3],
>     | 99.00th=[  644], 99.50th=[ 1144], 99.90th=[ 4128], 99.95th=[ 5600],
>     | 99.99th=[11584]
>    bw (KB  /s): min=94396, max=469418, per=28.62%, avg=162451.40, stdev=81106.83
>    lat (usec) : 2=58.15%, 4=39.21%, 10=0.87%, 20=0.09%, 50=0.16%
>    lat (usec) : 100=0.13%, 250=0.14%, 500=0.13%, 750=0.26%, 1000=0.29%
>    lat (msec) : 2=0.29%, 4=0.20%, 10=0.09%, 20=0.01%, 50=0.01%
>  cpu          : usr=4.14%, sys=10.87%, ctx=15564, majf=0, minf=41
>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     issued    : total=r=1048576/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>     latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   READ: io=4096.0MB, aggrb=567641KB/s, minb=567641KB/s,
> maxb=567641KB/s, mint=7389msec, maxt=7389msec
>
> Disk stats (read/write):
>    md13: ios=48375/3, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=12292/6, aggrmerge=0/0, aggrticks=31009/140,
> aggrin_queue=31145, aggrutil=97.41%
>  loop1: ios=14654/6, merge=0/0, ticks=39524/156, in_queue=39672, util=97.41%
>  loop4: ios=5791/6, merge=0/0, ticks=13976/100, in_queue=14072, util=45.45%
>  loop2: ios=16575/6, merge=0/0, ticks=37360/152, in_queue=37508, util=90.92%
>  loop3: ios=12150/6, merge=0/0, ticks=33176/152, in_queue=33328, util=91.08%
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> nvme0n1           0.50  1387.00 3234.00 2996.50 388746.00 17500.00   130.41     4.44    0.71    1.29    0.09   0.16  98.40
> loop1             0.00     0.00 1510.00    2.50 128839.75     6.50   170.38     5.10    3.37    3.34   24.80   0.66 100.00
> loop2             0.00     0.00 1570.00    2.50 133952.25     6.50   170.38     5.22    3.31    3.27   25.60   0.64 100.00
> loop3             0.00     0.00 1521.50    2.50 129855.75     6.50   170.42     5.00    3.27    3.24   25.60   0.65  98.60
> loop4             0.00     0.00    2.50    2.50   248.00     6.50   101.80     0.04    8.40    1.60   15.20   8.00   4.00
> loop5             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md13              0.00     0.00 4603.50    1.50 392832.00     6.00   170.61     0.00    0.00    0.00    0.00   0.00   0.00
>
> root@rleblanc-pc:~/junk# fio -rw=randread --size=1G --numjobs=4
> --name=mdadm_test --group_reporting
> mdadm_test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
> ...
> fio-2.10
> Starting 4 processes
> Jobs: 1 (f=1): [_(3),r(1)] [100.0% done] [35996KB/0KB/0KB /s]
> [8999/0/0 iops] [eta 00m:00s]
> mdadm_test: (groupid=0, jobs=4): err= 0: pid=21036: Wed Nov  2 15:17:47 2016
>  read : io=4096.0MB, bw=133254KB/s, iops=33313, runt= 31476msec
>    clat (usec): min=4, max=14896, avg=103.19, stdev=123.06
>     lat (usec): min=4, max=14896, avg=103.27, stdev=123.06
>    clat percentiles (usec):
>     |  1.00th=[    7],  5.00th=[    9], 10.00th=[   11], 20.00th=[   90],
>     | 30.00th=[   95], 40.00th=[   99], 50.00th=[  104], 60.00th=[  112],
>     | 70.00th=[  118], 80.00th=[  125], 90.00th=[  141], 95.00th=[  167],
>     | 99.00th=[  247], 99.50th=[  318], 99.90th=[ 2256], 99.95th=[ 2512],
>     | 99.99th=[ 4256]
>    bw (KB  /s): min=26472, max=57008, per=28.80%, avg=38380.41, stdev=7929.82
>    lat (usec) : 10=6.96%, 20=10.26%, 50=1.27%, 100=22.67%, 250=57.86%
>    lat (usec) : 500=0.68%, 750=0.04%, 1000=0.02%
>    lat (msec) : 2=0.09%, 4=0.12%, 10=0.01%, 20=0.01%
>  cpu          : usr=1.51%, sys=7.30%, ctx=1051111, majf=0, minf=38
>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>     issued    : total=r=1048576/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>     latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   READ: io=4096.0MB, aggrb=133254KB/s, minb=133254KB/s,
> maxb=133254KB/s, mint=31476msec, maxt=31476msec
>
> Disk stats (read/write):
>    md13: ios=1047839/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=262144/0, aggrmerge=0/0, aggrticks=25507/0,
> aggrin_queue=25490, aggrutil=92.98%
>  loop1: ios=342845/0, merge=0/0, ticks=29440/0, in_queue=29424, util=92.98%
>  loop4: ios=190900/0, merge=0/0, ticks=20568/0, in_queue=20552, util=65.09%
>  loop2: ios=257401/0, merge=0/0, ticks=26512/0, in_queue=26492, util=83.65%
>  loop3: ios=257430/0, merge=0/0, ticks=25508/0, in_queue=25492, util=80.67%
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> nvme0n1           0.00     0.00 34484.50    0.00 141398.00     0.00     8.20     3.02    0.09    0.09    0.00   0.03 100.00
> loop11            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> loop12            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> loop13            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> loop14            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> loop15            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md14              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>
> RAID10
> root@rleblanc-pc:~/junk# fio -rw=read --size=1G --numjobs=4
> --name=mdadm_test --group_reporting
> ...
> Disk stats (read/write):
>    md14: ios=36295/19, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=9227/27, aggrmerge=0/0, aggrticks=274586/1967,
> aggrin_queue=276552, aggrutil=98.05%
>  loop13: ios=9006/27, merge=0/0, ticks=253296/1824, in_queue=255120, util=95.31%
>  loop11: ios=9171/27, merge=0/0, ticks=260884/1876, in_queue=262760, util=96.57%
>  loop14: ios=9593/27, merge=0/0, ticks=313672/2256, in_queue=315924, util=98.05%
>  loop12: ios=9141/27, merge=0/0, ticks=270492/1912, in_queue=272404, util=97.20%
>
> root@rleblanc-pc:~/junk# fio -rw=randread --size=1G --numjobs=4
> --name=mdadm_test --group_reporting
> ...
> Disk stats (read/write):
>    md14: ios=1047470/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=262144/0, aggrmerge=0/0, aggrticks=33242/0,
> aggrin_queue=33209, aggrutil=92.62%
>  loop13: ios=258512/0, merge=0/0, ticks=33188/0, in_queue=33160, util=90.21%
>  loop11: ios=275798/0, merge=0/0, ticks=34120/0, in_queue=34088, util=92.62%
>  loop14: ios=252031/0, merge=0/0, ticks=31976/0, in_queue=31936, util=87.15%
>  loop12: ios=262235/0, merge=0/0, ticks=33684/0, in_queue=33652, util=91.52%
>
> Much better distribution, especially on RAID10. I wonder if, because
> we are running a single VM on the array, libvirt being basically
> single-threaded is causing what we are seeing. I think libvirt can use
> multiple threads for I/O; we'll have to look into that. It is obvious
> that md can split reads coming from a single thread, so I wonder what
> is preventing it from doing so more efficiently.
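>
> For example, something along these lines in the libvirt domain XML
> should give the guest disk a dedicated I/O thread (untested sketch;
> the iothread count and the device/target names are just placeholders):
>
>   <iothreads>2</iothreads>
>   ...
>   <disk type='block' device='disk'>
>     <driver name='qemu' type='raw' iothread='1'/>
>     <source dev='/dev/md14'/>
>     <target dev='vda' bus='virtio'/>
>   </disk>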
>
> This warrants more probing.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Nov 2, 2016 at 3:00 PM, Andreas Klauer
> <Andreas.Klauer@xxxxxxxxxxxxxx> wrote:
>> On Wed, Nov 02, 2016 at 01:56:02PM -0600, Robert LeBlanc wrote:
>>> Yes, we can have any number of disks in a RAID1 (we currently have
>>> three), but reads only ever come from the first drive.
>>
>> Only if there's only one reader. So it depends on what activity
>> there is on the machine.
>>
>>> We just need the option to grow a RAID10 like we can with RAID1.
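>>> For RAID1 that is roughly (sketch only, /dev/mdX and /dev/sdY are
>>> placeholders for the array and the new disk):
>>>
>>>   mdadm /dev/mdX --add /dev/sdY
>>>   mdadm --grow /dev/mdX --raid-devices=4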
>>
>> Patches welcome, I'm sure? ;-)
>>
>>> Basically, we want to be super paranoid with several identical copies
>>> of the data and get extra read performance.
>>
>> You could put RAID on RAID and thus achieve other modes, but I'm not
>> sure it's worth the overhead or even applies to your use case, and
>> non-standard setups always come with their own pitfalls.
>>
>> RAID 1, with RAID0 on top, three disks ABC, two partitions ab,
>> different disk order.
>>
>>   A B C
>> a 1 2 3
>> b 3 1 2
>>
>> Three RAID1s: md1, md2, md3 (and md0, a RAID0 on top).
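>>
>> For illustration only (untested sketch; hypothetical partition names,
>> with partition 1 standing for 'a' and partition 2 for 'b'), that
>> could be built roughly like:
>>
>>   mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdA1 /dev/sdB2
>>   mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdB1 /dev/sdC2
>>   mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdC1 /dev/sdA2
>>   mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/md1 /dev/md2 /dev/md3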
>>
>> You can grow it.
>>
>>   A B C D
>> a 1 2 3 ?
>> b 3 1 2 ?
>>
>>   A B C D
>> a 1 2 3 ?
>> b 3 1 2 3
>>
>> md3 temporarily has 3 disks here; A's b partition is then dropped
>> from md3 so it can be reused for the new md4.
>>
>>   A B C D
>> a 1 2 3 4
>> b 4 1 2 3
>>
>> md4 is new, to be added to md0.
>>
>> Three copies? Same thing with three partitions.
>>
>> Will it help any or make things worse? I dunno.
>> Have to be careful to make md0 assemble last.
>>
>> Could also be RAID5 on top instead of RAID1.
>> That's even stranger though.
>>
>> Regards
>> Andreas Klauer


