Oh, and since '-p f4' works so well, it really seems like there is a bug in the 'near' code. We are going to dig through the code and see if we can find anything. I could see mechanical drives getting an advantage with 'far', but for SSDs it should make little difference.

RAID10 f4

# fio -rw=read --size=5G --name=mdadm_test
...
Disk stats (read/write):
  md15: ios=45212/5, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=14064/13, aggrmerge=0/0, aggrticks=290590/893, aggrin_queue=291481, aggrutil=98.95%
  loop23: ios=15328/13, merge=0/0, ticks=337884/928, in_queue=338816, util=98.95%
  loop21: ios=15329/13, merge=0/0, ticks=314396/984, in_queue=315372, util=98.75%
  loop24: ios=12800/13, merge=0/0, ticks=270368/904, in_queue=271268, util=98.59%
  loop22: ios=12800/13, merge=0/0, ticks=239712/756, in_queue=240468, util=98.51%

# fio -rw=randread --size=5G --name=mdadm_test
...
Disk stats (read/write):
  md15: ios=1305867/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=327680/0, aggrmerge=0/0, aggrticks=21163/0, aggrin_queue=21146, aggrutil=23.32%
  loop23: ios=327680/0, merge=0/0, ticks=21512/0, in_queue=21496, util=23.32%
  loop21: ios=327680/0, merge=0/0, ticks=20716/0, in_queue=20692, util=22.44%
  loop24: ios=327680/0, merge=0/0, ticks=21500/0, in_queue=21488, util=23.31%
  loop22: ios=327680/0, merge=0/0, ticks=20924/0, in_queue=20908, util=22.68%
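(For anyone who wants to reproduce this: '-p f4' is just shorthand for '--layout=f4'. A test array like md15 can be thrown together on loop devices roughly like this -- the image sizes and loop numbers here are only illustrative, not necessarily the exact ones we used:)

truncate -s 8G /tmp/md15-disk{21..24}.img
for i in 21 22 23 24; do losetup /dev/loop$i /tmp/md15-disk$i.img; done
mdadm --create /dev/md15 --level=10 --layout=f4 --raid-devices=4 \
      /dev/loop21 /dev/loop22 /dev/loop23 /dev/loop24
mkdir -p /mnt/md15 && mkfs.ext4 /dev/md15 && mount /dev/md15 /mnt/md15
# then run fio from /mnt/md15 as above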
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Wed, Nov 2, 2016 at 3:27 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> hmmm....
>
> RAID1
> root@rleblanc-pc:~/junk# fio -rw=read --size=1G --numjobs=4 --name=mdadm_test --group_reporting
> mdadm_test: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
> ...
> fio-2.10
> Starting 4 processes
> mdadm_test: Laying out IO file(s) (1 file(s) / 1024MB)
> mdadm_test: Laying out IO file(s) (1 file(s) / 1024MB)
> mdadm_test: Laying out IO file(s) (1 file(s) / 1024MB)
> Jobs: 1 (f=1): [R(1),_(3)] [88.9% done] [423.8MB/0KB/0KB /s] [108K/0/0 iops] [eta 00m:01s]
> mdadm_test: (groupid=0, jobs=4): err= 0: pid=20564: Wed Nov 2 15:15:40 2016
>   read : io=4096.0MB, bw=567642KB/s, iops=141910, runt= 7389msec
>     clat (usec): min=0, max=22233, avg=23.02, stdev=288.38
>      lat (usec): min=0, max=22233, avg=23.12, stdev=288.38
>     clat percentiles (usec):
>      |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    1],
>      | 30.00th=[    1], 40.00th=[    1], 50.00th=[    1], 60.00th=[    2],
>      | 70.00th=[    2], 80.00th=[    2], 90.00th=[    2], 95.00th=[    3],
>      | 99.00th=[  644], 99.50th=[ 1144], 99.90th=[ 4128], 99.95th=[ 5600],
>      | 99.99th=[11584]
>     bw (KB /s): min=94396, max=469418, per=28.62%, avg=162451.40, stdev=81106.83
>     lat (usec) : 2=58.15%, 4=39.21%, 10=0.87%, 20=0.09%, 50=0.16%
>     lat (usec) : 100=0.13%, 250=0.14%, 500=0.13%, 750=0.26%, 1000=0.29%
>     lat (msec) : 2=0.29%, 4=0.20%, 10=0.09%, 20=0.01%, 50=0.01%
>   cpu          : usr=4.14%, sys=10.87%, ctx=15564, majf=0, minf=41
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=1048576/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>    READ: io=4096.0MB, aggrb=567641KB/s, minb=567641KB/s, maxb=567641KB/s, mint=7389msec, maxt=7389msec
>
> Disk stats (read/write):
>   md13: ios=48375/3, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=12292/6, aggrmerge=0/0, aggrticks=31009/140, aggrin_queue=31145, aggrutil=97.41%
>   loop1: ios=14654/6, merge=0/0, ticks=39524/156, in_queue=39672, util=97.41%
>   loop4: ios=5791/6, merge=0/0, ticks=13976/100, in_queue=14072, util=45.45%
>   loop2: ios=16575/6, merge=0/0, ticks=37360/152, in_queue=37508, util=90.92%
>   loop3: ios=12150/6, merge=0/0, ticks=33176/152, in_queue=33328, util=91.08%
>
> Device:   rrqm/s   wrqm/s      r/s      w/s      rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> nvme0n1     0.50  1387.00  3234.00  2996.50  388746.00  17500.00    130.41      4.44   0.71     1.29     0.09   0.16  98.40
> loop1       0.00     0.00  1510.00     2.50  128839.75      6.50    170.38      5.10   3.37     3.34    24.80   0.66 100.00
> loop2       0.00     0.00  1570.00     2.50  133952.25      6.50    170.38      5.22   3.31     3.27    25.60   0.64 100.00
> loop3       0.00     0.00  1521.50     2.50  129855.75      6.50    170.42      5.00   3.27     3.24    25.60   0.65  98.60
> loop4       0.00     0.00     2.50     2.50     248.00      6.50    101.80      0.04   8.40     1.60    15.20   8.00   4.00
> loop5       0.00     0.00     0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
> md13        0.00     0.00  4603.50     1.50  392832.00      6.00    170.61      0.00   0.00     0.00     0.00   0.00   0.00
>
> root@rleblanc-pc:~/junk# fio -rw=randread --size=1G --numjobs=4 --name=mdadm_test --group_reporting
> mdadm_test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
> ...
> fio-2.10
> Starting 4 processes
> Jobs: 1 (f=1): [_(3),r(1)] [100.0% done] [35996KB/0KB/0KB /s] [8999/0/0 iops] [eta 00m:00s]
> mdadm_test: (groupid=0, jobs=4): err= 0: pid=21036: Wed Nov 2 15:17:47 2016
>   read : io=4096.0MB, bw=133254KB/s, iops=33313, runt= 31476msec
>     clat (usec): min=4, max=14896, avg=103.19, stdev=123.06
>      lat (usec): min=4, max=14896, avg=103.27, stdev=123.06
>     clat percentiles (usec):
>      |  1.00th=[    7],  5.00th=[    9], 10.00th=[   11], 20.00th=[   90],
>      | 30.00th=[   95], 40.00th=[   99], 50.00th=[  104], 60.00th=[  112],
>      | 70.00th=[  118], 80.00th=[  125], 90.00th=[  141], 95.00th=[  167],
>      | 99.00th=[  247], 99.50th=[  318], 99.90th=[ 2256], 99.95th=[ 2512],
>      | 99.99th=[ 4256]
>     bw (KB /s): min=26472, max=57008, per=28.80%, avg=38380.41, stdev=7929.82
>     lat (usec) : 10=6.96%, 20=10.26%, 50=1.27%, 100=22.67%, 250=57.86%
>     lat (usec) : 500=0.68%, 750=0.04%, 1000=0.02%
>     lat (msec) : 2=0.09%, 4=0.12%, 10=0.01%, 20=0.01%
>   cpu          : usr=1.51%, sys=7.30%, ctx=1051111, majf=0, minf=38
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=1048576/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>    READ: io=4096.0MB, aggrb=133254KB/s, minb=133254KB/s, maxb=133254KB/s, mint=31476msec, maxt=31476msec
>
> Disk stats (read/write):
>   md13: ios=1047839/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=262144/0, aggrmerge=0/0, aggrticks=25507/0, aggrin_queue=25490, aggrutil=92.98%
>   loop1: ios=342845/0, merge=0/0, ticks=29440/0, in_queue=29424, util=92.98%
>   loop4: ios=190900/0, merge=0/0, ticks=20568/0, in_queue=20552, util=65.09%
>   loop2: ios=257401/0, merge=0/0, ticks=26512/0, in_queue=26492, util=83.65%
>   loop3: ios=257430/0, merge=0/0, ticks=25508/0, in_queue=25492, util=80.67%
>
> Device:   rrqm/s   wrqm/s       r/s      w/s      rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> nvme0n1     0.00     0.00  34484.50     0.00  141398.00      0.00      8.20      3.02   0.09     0.09     0.00   0.03 100.00
> loop11      0.00     0.00      0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
> loop12      0.00     0.00      0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
> loop13      0.00     0.00      0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
> loop14      0.00     0.00      0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
> loop15      0.00     0.00      0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
> md14        0.00     0.00      0.00     0.00       0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
>
> RAID10
> root@rleblanc-pc:~/junk# fio -rw=read --size=1G --numjobs=4 --name=mdadm_test --group_reporting
> ...
> Disk stats (read/write):
>   md14: ios=36295/19, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=9227/27, aggrmerge=0/0, aggrticks=274586/1967, aggrin_queue=276552, aggrutil=98.05%
>   loop13: ios=9006/27, merge=0/0, ticks=253296/1824, in_queue=255120, util=95.31%
>   loop11: ios=9171/27, merge=0/0, ticks=260884/1876, in_queue=262760, util=96.57%
>   loop14: ios=9593/27, merge=0/0, ticks=313672/2256, in_queue=315924, util=98.05%
>   loop12: ios=9141/27, merge=0/0, ticks=270492/1912, in_queue=272404, util=97.20%
>
> root@rleblanc-pc:~/junk# fio -rw=randread --size=1G --numjobs=4 --name=mdadm_test --group_reporting
> ...
> Disk stats (read/write):
>   md14: ios=1047470/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=262144/0, aggrmerge=0/0, aggrticks=33242/0, aggrin_queue=33209, aggrutil=92.62%
>   loop13: ios=258512/0, merge=0/0, ticks=33188/0, in_queue=33160, util=90.21%
>   loop11: ios=275798/0, merge=0/0, ticks=34120/0, in_queue=34088, util=92.62%
>   loop14: ios=252031/0, merge=0/0, ticks=31976/0, in_queue=31936, util=87.15%
>   loop12: ios=262235/0, merge=0/0, ticks=33684/0, in_queue=33652, util=91.52%
>
> Much better distribution, especially on RAID10. I wonder if, because we
> are running a single VM on the array, libvirt is basically single-threaded,
> which would cause what we are seeing. I think libvirt can have multiple
> threads for I/O; we'll have to look into that. It is obvious that md
> can split reads from a single thread; I wonder what is preventing it
> from doing so more efficiently.
>
> This warrants more probing.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
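(On the libvirt point quoted above: if it really is a single QEMU I/O thread serializing the guest's reads, virtio iothreads might be worth a try. This is a rough sketch only -- the domain name and thread IDs are placeholders and we have not tried this on the box in question:)

virsh iothreadinfo guest-vm            # how many I/O threads the domain has now
virsh iothreadadd guest-vm 1 --config  # define extra iothreads (applied on next start)
virsh iothreadadd guest-vm 2 --config
# then 'virsh edit guest-vm' and pin each virtio disk to one of them, e.g.
#   <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>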
> On Wed, Nov 2, 2016 at 3:00 PM, Andreas Klauer <Andreas.Klauer@xxxxxxxxxxxxxx> wrote:
>> On Wed, Nov 02, 2016 at 01:56:02PM -0600, Robert LeBlanc wrote:
>>> Yes, we can have any number of disks in a RAID1 (we currently have
>>> three), but reads only ever come from the first drive.
>>
>> Only if there's only one reader. So it depends on what activity
>> there is on the machine.
>>
>>> We just need the option to grow a RAID10 like we can with RAID1.
>>
>> Patches welcome, I'm sure? ;-)
>>
>>> Basically, we want to be super paranoid with several identical copies
>>> of the data and get extra read performance.
>>
>> You could put RAID on RAID and thus achieve other modes, but I'm not sure
>> it's worth the overhead or even applies to your use case, and non-standard
>> setups always come with their own pitfalls.
>>
>> RAID1, with RAID0 on top: three disks ABC, two partitions ab,
>> different disk order.
>>
>>      A B C
>>    a 1 2 3
>>    b 3 1 2
>>
>> Three RAID1s md1, md2, md3 (and md0, a RAID0 on top).
>>
>> You can grow it.
>>
>>      A B C D
>>    a 1 2 3 ?
>>    b 3 1 2 ?
>>
>>      A B C D
>>    a 1 2 3 ?
>>    b 3 1 2 3
>>
>> md3 has 3 disks temporarily here.
>>
>>      A B C D
>>    a 1 2 3 4
>>    b 4 1 2 3
>>
>> md4 is new, to be added to md0.
>>
>> Three copies? Same thing with three partitions.
>>
>> Will it help any or make things worse? I dunno.
>> Have to be careful to make md0 assemble last.
>>
>> Could also be RAID5 on top instead of RAID1.
>> That's even stranger though.
>>
>> Regards
>> Andreas Klauer
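For anyone who wants to experiment with Andreas's layered layout, the three-disk starting point maps to mdadm roughly like this (the disk and partition names are made up; this is only a sketch of the tables above, not a tested recipe):

# md1 = Aa+Bb, md2 = Ba+Cb, md3 = Ca+Ab, per the first table above
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb2
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc2
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdc1 /dev/sda2
# stripe over the mirrors; as Andreas notes, md0 has to assemble last
mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/md1 /dev/md2 /dev/md3

Growing would then follow Andreas's tables: temporarily make md3 a three-way mirror by adding the new disk's second partition, move sda2 out of md3 into a new md4 together with the new disk's first partition, and finally extend md0 over md4 (which assumes a RAID0 that mdadm can actually reshape).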