On Mon, Jun 12 2017, CoolCold wrote:

> Hello!
> I've started testing as you proposed and found other strange behavior
> with _4_ drives, ~44 iops as well:
>
> mdadm --create --assume-clean -c $((64*1)) -b internal --bitmap-chunk=$((128*1024)) -n 4 -l 10 /dev/md1 /dev/sde /dev/sdf /dev/sdg /dev/sdh
>
> mdstat:
> [root@spare-a17484327407661 rovchinnikov]# cat /proc/mdstat
> Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
> md1 : active raid10 sdh[3] sdg[2] sdf[1] sde[0]
>       3516066176 blocks super 1.2 64K chunks 2 near-copies [4/4] [UUUU]
>       bitmap: 0/14 pages [0KB], 131072KB chunk
>
> fio:
> [root@spare-a17484327407661 tests]# fio --runtime 60 randwrite.conf
> randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=512
> fio-2.2.8
> Starting 1 process
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/179KB/0KB /s] [0/44/0 iops] [eta 00m:00s]
> randwrite: (groupid=0, jobs=1): err= 0: pid=35048: Mon Jun 12 06:03:28 2017
>   write: io=35728KB, bw=609574B/s, iops=148, runt= 60018msec
>     slat (usec): min=6, max=3006.6K, avg=6714.32, stdev=33548.39
>     clat (usec): min=137, max=14323K, avg=3430245.54, stdev=4822029.01
>      lat (msec): min=22, max=14323, avg=3436.96, stdev=4830.87
>     clat percentiles (msec):
>      |  1.00th=[   40],  5.00th=[   76], 10.00th=[   87], 20.00th=[  115],
>      | 30.00th=[  437], 40.00th=[  510], 50.00th=[  553], 60.00th=[  619],
>      | 70.00th=[ 2376], 80.00th=[11600], 90.00th=[11731], 95.00th=[11863],
>      | 99.00th=[12387], 99.50th=[13435], 99.90th=[14091], 99.95th=[14222],
>      | 99.99th=[14353]
>     bw (KB  /s): min=  111, max=14285, per=95.41%, avg=567.70, stdev=1623.95
>     lat (usec) : 250=0.01%
>     lat (msec) : 50=2.02%, 100=12.52%, 250=7.01%, 500=17.02%, 750=30.62%
>     lat (msec) : 1000=0.12%, 2000=0.50%, >=2000=30.18%
>   cpu          : usr=0.06%, sys=0.34%, ctx=2607, majf=0, minf=30
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.3%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=0/w=8932/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=512
>
> Run status group 0 (all jobs):
>   WRITE: io=35728KB, aggrb=595KB/s, minb=595KB/s, maxb=595KB/s, mint=60018msec, maxt=60018msec
>
> Disk stats (read/write):
>   md1: ios=61/8928, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=16/6693, aggrmerge=0/0, aggrticks=13/512251, aggrin_queue=512265, aggrutil=83.63%
>   sde: ios=40/6787, merge=0/0, ticks=15/724812, in_queue=724824, util=83.63%
>   sdf: ios=2/6787, merge=0/0, ticks=5/694057, in_queue=694061, util=82.20%
>   sdg: ios=24/6599, merge=0/0, ticks=27/154988, in_queue=155022, util=80.72%
>   sdh: ios=1/6599, merge=0/0, ticks=6/475150, in_queue=475155, util=82.29%
>
> Comparing with the same drives in RAID5, fio shows ~142 iops:
> [root@spare-a17484327407661 tests]# fio --runtime 60 randwrite.conf
> randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=512
> fio-2.2.8
> Starting 1 process
> Jobs: 1 (f=1): [w(1)] [0.0% done] [0KB/571KB/0KB /s] [0/142/0 iops] [eta 93d:11h:20m:52s]
> randwrite: (groupid=0, jobs=1): err= 0: pid=34914: Mon Jun 12 05:59:23 2017
>   write: io=41880KB, bw=707115B/s, iops=172, runt= 60648msec
>
> The RAID5 was created basically the same way as the RAID10:
>
> mdadm --create --assume-clean -c $((64*1)) -b internal --bitmap-chunk=$((128*1024)) -n 4 -l 5 /dev/md1 /dev/sde /dev/sdf /dev/sdg /dev/sdh
>
> mdstat output for RAID5:
> [root@spare-a17484327407661 rovchinnikov]# cat /proc/mdstat
> Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
> md1 : active raid5 sdh[3] sdg[2] sdf[1] sde[0]
>       5274099264 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
>       bitmap: 7/7 pages [28KB], 131072KB chunk
>
> For both cases, the same fio config was used:
> [root@spare-a17484327407661 tests]# cat randwrite.conf
> [randwrite]
> blocksize=4k
> #blocksize=64k
> filename=/dev/md1
> #filename=/dev/md2
> readwrite=randwrite
> #rwmixread=75
> direct=1
> buffered=0
> ioengine=libaio
> iodepth=512
> #numjobs=4
> group_reporting=1
>
> From iostat, the hard drives are seeing more requests than md1 (compare
> 40-43 on md1 with ~60 per device):
> [root@spare-a17484327407661 rovchinnikov]# iostat -d -xk 1 /dev/md1 /dev/sde /dev/sdf /dev/sdg /dev/sdh
> Linux 3.10.0-327.el7.x86_64 (spare-a17484327407661.sgdc)   06/12/2017   _x86_64_   (40 CPU)
>
> Device:  rrqm/s  wrqm/s   r/s    w/s  rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sde        0.00    0.00  0.00  59.00   0.00  236.00     8.00     0.76  12.76    0.00   12.76  12.75  75.20
> sdh        0.00    0.00  0.00  59.00   0.00  236.00     8.00     0.76  12.85    0.00   12.85  12.88  76.00
> sdf        0.00    0.00  0.00  62.00   0.00  248.00     8.00     0.78  12.89    0.00   12.89  12.58  78.00
> sdg        0.00    0.00  0.00  62.00   0.00  248.00     8.00     0.77  12.71    0.00   12.71  12.45  77.20
> md1        0.00    0.00  0.00  40.00   0.00  160.00     8.00     0.00   0.00    0.00    0.00   0.00   0.00
>
> Device:  rrqm/s  wrqm/s   r/s    w/s  rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sde        0.00    0.00  0.00  66.00   0.00  264.00     8.00     0.80  12.39    0.00   12.39  12.09  79.80
> sdh        0.00    0.00  0.00  62.00   0.00  248.00     8.00     0.78  12.87    0.00   12.87  12.58  78.00
> sdf        0.00    0.00  0.00  66.00   0.00  264.00     8.00     0.78  11.82    0.00   11.82  11.82  78.00
> sdg        0.00    0.00  0.00  62.00   0.00  248.00     8.00     0.80  12.82    0.00   12.82  12.85  79.70
> md1        0.00    0.00  0.00  43.00   0.00  172.00     8.00     0.00   0.00    0.00    0.00   0.00   0.00
>
> Device:  rrqm/s  wrqm/s   r/s    w/s  rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sde        0.00    0.00  0.00  65.00   0.00  260.00     8.00     0.81  12.43    0.00   12.43  12.40  80.60
> sdh        0.00    0.00  0.00  58.00   0.00  232.00     8.00     0.74  12.81    0.00   12.81  12.78  74.10
> sdf        0.00    0.00  0.00  71.00   0.00  284.00     8.00     0.81  11.38    0.00   11.38  11.34  80.50
> sdg        0.00    0.00  0.00  64.00   0.00  256.00     8.00     0.82  12.77    0.00   12.77  12.73  81.50
> md1        0.00    0.00  0.00  43.00   0.00  172.00     8.00     0.00   0.00    0.00    0.00   0.00   0.00
>
> I don't see any good explanation for this; kindly waiting for your advice.

I really did want to see the multi-dimensional collection of data points, rather than just one. It is hard to see patterns in a single number.

RAID5 and RAID10 are not directly comparable. For every block written to the array, RAID10 writes 2 blocks and RAID5 writes 1.33 (on average), so you would expect 50% more writes to the devices in a (4-device) RAID10. Also, each bit in the bitmap for RAID10 covers less space, so you get more bitmap updates.

I don't think that quite covers the difference, though. 40 writes to the array against 60 writes to each device is a little high. I think that is the worst case: every write to the array updates the bitmap on all devices and the data on 2 devices. So it seems like every write is being handled synchronously, with no write combining. Normally multiple bitmap updates are handled with a single write.

Having only one job doing direct IO could quite possibly cause this worst-case performance (though I don't know the details of how fio works). Try using buffered IO (that easily allows more parallelism), and try multiple concurrent threads.

NeilBrown
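
As a quick check of the amplification figures above (assuming the near-2 RAID10 layout shown in the mdstat output, and averaging the RAID5 parity cost over a full 4-device stripe, which is where the 1.33 comes from):

    RAID10 (near-2, 4 disks): 1 array block  -> 2 device blocks written          = 2.00x
    RAID5  (4 disks):         3 array blocks -> 3 data + 1 parity device blocks  = 4/3 ~ 1.33x
    2.00 / 1.33 = 1.5, i.e. about 50% more device writes for the RAID10

This does not count the extra bitmap writes mentioned above.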
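To try the buffered, multi-job variant suggested above, one minimal change to the randwrite.conf shown earlier might look like the following sketch (only the direct/buffered/numjobs lines differ from the original config; the iodepth and numjobs values are illustrative, not tuned):

    [randwrite]
    blocksize=4k
    filename=/dev/md1
    readwrite=randwrite
    # buffered IO instead of direct IO
    direct=0
    buffered=1
    ioengine=libaio
    # Linux AIO is generally only truly asynchronous with unbuffered (direct) IO,
    # so with buffered IO the parallelism comes mainly from numjobs, not iodepth
    iodepth=64
    numjobs=4
    group_reporting=1

Running several jobs against /dev/md1 gives md more outstanding writes at once, which should make any write combining of the bitmap and mirror updates visible in the iostat numbers.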