Re: internal write-intent bitmap is horribly slow with RAID10 over 20 drives

On Mon, Jun 12 2017, CoolCold wrote:

> Hello!
> I've started doing the testing you proposed, and found other strange
> behavior with _4_ drives as well (~44 IOPS):
> mdadm --create --assume-clean -c $((64*1)) -b internal
> --bitmap-chunk=$((128*1024)) -n 4 -l 10 /dev/md1 /dev/sde /dev/sdf
> /dev/sdg /dev/sdh
>
> mdstat:
> [root@spare-a17484327407661 rovchinnikov]# cat /proc/mdstat
> Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
> md1 : active raid10 sdh[3] sdg[2] sdf[1] sde[0]
>       3516066176 blocks super 1.2 64K chunks 2 near-copies [4/4] [UUUU]
>       bitmap: 0/14 pages [0KB], 131072KB chunk
>
>
> fio:
> [root@spare-a17484327407661 tests]# fio --runtime 60 randwrite.conf
> randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
> iodepth=512
> fio-2.2.8
> Starting 1 process
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/179KB/0KB /s] [0/44/0 iops]
> [eta 00m:00s]
> randwrite: (groupid=0, jobs=1): err= 0: pid=35048: Mon Jun 12 06:03:28 2017
>   write: io=35728KB, bw=609574B/s, iops=148, runt= 60018msec
>     slat (usec): min=6, max=3006.6K, avg=6714.32, stdev=33548.39
>     clat (usec): min=137, max=14323K, avg=3430245.54, stdev=4822029.01
>      lat (msec): min=22, max=14323, avg=3436.96, stdev=4830.87
>     clat percentiles (msec):
>      |  1.00th=[   40],  5.00th=[   76], 10.00th=[   87], 20.00th=[  115],
>      | 30.00th=[  437], 40.00th=[  510], 50.00th=[  553], 60.00th=[  619],
>      | 70.00th=[ 2376], 80.00th=[11600], 90.00th=[11731], 95.00th=[11863],
>      | 99.00th=[12387], 99.50th=[13435], 99.90th=[14091], 99.95th=[14222],
>      | 99.99th=[14353]
>     bw (KB  /s): min=  111, max=14285, per=95.41%, avg=567.70, stdev=1623.95
>     lat (usec) : 250=0.01%
>     lat (msec) : 50=2.02%, 100=12.52%, 250=7.01%, 500=17.02%, 750=30.62%
>     lat (msec) : 1000=0.12%, 2000=0.50%, >=2000=30.18%
>   cpu          : usr=0.06%, sys=0.34%, ctx=2607, majf=0, minf=30
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.3%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=0/w=8932/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=512
>
> Run status group 0 (all jobs):
>   WRITE: io=35728KB, aggrb=595KB/s, minb=595KB/s, maxb=595KB/s,
> mint=60018msec, maxt=60018msec
>
> Disk stats (read/write):
>     md1: ios=61/8928, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=16/6693, aggrmerge=0/0, aggrticks=13/512251,
> aggrin_queue=512265, aggrutil=83.63%
>   sde: ios=40/6787, merge=0/0, ticks=15/724812, in_queue=724824, util=83.63%
>   sdf: ios=2/6787, merge=0/0, ticks=5/694057, in_queue=694061, util=82.20%
>   sdg: ios=24/6599, merge=0/0, ticks=27/154988, in_queue=155022, util=80.72%
>   sdh: ios=1/6599, merge=0/0, ticks=6/475150, in_queue=475155, util=82.29%
>
>
>
>
>
> For comparison, the same drives in RAID5 give ~142 IOPS with fio:
> [root@spare-a17484327407661 tests]# fio --runtime 60 randwrite.conf
> randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
> iodepth=512
> fio-2.2.8
> Starting 1 process
> Jobs: 1 (f=1): [w(1)] [0.0% done] [0KB/571KB/0KB /s] [0/142/0 iops]
> [eta 93d:11h:20m:52s]
> randwrite: (groupid=0, jobs=1): err= 0: pid=34914: Mon Jun 12 05:59:23 2017
>   write: io=41880KB, bw=707115B/s, iops=172, runt= 60648msec
>
> The RAID5 array was created basically the same way as the RAID10:
> mdadm --create --assume-clean -c $((64*1)) -b internal
> --bitmap-chunk=$((128*1024)) -n 4 -l 5 /dev/md1 /dev/sde /dev/sdf
> /dev/sdg /dev/sdh
>
> mdstat output for raid5:
> [root@spare-a17484327407661 rovchinnikov]# cat /proc/mdstat
> Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
> md1 : active raid5 sdh[3] sdg[2] sdf[1] sde[0]
>       5274099264 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
>       bitmap: 7/7 pages [28KB], 131072KB chunk
>
>
> for both cases, the same fio config:
> [root@spare-a17484327407661 tests]# cat randwrite.conf
> [randwrite]
> blocksize=4k
> #blocksize=64k
> filename=/dev/md1
> #filename=/dev/md2
> readwrite=randwrite
> #rwmixread=75
> direct=1
> buffered=0
> ioengine=libaio
> iodepth=512
> #numjobs=4
> group_reporting=1
>
> From iostat, the hard drives are seeing more write requests than md1
> (compare 40-43 w/s on md1 with ~60 per device):
> [root@spare-a17484327407661 rovchinnikov]# iostat -d -xk 1 /dev/md1 /dev/sde /dev/sdf /dev/sdg /dev/sdh
> Linux 3.10.0-327.el7.x86_64 (spare-a17484327407661.sgdc)        06/12/2017      _x86_64_        (40 CPU)
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sde               0.00     0.00    0.00   59.00     0.00   236.00     8.00     0.76   12.76    0.00   12.76  12.75  75.20
> sdh               0.00     0.00    0.00   59.00     0.00   236.00     8.00     0.76   12.85    0.00   12.85  12.88  76.00
> sdf               0.00     0.00    0.00   62.00     0.00   248.00     8.00     0.78   12.89    0.00   12.89  12.58  78.00
> sdg               0.00     0.00    0.00   62.00     0.00   248.00     8.00     0.77   12.71    0.00   12.71  12.45  77.20
> md1               0.00     0.00    0.00   40.00     0.00   160.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sde               0.00     0.00    0.00   66.00     0.00   264.00     8.00     0.80   12.39    0.00   12.39  12.09  79.80
> sdh               0.00     0.00    0.00   62.00     0.00   248.00     8.00     0.78   12.87    0.00   12.87  12.58  78.00
> sdf               0.00     0.00    0.00   66.00     0.00   264.00     8.00     0.78   11.82    0.00   11.82  11.82  78.00
> sdg               0.00     0.00    0.00   62.00     0.00   248.00     8.00     0.80   12.82    0.00   12.82  12.85  79.70
> md1               0.00     0.00    0.00   43.00     0.00   172.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sde               0.00     0.00    0.00   65.00     0.00   260.00     8.00     0.81   12.43    0.00   12.43  12.40  80.60
> sdh               0.00     0.00    0.00   58.00     0.00   232.00     8.00     0.74   12.81    0.00   12.81  12.78  74.10
> sdf               0.00     0.00    0.00   71.00     0.00   284.00     8.00     0.81   11.38    0.00   11.38  11.34  80.50
> sdg               0.00     0.00    0.00   64.00     0.00   256.00     8.00     0.82   12.77    0.00   12.77  12.73  81.50
> md1               0.00     0.00    0.00   43.00     0.00   172.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
>
> I don't see any good explanation for this; kindly awaiting your advice.
>

I really did want to see the multi-dimensional collection of data
points, rather than just one.  It is hard to see patterns in a single
number.

RAID5 and RAID10 are not directly comparable.
For every block written to the array, RAID10 writes 2 blocks, and RAID5
writes 1.33 (on average).  So you would expect about 50% more block
writes on a (4-device) RAID10.
Also, each bit in the bitmap for RAID10 covers less space, so you get
more bitmap updates.
I don't think that quite covers the whole difference, though.
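
As a rough sanity check on those figures (illustrative arithmetic only,
reading the 1.33 as 4 blocks written per 3 blocks of data on a
4-device RAID5):

  # RAID10 with 2 near-copies writes every data block twice;
  # RAID5 over 4 devices writes 4 blocks per 3 blocks of data.
  echo "scale=2; 2 / (4/3)" | bc   # => 1.50, i.e. ~50% more block writes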

40 writes to the array turning into 60 writes to each device is a
little high.  I think that is the worst case.
Every write to the array updates the bitmap on all devices, and the
data on 2 devices.
So it seems like every write is being handled synchronously, with no
write combining.  Normally multiple bitmap updates are combined into a
single write.
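
The iostat numbers quoted above are consistent with that worst case.
A rough back-of-the-envelope check (assuming one bitmap write per
member per array write, and the 2 data copies spread evenly over the
4 members):

  array_w=40                     # w/s reported for md1
  data_w=$(( array_w * 2 / 4 ))  # 2 copies over 4 members -> 20 w/s per disk
  bitmap_w=$array_w              # one bitmap update on every member -> 40 w/s
  echo $(( data_w + bitmap_w ))  # => 60, roughly what each sd* device shows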

Having only one job doing direct IO could quite possibly cause this
worst-case performance (though I don't know the details of how fio
works).

Try using buffered IO (which easily allows more parallelism), and try
multiple concurrent threads.
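
For example, a variant of the randwrite.conf above along these lines
(only a sketch: the [randwrite-buffered] job name and the particular
values are placeholders; direct, ioengine, numjobs and group_reporting
are standard fio options):

[randwrite-buffered]
# buffered IO instead of O_DIRECT
direct=0
# plain pwrite()s; Linux native AIO is only truly asynchronous with
# O_DIRECT, so get concurrency from multiple jobs instead
ioengine=psync
numjobs=8
blocksize=4k
filename=/dev/md1
readwrite=randwrite
group_reporting=1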

NeilBrown


