On Tue, Aug 6, 2019 at 1:46 AM NeilBrown <neilb@xxxxxxxx> wrote: > > On Mon, Aug 05 2019, Jinpu Wang wrote: > > > Hi Neil, > > > > For the md higher write IO latency problem, I bisected it to these commits: > > > > 4ad23a97 MD: use per-cpu counter for writes_pending > > 210f7cd percpu-refcount: support synchronous switch to atomic mode. > > > > Do you maybe have an idea? How can we fix it? > > Hmmm.... not sure. Hi Neil, Thanks for reply, detailed result in line. > > My guess is that the set_in_sync() call from md_check_recovery() > is taking a long time, and is being called too often. > > Could you try two experiments please. > Baseline on 5.3-rc3: root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage_2019_0806_092003.log write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 fio-2.2.10 Starting 1 process write-test: (groupid=0, jobs=1): err= 0: pid=2621: Tue Aug 6 09:20:44 2019 write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec slat (usec): min=2, max=69992, avg= 5.37, stdev=374.95 clat (usec): min=0, max=147, avg= 2.42, stdev=13.57 lat (usec): min=2, max=70079, avg= 7.84, stdev=376.07 clat percentiles (usec): | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 1], | 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1], | 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1], | 99.00th=[ 96], 99.50th=[ 125], 99.90th=[ 137], 99.95th=[ 139], | 99.99th=[ 141] bw (KB /s): min=18454, max=21608, per=100.00%, avg=20005.15, stdev=352.24 lat (usec) : 2=98.52%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.06% lat (usec) : 100=0.46%, 250=0.94% cpu : usr=4.64%, sys=0.00%, ctx=197118, majf=0, minf=11 IO depths : 1=98.5%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=1.3%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=32 Run status group 0 (all jobs): WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s, mint=40001msec, maxt=40001msec Disk stats (read/write): md0: ios=60/199436, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% > 1/ set /sys/block/md0/md/safe_mode_delay > to 20 or more. It defaults to about 0.2. only set 20 to safe_mode_delay, give a nice improvement. root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage_2019_0806_092144.log write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 fio-2.2.10 Starting 1 process write-test: (groupid=0, jobs=1): err= 0: pid=2676: Tue Aug 6 09:22:25 2019 write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec slat (usec): min=2, max=99490, avg= 2.98, stdev=222.46 clat (usec): min=0, max=103, avg= 0.96, stdev= 4.51 lat (usec): min=2, max=99581, avg= 3.99, stdev=222.71 clat percentiles (usec): | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0], | 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1], | 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1], | 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 90], 99.95th=[ 91], | 99.99th=[ 95] bw (KB /s): min=20000, max=20008, per=100.00%, avg=20001.82, stdev= 3.38 lat (usec) : 2=99.72%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01% lat (usec) : 100=0.25%, 250=0.01% cpu : usr=3.17%, sys=1.48%, ctx=199470, majf=0, minf=11 IO depths : 1=99.7%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=32 Run status group 0 (all jobs): WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s, mint=40001msec, maxt=40001msec Disk stats (read/write): md0: ios=60/199461, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% > > 2/ comment out the call the set_in_sync() in md_check_recovery(). Only commented out set_in_sync get a better improvement root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage+_2019_0806_093340.log write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 fio-2.2.10 Starting 1 process write-test: (groupid=0, jobs=1): err= 0: pid=2626: Tue Aug 6 09:34:20 2019 write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec slat (usec): min=2, max=29, avg= 2.49, stdev= 0.72 clat (usec): min=0, max=101, avg= 0.78, stdev= 1.17 lat (usec): min=2, max=117, avg= 3.34, stdev= 1.25 clat percentiles (usec): | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0], | 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1], | 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1], | 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 1], 99.95th=[ 1], | 99.99th=[ 72] bw (KB /s): min=20000, max=20008, per=100.00%, avg=20002.03, stdev= 3.50 lat (usec) : 2=99.96%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01% lat (usec) : 100=0.02%, 250=0.01% cpu : usr=4.17%, sys=0.00%, ctx=199951, majf=0, minf=12 IO depths : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=32 Run status group 0 (all jobs): WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s, mint=40001msec, maxt=40001msec Disk stats (read/write): md0: ios=60/199435, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% With both applied root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage+_2019_0806_093916.log write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 fio-2.2.10 Starting 1 process write-test: (groupid=0, jobs=1): err= 0: pid=2709: Tue Aug 6 09:39:57 2019 write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec slat (usec): min=2, max=16, avg= 2.46, stdev= 0.69 clat (usec): min=0, max=100, avg= 0.61, stdev= 1.18 lat (usec): min=2, max=104, avg= 3.12, stdev= 1.33 clat percentiles (usec): | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0], | 30.00th=[ 0], 40.00th=[ 0], 50.00th=[ 1], 60.00th=[ 1], | 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1], | 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 1], 99.95th=[ 1], | 99.99th=[ 70] bw (KB /s): min=20000, max=20008, per=100.00%, avg=20002.73, stdev= 3.82 lat (usec) : 2=99.96%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01% lat (usec) : 100=0.02%, 250=0.01% cpu : usr=3.33%, sys=1.31%, ctx=199941, majf=0, minf=12 IO depths : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=32 Run status group 0 (all jobs): WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s, mint=40001msec, maxt=40001msec Disk stats (read/write): md0: ios=60/199460, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% > > Then run the least separately after each of these changes. > > I the second one makes a difference, I'd like to know how often it gets > called - and why. The test > if ( ! ( > (mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) || > test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) || > test_bit(MD_RECOVERY_DONE, &mddev->recovery) || > (mddev->external == 0 && mddev->safemode == 1) || > (mddev->safemode == 2 > && !mddev->in_sync && mddev->recovery_cp == MaxSector) > )) > return; > > should normally return when doing lots of IO - I'd like to know > which condition causes it to not return. I will check, and report later today. Thanks again! Jack Wang > > Thanks, > NeilBrown