Re: random_generator=lfsr overhead with more disks?

On 29 March 2018 at 08:24, Michael Green <mishagreen@xxxxxxxxx> wrote:
> Sorry for the delay here. Please see below inline.
>
>> On Mar 17, 2018, at 1:42 AM, Sitsofe Wheeler <sitsofe@xxxxxxxxx> wrote:
>>
>> Could you repeat the problem on a recent version of fio (see
>> https://github.com/axboe/fio/releases for what we're up to)?
>
> Sure. Here are the results with latest fio.
>
> With LFSR
>
> [root@sm28 fio-master]# fio --name=global --thread=1 --direct=1 --group_reporting=1 --iomem_align=4k --name=PT7 --rw=randrw --rwmixread=100 --iodepth=40 --numjobs=8 --bs=4096 --size=450GiB --runtime=120 --filename='/dev/e8b0:/dev/e8b1:/dev/e8b2:/dev/e8b3:/dev/e8b4:/dev/e8b5:/dev/e8b6:/dev/e8b7:/dev/e8b8:/dev/e8b9:/dev/e8b10:/dev/e8b11:/dev/e8b12:/dev/e8b13:/dev/e8b14:/dev/e8b15' --ioengine=libaio --numa_cpu_nodes=0 --random_generator=lfsr
> PT7: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=40
> ...
> fio-3.5-80-gb348
> Starting 8 threads
> Jobs: 8 (f=128): [r(8)][100.0%][r=3532MiB/s,w=0KiB/s][r=904k,w=0 IOPS][eta 00m:00s]
> PT7: (groupid=0, jobs=8): err= 0: pid=28376: Thu Mar 29 07:35:44 2018
>    read: IOPS=895k, BW=3496MiB/s (3666MB/s)(410GiB/120002msec)
>     slat (nsec): min=1619, max=964358, avg=3682.85, stdev=4806.21
>     clat (usec): min=25, max=13002, avg=353.13, stdev=192.55
>      lat (usec): min=35, max=13007, avg=356.96, stdev=192.42
>     clat percentiles (usec):
>      |  1.00th=[   91],  5.00th=[  141], 10.00th=[  169], 20.00th=[  206],
>      | 30.00th=[  239], 40.00th=[  269], 50.00th=[  306], 60.00th=[  347],
>      | 70.00th=[  400], 80.00th=[  474], 90.00th=[  603], 95.00th=[  742],
>      | 99.00th=[ 1020], 99.50th=[ 1123], 99.90th=[ 1352], 99.95th=[ 1434],
>      | 99.99th=[ 1647]
>    bw (  KiB/s): min=373440, max=481952, per=12.50%, avg=447480.00, stdev=12566.36, samples=1918
>    iops        : min=93360, max=120488, avg=111869.99, stdev=3141.58, samples=1918
>   lat (usec)   : 50=0.02%, 100=1.45%, 250=32.29%, 500=49.13%, 750=12.35%
>   lat (usec)   : 1000=3.60%
>   lat (msec)   : 2=1.15%, 4=0.01%, 10=0.01%, 20=0.01%
>   cpu          : usr=17.66%, sys=49.66%, ctx=9942380, majf=0, minf=4898
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued rwts: total=107404316,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=40
>
> Run status group 0 (all jobs):
>    READ: bw=3496MiB/s (3666MB/s), 3496MiB/s-3496MiB/s (3666MB/s-3666MB/s), io=410GiB (440GB), run=120002-120002msec
>
> Disk stats (read/write):
>   e8b0: ios=6710489/0, merge=0/0, ticks=1818366/0, in_queue=1822176, util=99.74%
>   e8b1: ios=6710487/0, merge=0/0, ticks=1993186/0, in_queue=1995781, util=99.75%
>   e8b2: ios=6710490/0, merge=0/0, ticks=2076054/0, in_queue=2080427, util=99.76%
>   e8b3: ios=6710491/0, merge=0/0, ticks=2101963/0, in_queue=2107744, util=99.80%
>   e8b4: ios=6710493/0, merge=0/0, ticks=2167111/0, in_queue=2169552, util=99.79%
>   e8b5: ios=6710496/0, merge=0/0, ticks=2149837/0, in_queue=2153109, util=99.85%
>   e8b6: ios=6710497/0, merge=0/0, ticks=1966688/0, in_queue=1970940, util=99.85%
>   e8b7: ios=6710496/0, merge=0/0, ticks=1984307/0, in_queue=1989317, util=99.87%
>   e8b8: ios=6710497/0, merge=0/0, ticks=1985081/0, in_queue=1989662, util=99.88%
>   e8b9: ios=6710498/0, merge=0/0, ticks=1995815/0, in_queue=2000669, util=99.92%
>   e8b10: ios=6710498/0, merge=0/0, ticks=2005176/0, in_queue=2009368, util=99.94%
>   e8b11: ios=6710498/0, merge=0/0, ticks=2022758/0, in_queue=2027682, util=99.99%
>   e8b12: ios=6710499/0, merge=0/0, ticks=1996747/0, in_queue=2001118, util=100.00%
>   e8b13: ios=6710502/0, merge=0/0, ticks=2034211/0, in_queue=2039490, util=100.00%
>   e8b14: ios=6710502/0, merge=0/0, ticks=2035394/0, in_queue=2040469, util=100.00%
>   e8b15: ios=6710505/0, merge=0/0, ticks=2010598/0, in_queue=2017014, util=100.00%
>
>
> Without LFSR
>
> [root@sm28 fio-master]# fio --name=global --thread=1 --direct=1 --group_reporting=1 --iomem_align=4k --name=PT7 --rw=randrw --rwmixread=100 --iodepth=40 --numjobs=8 --bs=4096 --size=450GiB --runtime=120 --filename='/dev/e8b0:/dev/e8b1:/dev/e8b2:/dev/e8b3:/dev/e8b4:/dev/e8b5:/dev/e8b6:/dev/e8b7:/dev/e8b8:/dev/e8b9:/dev/e8b10:/dev/e8b11:/dev/e8b12:/dev/e8b13:/dev/e8b14:/dev/e8b15' --ioengine=libaio --numa_cpu_nodes=0
> PT7: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=40
> ...
> fio-3.5-80-gb348
> Starting 8 threads
> Jobs: 8 (f=128): [r(8)][100.0%][r=3729MiB/s,w=0KiB/s][r=955k,w=0 IOPS][eta 00m:00s]
> PT7: (groupid=0, jobs=8): err= 0: pid=28564: Thu Mar 29 07:40:09 2018
>    read: IOPS=943k, BW=3684MiB/s (3863MB/s)(432GiB/120007msec)
>     slat (nsec): min=1656, max=1601.7k, avg=4342.55, stdev=6792.60
>     clat (usec): min=31, max=12646, avg=333.84, stdev=102.52
>      lat (usec): min=38, max=12648, avg=338.34, stdev=101.86
>     clat percentiles (usec):
>      |  1.00th=[  120],  5.00th=[  167], 10.00th=[  204], 20.00th=[  249],
>      | 30.00th=[  281], 40.00th=[  310], 50.00th=[  334], 60.00th=[  363],
>      | 70.00th=[  392], 80.00th=[  420], 90.00th=[  457], 95.00th=[  482],
>      | 99.00th=[  545], 99.50th=[  594], 99.90th=[  865], 99.95th=[  955],
>      | 99.99th=[ 1254]
>    bw (  KiB/s): min=390520, max=518192, per=12.50%, avg=471560.73, stdev=13752.55, samples=1914
>    iops        : min=97630, max=129548, avg=117890.14, stdev=3438.13, samples=1914
>   lat (usec)   : 50=0.01%, 100=0.37%, 250=20.22%, 500=76.31%, 750=2.87%
>   lat (usec)   : 1000=0.20%
>   lat (msec)   : 2=0.03%, 4=0.01%, 10=0.01%, 20=0.01%
>   cpu          : usr=23.47%, sys=59.78%, ctx=10930907, majf=0, minf=299365
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued rwts: total=113187098,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=40
>
> Run status group 0 (all jobs):
>    READ: bw=3684MiB/s (3863MB/s), 3684MiB/s-3684MiB/s (3863MB/s-3863MB/s), io=432GiB (464GB), run=120007-120007msec
>
> Disk stats (read/write):
>   e8b0: ios=7070562/0, merge=0/0, ticks=1862741/0, in_queue=1868144, util=99.78%
>   e8b1: ios=7070570/0, merge=0/0, ticks=1996123/0, in_queue=1999170, util=99.79%
>   e8b2: ios=7070580/0, merge=0/0, ticks=2019351/0, in_queue=2024920, util=99.78%
>   e8b3: ios=7070581/0, merge=0/0, ticks=2018430/0, in_queue=2024167, util=99.80%
>   e8b4: ios=7070585/0, merge=0/0, ticks=2069985/0, in_queue=2072530, util=99.78%
>   e8b5: ios=7070586/0, merge=0/0, ticks=2032085/0, in_queue=2035097, util=99.81%
>   e8b6: ios=7070586/0, merge=0/0, ticks=1838167/0, in_queue=1842676, util=99.79%
>   e8b7: ios=7070589/0, merge=0/0, ticks=1838162/0, in_queue=1843175, util=99.83%
>   e8b8: ios=7070587/0, merge=0/0, ticks=1837259/0, in_queue=1842283, util=99.86%
>   e8b9: ios=7070595/0, merge=0/0, ticks=1836983/0, in_queue=1841398, util=99.89%
>   e8b10: ios=7070601/0, merge=0/0, ticks=1835967/0, in_queue=1840679, util=99.91%
>   e8b11: ios=7070605/0, merge=0/0, ticks=1835374/0, in_queue=1840121, util=99.94%
>   e8b12: ios=7070609/0, merge=0/0, ticks=1834964/0, in_queue=1839546, util=99.98%
>   e8b13: ios=7070609/0, merge=0/0, ticks=1835191/0, in_queue=1839604, util=100.00%
>   e8b14: ios=7070608/0, merge=0/0, ticks=1835130/0, in_queue=1840084, util=100.00%
>   e8b15: ios=7070611/0, merge=0/0, ticks=1835606/0, in_queue=1842956, util=100.00%
>
>
> Initially, I thought there was a clear improvement here, as the latency gap is smaller now. But then I decided to follow your advice and strip the command down to the bare minimum.
>
>>  It would also
>> help if you strip the line you are using down to the bare minimum that
>> still shows the problem (e.g. if you can remove numa, lock it to CPUs
>> make it happen on a pure randread workload etc).
>
>
> Now, when it comes to removing flags, I didn't want to remove the --numa flag because in the E8 architecture that may adversely affect the latency standard deviation: fio may occasionally get scheduled to run on the same core where the e8 driver is running. So the two flags I could drop were --iomem_align and --size. After experimenting a bit, I eventually ran a series of tests with every combination of the two flags, including a run without either of them. Each combination was repeated 3 times and the latencies recorded. Nothing else was running on either the storage controller or the host during the testing.
>
> base command was
> fio --name=global --thread=1 --direct=1 --group_reporting=1 --name=PT7 --rw=randrw --rwmixread=100 --iodepth=40 --numjobs=8 --bs=4096 --runtime=120 --filename='/dev/e8b0:/dev/e8b1:/dev/e8b2:/dev/e8b3:/dev/e8b4:/dev/e8b5:/dev/e8b6:/dev/e8b7:/dev/e8b8:/dev/e8b9:/dev/e8b10:/dev/e8b11:/dev/e8b12:/dev/e8b13:/dev/e8b14:/dev/e8b15' --ioengine=libaio --random_generator=lfsr
>
> and
>
> fio --name=global --thread=1 --direct=1 --group_reporting=1 --name=PT7 --rw=randrw --rwmixread=100 --iodepth=40 --numjobs=8 --bs=4096 --runtime=120 --filename='/dev/e8b0:/dev/e8b1:/dev/e8b2:/dev/e8b3:/dev/e8b4:/dev/e8b5:/dev/e8b6:/dev/e8b7:/dev/e8b8:/dev/e8b9:/dev/e8b10:/dev/e8b11:/dev/e8b12:/dev/e8b13:/dev/e8b14:/dev/e8b15' --ioengine=libaio
>
> The only difference between the two is whether --random_generator=lfsr is present.
> The iomem and size flags tested were --iomem_align=4k and --size=450GiB.
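>
> In case it is useful, here is a rough, untested sketch of how such a sweep could be scripted (the shortened DEVS value and the output file naming below are placeholders for illustration, not what I actually ran):
>
> #!/bin/bash
> # Hypothetical sweep: every combination of the two optional flags,
> # with and without the LFSR generator, 3 runs each.
> DEVS='/dev/e8b0:/dev/e8b1:/dev/e8b2'   # ...remaining devices elided
> BASE="fio --name=global --thread=1 --direct=1 --group_reporting=1 \
>       --name=PT7 --rw=randrw --rwmixread=100 --iodepth=40 --numjobs=8 \
>       --bs=4096 --runtime=120 --filename=$DEVS --ioengine=libaio"
> i=0
> for gen in "" "--random_generator=lfsr"; do
>   for extra in "" "--iomem_align=4k" "--size=450GiB" "--iomem_align=4k --size=450GiB"; do
>     for run in 1 2 3; do
>       i=$((i+1))
>       $BASE $gen $extra > "sweep_run_${i}.log"
>     done
>   done
> done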
>
> Here are the latency numbers I’ve got:
>
> LFSR
>
> + iomem + size = 348.07, 347.68, 347.59
> nothing        = 344.28, 344.84, 345.04
> + size         = 349.28, 348.37, 348.37
> + iomem        = 346.81, 346.05, 344.74
>
> NO LFSR
>
> + iomem + size = 344.03, 344.29, 345.47
> nothing        = 347.05, 346.00, 346.03
> + size         = 345.43, 343.55, 343.98
> + iomem        = 347.87, 347.02, 347.84
>
> It appears that with both flags, LFSR is still behind no-LFSR, though only minimally. When both flags are omitted, the picture reverses.
> Looking at the overall picture, I cannot identify a clear winner; the differences look as though they are within measurement error.

If you start cutting down the number of disks being used, does that
help or make things worse?
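
For example, something along these lines (untested, adjust the device
path to suit) run against a single disk first:

  fio --name=PT7 --thread=1 --direct=1 --rw=randread --iodepth=40 \
      --numjobs=8 --bs=4096 --runtime=120 --filename=/dev/e8b0 \
      --ioengine=libaio --random_generator=lfsr

then the same line without --random_generator=lfsr, adding devices
back one at a time to see where the gap opens up.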

>> If it
>> happens there we could do with the output from Linux's perf
>
> Not sure what “Linux’s perf” exactly means.

The tool referred to at http://www.brendangregg.com/perf.html and
https://perf.wiki.kernel.org/index.php/Main_Page . It's distributed
with the kernel source, but these days most distros package it.
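
A rough sketch of how it could be used here (exact options may vary
with your kernel/distro):

  perf record -g -- fio <your usual arguments>
  perf report

or run "perf top" in another terminal while fio is going, to see where
the extra CPU time is being spent.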

-- 
Sitsofe | http://sucs.org/~sits/



