Re: random_generator=lfsr overhead with more disks?

Sorry for the delay here. Please see below inline.

> On Mar 17, 2018, at 1:42 AM, Sitsofe Wheeler <sitsofe@xxxxxxxxx> wrote:
> 
> Could you repeat the problem on a recent version of fio (see
> https://github.com/axboe/fio/releases for what we're up to)?

Sure. Here are the results with the latest fio.

With LFSR

[root@sm28 fio-master]# fio --name=global --thread=1 --direct=1 --group_reporting=1 --iomem_align=4k --name=PT7 --rw=randrw --rwmixread=100 --iodepth=40 --numjobs=8 --bs=4096 --size=450GiB --runtime=120 --filename='/dev/e8b0:/dev/e8b1:/dev/e8b2:/dev/e8b3:/dev/e8b4:/dev/e8b5:/dev/e8b6:/dev/e8b7:/dev/e8b8:/dev/e8b9:/dev/e8b10:/dev/e8b11:/dev/e8b12:/dev/e8b13:/dev/e8b14:/dev/e8b15' --ioengine=libaio --numa_cpu_nodes=0 --random_generator=lfsr
PT7: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=40
...
fio-3.5-80-gb348
Starting 8 threads
Jobs: 8 (f=128): [r(8)][100.0%][r=3532MiB/s,w=0KiB/s][r=904k,w=0 IOPS][eta 00m:00s]
PT7: (groupid=0, jobs=8): err= 0: pid=28376: Thu Mar 29 07:35:44 2018
   read: IOPS=895k, BW=3496MiB/s (3666MB/s)(410GiB/120002msec)
    slat (nsec): min=1619, max=964358, avg=3682.85, stdev=4806.21
    clat (usec): min=25, max=13002, avg=353.13, stdev=192.55
     lat (usec): min=35, max=13007, avg=356.96, stdev=192.42
    clat percentiles (usec):
     |  1.00th=[   91],  5.00th=[  141], 10.00th=[  169], 20.00th=[  206],
     | 30.00th=[  239], 40.00th=[  269], 50.00th=[  306], 60.00th=[  347],
     | 70.00th=[  400], 80.00th=[  474], 90.00th=[  603], 95.00th=[  742],
     | 99.00th=[ 1020], 99.50th=[ 1123], 99.90th=[ 1352], 99.95th=[ 1434],
     | 99.99th=[ 1647]
   bw (  KiB/s): min=373440, max=481952, per=12.50%, avg=447480.00, stdev=12566.36, samples=1918
   iops        : min=93360, max=120488, avg=111869.99, stdev=3141.58, samples=1918
  lat (usec)   : 50=0.02%, 100=1.45%, 250=32.29%, 500=49.13%, 750=12.35%
  lat (usec)   : 1000=3.60%
  lat (msec)   : 2=1.15%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=17.66%, sys=49.66%, ctx=9942380, majf=0, minf=4898
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=107404316,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=40

Run status group 0 (all jobs):
   READ: bw=3496MiB/s (3666MB/s), 3496MiB/s-3496MiB/s (3666MB/s-3666MB/s), io=410GiB (440GB), run=120002-120002msec

Disk stats (read/write):
  e8b0: ios=6710489/0, merge=0/0, ticks=1818366/0, in_queue=1822176, util=99.74%
  e8b1: ios=6710487/0, merge=0/0, ticks=1993186/0, in_queue=1995781, util=99.75%
  e8b2: ios=6710490/0, merge=0/0, ticks=2076054/0, in_queue=2080427, util=99.76%
  e8b3: ios=6710491/0, merge=0/0, ticks=2101963/0, in_queue=2107744, util=99.80%
  e8b4: ios=6710493/0, merge=0/0, ticks=2167111/0, in_queue=2169552, util=99.79%
  e8b5: ios=6710496/0, merge=0/0, ticks=2149837/0, in_queue=2153109, util=99.85%
  e8b6: ios=6710497/0, merge=0/0, ticks=1966688/0, in_queue=1970940, util=99.85%
  e8b7: ios=6710496/0, merge=0/0, ticks=1984307/0, in_queue=1989317, util=99.87%
  e8b8: ios=6710497/0, merge=0/0, ticks=1985081/0, in_queue=1989662, util=99.88%
  e8b9: ios=6710498/0, merge=0/0, ticks=1995815/0, in_queue=2000669, util=99.92%
  e8b10: ios=6710498/0, merge=0/0, ticks=2005176/0, in_queue=2009368, util=99.94%
  e8b11: ios=6710498/0, merge=0/0, ticks=2022758/0, in_queue=2027682, util=99.99%
  e8b12: ios=6710499/0, merge=0/0, ticks=1996747/0, in_queue=2001118, util=100.00%
  e8b13: ios=6710502/0, merge=0/0, ticks=2034211/0, in_queue=2039490, util=100.00%
  e8b14: ios=6710502/0, merge=0/0, ticks=2035394/0, in_queue=2040469, util=100.00%
  e8b15: ios=6710505/0, merge=0/0, ticks=2010598/0, in_queue=2017014, util=100.00%


Without LFSR

[root@sm28 fio-master]# fio --name=global --thread=1 --direct=1 --group_reporting=1 --iomem_align=4k --name=PT7 --rw=randrw --rwmixread=100 --iodepth=40 --numjobs=8 --bs=4096 --size=450GiB --runtime=120 --filename='/dev/e8b0:/dev/e8b1:/dev/e8b2:/dev/e8b3:/dev/e8b4:/dev/e8b5:/dev/e8b6:/dev/e8b7:/dev/e8b8:/dev/e8b9:/dev/e8b10:/dev/e8b11:/dev/e8b12:/dev/e8b13:/dev/e8b14:/dev/e8b15' --ioengine=libaio --numa_cpu_nodes=0
PT7: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=40
...
fio-3.5-80-gb348
Starting 8 threads
Jobs: 8 (f=128): [r(8)][100.0%][r=3729MiB/s,w=0KiB/s][r=955k,w=0 IOPS][eta 00m:00s]
PT7: (groupid=0, jobs=8): err= 0: pid=28564: Thu Mar 29 07:40:09 2018
   read: IOPS=943k, BW=3684MiB/s (3863MB/s)(432GiB/120007msec)
    slat (nsec): min=1656, max=1601.7k, avg=4342.55, stdev=6792.60
    clat (usec): min=31, max=12646, avg=333.84, stdev=102.52
     lat (usec): min=38, max=12648, avg=338.34, stdev=101.86
    clat percentiles (usec):
     |  1.00th=[  120],  5.00th=[  167], 10.00th=[  204], 20.00th=[  249],
     | 30.00th=[  281], 40.00th=[  310], 50.00th=[  334], 60.00th=[  363],
     | 70.00th=[  392], 80.00th=[  420], 90.00th=[  457], 95.00th=[  482],
     | 99.00th=[  545], 99.50th=[  594], 99.90th=[  865], 99.95th=[  955],
     | 99.99th=[ 1254]
   bw (  KiB/s): min=390520, max=518192, per=12.50%, avg=471560.73, stdev=13752.55, samples=1914
   iops        : min=97630, max=129548, avg=117890.14, stdev=3438.13, samples=1914
  lat (usec)   : 50=0.01%, 100=0.37%, 250=20.22%, 500=76.31%, 750=2.87%
  lat (usec)   : 1000=0.20%
  lat (msec)   : 2=0.03%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=23.47%, sys=59.78%, ctx=10930907, majf=0, minf=299365
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=113187098,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=40

Run status group 0 (all jobs):
   READ: bw=3684MiB/s (3863MB/s), 3684MiB/s-3684MiB/s (3863MB/s-3863MB/s), io=432GiB (464GB), run=120007-120007msec

Disk stats (read/write):
  e8b0: ios=7070562/0, merge=0/0, ticks=1862741/0, in_queue=1868144, util=99.78%
  e8b1: ios=7070570/0, merge=0/0, ticks=1996123/0, in_queue=1999170, util=99.79%
  e8b2: ios=7070580/0, merge=0/0, ticks=2019351/0, in_queue=2024920, util=99.78%
  e8b3: ios=7070581/0, merge=0/0, ticks=2018430/0, in_queue=2024167, util=99.80%
  e8b4: ios=7070585/0, merge=0/0, ticks=2069985/0, in_queue=2072530, util=99.78%
  e8b5: ios=7070586/0, merge=0/0, ticks=2032085/0, in_queue=2035097, util=99.81%
  e8b6: ios=7070586/0, merge=0/0, ticks=1838167/0, in_queue=1842676, util=99.79%
  e8b7: ios=7070589/0, merge=0/0, ticks=1838162/0, in_queue=1843175, util=99.83%
  e8b8: ios=7070587/0, merge=0/0, ticks=1837259/0, in_queue=1842283, util=99.86%
  e8b9: ios=7070595/0, merge=0/0, ticks=1836983/0, in_queue=1841398, util=99.89%
  e8b10: ios=7070601/0, merge=0/0, ticks=1835967/0, in_queue=1840679, util=99.91%
  e8b11: ios=7070605/0, merge=0/0, ticks=1835374/0, in_queue=1840121, util=99.94%
  e8b12: ios=7070609/0, merge=0/0, ticks=1834964/0, in_queue=1839546, util=99.98%
  e8b13: ios=7070609/0, merge=0/0, ticks=1835191/0, in_queue=1839604, util=100.00%
  e8b14: ios=7070608/0, merge=0/0, ticks=1835130/0, in_queue=1840084, util=100.00%
  e8b15: ios=7070611/0, merge=0/0, ticks=1835606/0, in_queue=1842956, util=100.00%


Initially I thought there was a clear improvement here, since the latency gap is smaller now. But then I decided to follow your advice and strip the command down to the bare minimum.

>  It would also
> help if you strip the line you are using down to the bare minimum that
> still shows the problem (e.g. if you can remove numa, lock it to CPUs
> make it happen on a pure randread workload etc).


Now, when it comes to removing flags: I didn't want to remove the --numa flag, because in the E8 architecture that may adversely affect the latency standard deviation. The reason is that fio may occasionally get scheduled onto the same core where the e8 driver is running. So the two flags I could drop were --iomem_align and --size. After experimenting a bit, I settled on running a series of tests with every permutation of the two flags, including one without either. Each permutation was repeated 3 times and the latencies recorded; a sketch of the loop appears after the base commands below. Nothing else was running on either the storage controller or the host during the testing.

The base commands were
fio --name=global --thread=1 --direct=1 --group_reporting=1 --name=PT7 --rw=randrw --rwmixread=100 --iodepth=40 --numjobs=8 --bs=4096 --runtime=120 --filename='/dev/e8b0:/dev/e8b1:/dev/e8b2:/dev/e8b3:/dev/e8b4:/dev/e8b5:/dev/e8b6:/dev/e8b7:/dev/e8b8:/dev/e8b9:/dev/e8b10:/dev/e8b11:/dev/e8b12:/dev/e8b13:/dev/e8b14:/dev/e8b15' --ioengine=libaio --random_generator=lfsr

and 

fio --name=global --thread=1 --direct=1 --group_reporting=1 --name=PT7 --rw=randrw --rwmixread=100 --iodepth=40 --numjobs=8 --bs=4096 --runtime=120 --filename='/dev/e8b0:/dev/e8b1:/dev/e8b2:/dev/e8b3:/dev/e8b4:/dev/e8b5:/dev/e8b6:/dev/e8b7:/dev/e8b8:/dev/e8b9:/dev/e8b10:/dev/e8b11:/dev/e8b12:/dev/e8b13:/dev/e8b14:/dev/e8b15' --ioengine=libaio

The only difference between the two is whether --random_generator=lfsr is passed.
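For context: as far as I understand, --random_generator=lfsr replaces fio's default tausworthe generator, plus the block map fio otherwise keeps to guarantee each block is hit exactly once, with a bare linear-feedback shift register, so the per-IO cost should be tiny. A toy 16-bit Galois LFSR in shell, just to illustrate the general technique; this is not fio's actual code:

#!/bin/bash
# Toy 16-bit Galois LFSR with taps 16,14,13,11 (mask 0xB400), which gives
# a maximal-length sequence: every non-zero 16-bit value is visited exactly
# once before it repeats, so no separate coverage bitmap is needed.
lfsr=$((0xACE1))                 # arbitrary non-zero seed
for i in 1 2 3 4 5; do
    lsb=$((lfsr & 1))            # bit that falls off the end
    lfsr=$((lfsr >> 1))          # one shift per step...
    if ((lsb)); then
        lfsr=$((lfsr ^ 0xB400))  # ...plus a conditional xor of the taps
    fi
    printf 'pseudo-random block index: %d\n' "$lfsr"
done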
The iomem and size flags, when added, were --iomem_align=4k and --size=450GiB.
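Roughly, the permutation runs were driven like this (a sketch, not the exact script; the DEVS variable is mine, and the device list is abbreviated here, the real runs listed all 16 devices as in the commands above):

#!/bin/bash
DEVS='/dev/e8b0:/dev/e8b1'   # ...through /dev/e8b15 in the real runs
for gen in '' '--random_generator=lfsr'; do
    # the four permutations of the two optional flags
    for extra in '' '--size=450GiB' '--iomem_align=4k' '--iomem_align=4k --size=450GiB'; do
        for rep in 1 2 3; do   # each permutation repeated 3 times
            # $gen/$extra intentionally unquoted so they expand to separate flags
            fio --name=global --thread=1 --direct=1 --group_reporting=1 \
                --name=PT7 --rw=randrw --rwmixread=100 --iodepth=40 \
                --numjobs=8 --bs=4096 --runtime=120 --filename="$DEVS" \
                --ioengine=libaio $gen $extra
        done
    done
done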

Here are the latency numbers I got (one figure per run; these are the average completion latencies in usec):

LFSR

+ iomem + size : 348.07, 347.68, 347.59
neither        : 344.28, 344.84, 345.04
+ size         : 349.28, 348.37, 348.37
+ iomem        : 346.81, 346.05, 344.74

NO LFSR

+ iomem + size : 344.03, 344.29, 345.47
neither        : 347.05, 346.00, 346.03
+ size         : 345.43, 343.55, 343.98
+ iomem        : 347.87, 347.02, 347.84

It appears that with both flags, LFSR is still slightly behind NO LFSR. With both flags omitted, the picture reverses.
Looking at the overall picture, I cannot identify a clear winner here; the differences look like they are within measurement error.
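For what it's worth, here is a quick way to put numbers on the spread, computing the mean and sample standard deviation of each triple (values pasted from the tables above; the labels are mine):

#!/bin/bash
while read -r label a b c; do
    awk -v l="$label" -v a="$a" -v b="$b" -v c="$c" 'BEGIN {
        m = (a + b + c) / 3
        # sample standard deviation (n - 1 = 2 in the denominator)
        sd = sqrt(((a - m)^2 + (b - m)^2 + (c - m)^2) / 2)
        printf "%-18s mean=%.2f sd=%.2f\n", l, m, sd
    }'
done <<'EOF'
lfsr+iomem+size   348.07 347.68 347.59
lfsr+neither      344.28 344.84 345.04
lfsr+size         349.28 348.37 348.37
lfsr+iomem        346.81 346.05 344.74
nolfsr+iomem+size 344.03 344.29 345.47
nolfsr+neither    347.05 346.00 346.03
nolfsr+size       345.43 343.55 343.98
nolfsr+iomem      347.87 347.02 347.84
EOF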




> If it
> happens there we could do with the output from Linux's perf

I'm not sure what exactly "Linux's perf" refers to.


Michael
