Re: help understanding the output of fio

It looks like you are running fio against a local directory within your
zpool. Your performance is really bad, but zpool parameters can affect
the result as well, so this is not a pure test of your SSD. You should
run such a test to rule out hardware or block device driver issues. A
read-only test against the raw device will not affect your zpool, but
please take care to ensure all tests against the NVMe drive itself are
read-only. Also, one small point: ZFS does not support direct_io *yet*.
Running a direct_io job against a ZFS directory will give terrible and
unpredictable results. You can run direct_io against raw block devices
and against XFS and ext4 file systems.

Examples:
[sequential read-only]
fio --name=seqread --numjobs=1 --time_based --runtime=60s \
    --ramp_time=2s --iodepth=8 --ioengine=libaio --direct=1 --verify=0 \
    --group_reporting=1 --bs=1M --rw=read --filename=/dev/${my_ssd}

[random read-only]
fio --name=randomread --numjobs=1 --time_based --runtime=60s \
    --ramp_time=2s --iodepth=8 --ioengine=libaio --direct=1 --verify=0 \
    --group_reporting=1 --bs=4k --rw=randread --filename=/dev/${my_ssd}

If you do not see sequential read performance or random IOPS within
70% of the manufacturer's advertised specs for your drive, start by
reviewing dmesg/messages and checking drive link states (PCIe width
and speed for NVMe, SATA link speed for SATA).
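
For example, to check the link states (nvme0, the PCIe address, and sdX
below are placeholders for your hardware):

[NVMe: PCIe link speed/width]
cat /sys/class/nvme/nvme0/device/current_link_speed
cat /sys/class/nvme/nvme0/device/current_link_width
lspci -vv -s <pcie-address-of-drive> | grep -E 'LnkCap|LnkSta'

[SATA: negotiated link speed]
dmesg | grep -i 'SATA link'
smartctl -a /dev/sdX | grep -i 'SATA Version'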

--Jeff


On Wed, Apr 3, 2024 at 9:28 PM Damien Le Moal <dlemoal@xxxxxxxxxx> wrote:
>
> On 4/4/24 04:21, Felix Rubio wrote:
> > Hi Patrick,
> >
> > Thank you for your answer. I am quite lost with the results shown by
> > fio: I cannot make sense of them. I have a system with 3 ZFS pools:
> > one has a single SSD, another has 2 striped SSDs, and the third has a
> > single HDD. On the first pool I am running this command:
> >
> > fio --numjobs=1 --size=1024M --time_based --runtime=60s --ramp_time=2s
> > --ioengine=libaio --direct=1 --verify=0 --group_reporting=1 --bs=1M
> > --rw=write
>
> You did not specify --iodepth. If you want to run at QD=1, then use
> --ioengine=psync. For QD > 1, use --ioengine=libaio --iodepth=X (X > 1). You can
> use --ioengine=io_uring as well.
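>
> For example (illustrative only; /path/to/testfile is a placeholder for
> whatever target you are testing):
>
> [QD=1, synchronous]
> fio --name=write_qd1 --ioengine=psync --rw=write --bs=1M \
>     --size=1024M --filename=/path/to/testfile
>
> [QD=8 via libaio]
> fio --name=write_qd8 --ioengine=libaio --iodepth=8 --rw=write --bs=1M \
>     --size=1024M --filename=/path/to/testfile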
>
> Also, there is no filename= option here. Are you running this workload on the
> ZFS file system, writing to a file? Or are you running this against the SSD
> block device file?
>
> >
> > And I am getting this result:
> >
> > write_throughput_i: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W)
> > 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
> > fio-3.33
> > Starting 1 process
> > write_throughput_i: Laying out IO file (1 file / 1024MiB)
> > Jobs: 1 (f=1): [W(1)][21.1%][eta 03m:56s]
> > write_throughput_i: (groupid=0, jobs=1): err= 0: pid=235368: Wed Apr  3
> > 21:01:16 2024
> >    write: IOPS=3, BW=3562KiB/s (3647kB/s)(211MiB/60664msec); 0 zone
> > resets
>
> Your drive does 3 I/Os per second... That is very slow... Did you have a look at
> dmesg to see if there is anything going on with the driver?
>
> You may want to start by testing 4K writes and increase the block size from
> there (8k, 16k, 32k, ...) to see if there is a value that triggers the slow
> behavior.
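>
> For example, a quick sweep (the filename below is a placeholder; point
> it at a file in the affected pool):
>
> for bs in 4k 8k 16k 32k 64k 128k 1M; do
>     fio --name=bs_sweep_${bs} --ioengine=psync --rw=write --bs=${bs} \
>         --size=256M --filename=/pool1/fio-testfile
> done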
>
> But this all looks like a HW issue or an FS/driver issue. What type of SSD is
> this? NVMe? ATA?
>
> >      slat (msec): min=16, max=1777, avg=287.56, stdev=282.15
> >      clat (nsec): min=10930, max=31449, avg=12804.00, stdev=3289.32
> >       lat (msec): min=16, max=1777, avg=288.86, stdev=282.20
> >      clat percentiles (nsec):
> >       |  1.00th=[11072],  5.00th=[11328], 10.00th=[11456],
> > 20.00th=[11584],
> >       | 30.00th=[11840], 40.00th=[11840], 50.00th=[12096],
> > 60.00th=[12224],
> >       | 70.00th=[12352], 80.00th=[12736], 90.00th=[13248],
> > 95.00th=[13760],
> >       | 99.00th=[29568], 99.50th=[30848], 99.90th=[31360],
> > 99.95th=[31360],
> >       | 99.99th=[31360]
> >     bw (  KiB/s): min= 2048, max=51302, per=100.00%, avg=4347.05,
> > stdev=6195.46, samples=99
> >     iops        : min=    2, max=   50, avg= 4.24, stdev= 6.04,
> > samples=99
> >    lat (usec)   : 20=96.19%, 50=3.81%
> >    cpu          : usr=0.10%, sys=0.32%, ctx=1748, majf=0, minf=37
> >    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> >  >=64=0.0%
> >       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >  >=64=0.0%
> >       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >  >=64=0.0%
> >       issued rwts: total=0,210,0,0 short=0,0,0,0 dropped=0,0,0,0
> >       latency   : target=0, window=0, percentile=100.00%, depth=1
> >
> > Run status group 0 (all jobs):
> >    WRITE: bw=3562KiB/s (3647kB/s), 3562KiB/s-3562KiB/s
> > (3647kB/s-3647kB/s), io=211MiB (221MB), run=60664-60664msec
> >
> > For a read test, running the same parameters (changing rw=write by
> > rw=read), I get:
> >
> > read_throughput_i: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W)
> > 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
> > fio-3.33
> > Starting 1 process
> > read_throughput_i: Laying out IO file (1 file / 1024MiB)
> > Jobs: 1 (f=1): [R(1)][100.0%][r=1229MiB/s][r=1229 IOPS][eta 00m:00s]
> > read_throughput_i: (groupid=0, jobs=1): err= 0: pid=245501: Wed Apr  3
> > 21:14:27 2024
> >    read: IOPS=1283, BW=1284MiB/s (1346MB/s)(75.2GiB/60001msec)
> >      slat (usec): min=371, max=27065, avg=757.71, stdev=185.43
> >      clat (usec): min=9, max=358, avg=12.44, stdev= 3.93
> >       lat (usec): min=383, max=27078, avg=770.15, stdev=185.66
> >      clat percentiles (usec):
> >       |  1.00th=[   11],  5.00th=[   12], 10.00th=[   12], 20.00th=[
> > 12],
> >       | 30.00th=[   12], 40.00th=[   12], 50.00th=[   13], 60.00th=[
> > 13],
> >       | 70.00th=[   13], 80.00th=[   13], 90.00th=[   13], 95.00th=[
> > 14],
> >       | 99.00th=[   24], 99.50th=[   29], 99.90th=[   81], 99.95th=[
> > 93],
> >       | 99.99th=[  105]
> >     bw (  MiB/s): min= 1148, max= 1444, per=100.00%, avg=1285.12,
> > stdev=86.12, samples=119
> >     iops        : min= 1148, max= 1444, avg=1284.97, stdev=86.05,
> > samples=119
> >    lat (usec)   : 10=0.09%, 20=98.90%, 50=0.73%, 100=0.26%, 250=0.02%
> >    lat (usec)   : 500=0.01%
> >    cpu          : usr=3.14%, sys=95.75%, ctx=6274, majf=0, minf=37
> >    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> >  >=64=0.0%
> >       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >  >=64=0.0%
> >       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >  >=64=0.0%
> >       issued rwts: total=77036,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >       latency   : target=0, window=0, percentile=100.00%, depth=1
> >
> > Run status group 0 (all jobs):
> >     READ: bw=1284MiB/s (1346MB/s), 1284MiB/s-1284MiB/s
> > (1346MB/s-1346MB/s), io=75.2GiB (80.8GB), run=60001-60001msec
> >
> >
> > If I understand these results properly, I am getting ~4 MBps of write
> > bandwidth to disk, caused by a submission latency problem, but when
> > reading sequentially I am getting 1.3 GBps (also very weird; previously
> > I shared another test with different parameters, multiple jobs, etc.,
> > but with the same result). I am trying to find out if this analysis is
> > correct and whether it is a software or a hardware error.
> > ---
> > Felix Rubio
> > "Don't believe what you're told. Double check."
> >
> > On 2024-04-03 17:45, Patrick Goetz wrote:
> >> Hi Felix -
> >>
> >> It might be helpful to share your job file or fio command line. Not
> >> sure; I'm a neophyte as well, with many questions, such as what happens
> >> when the number of jobs is less than the iodepth.
> >>
> >> On 4/3/24 00:35, Felix Rubio wrote:
> >>> Hi everybody,
> >>>
> >>> I have started recently to use fio, and I am getting the following
> >>> output for sequential writes:
> >>>
> >>> write_throughput_i: (groupid=0, jobs=16): err= 0: pid=2301660: Tue Apr
> >>> 2 21:03:41 2024
> >>>    write: IOPS=2, BW=5613KiB/s (5748kB/s)(2048MiB/373607msec); 0 zone
> >>> resets
> >>>      slat (msec): min=41549, max=373605, avg=260175.71, stdev=76630.63
> >>>      clat (nsec): min=17445, max=31004, avg=20350.31, stdev=3744.64
> >>>       lat (msec): min=235566, max=373605, avg=318209.63,
> >>> stdev=32743.17
> >>>      clat percentiles (nsec):
> >>>       |  1.00th=[17536],  5.00th=[17536], 10.00th=[17792],
> >>> 20.00th=[17792],
> >>>       | 30.00th=[18048], 40.00th=[18304], 50.00th=[18304],
> >>> 60.00th=[18816],
> >>>       | 70.00th=[21632], 80.00th=[22144], 90.00th=[27008],
> >>> 95.00th=[31104],
> >>>       | 99.00th=[31104], 99.50th=[31104], 99.90th=[31104],
> >>> 99.95th=[31104],
> >>>       | 99.99th=[31104]
> >>>     bw (  MiB/s): min= 2051, max= 2051, per=100.00%, avg=2051.84,
> >>> stdev= 0.00, samples=16
> >>>     iops        : min= 2048, max= 2048, avg=2048.00, stdev= 0.00,
> >>> samples=16
> >>>    lat (usec)   : 20=68.75%, 50=31.25%
> >>>    cpu          : usr=0.00%, sys=0.02%, ctx=8350, majf=13, minf=633
> >>>    IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> >>>> =64=100.0%
> >>>       submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=100.0%,
> >>>  >=64=0.0%
> >>>       complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=100.0%,
> >>>  >=64=0.0%
> >>>       issued rwts: total=0,1024,0,0 short=0,0,0,0 dropped=0,0,0,0
> >>>       latency   : target=0, window=0, percentile=100.00%, depth=64
> >>>
> >>> Run status group 0 (all jobs):
> >>>    WRITE: bw=5613KiB/s (5748kB/s), 5613KiB/s-5613KiB/s
> >>> (5748kB/s-5748kB/s), io=2048MiB (2147MB), run=373607-373607msec
> >>>
> >>> If I understand this correctly, the submission latency (slat) is
> >>> at minimum 41.5 seconds? I am experiencing problems with my SSD disk
> >>> (the performance is pretty low, which this seems to confirm), but now
> >>> I am wondering if this could be a problem with my OS rather than my
> >>> disk, given that slat is the submission latency?
> >>>
> >>> Thank you
> >
>
> --
> Damien Le Moal
> Western Digital Research
>
>


-- 
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.johnson@xxxxxxxxxxxxxxxxx
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage




