Felix,

You're using the psync ioengine with a queue depth of 1. This will slow
ZFS down a lot. Use libaio and a queue depth of 8-16 (a sample command is
inline below, after your read results).

To temporarily disable ZFS ARC caching on your volume, say your benchmark
volume is mypool/myvol mounted at /myvol, you can do:

zfs get all mypool/myvol > ~/myvol-settings-save.txt
zfs set primarycache=none mypool/myvol
zfs set secondarycache=none mypool/myvol

(the valid values for these properties are all, none, and metadata; "off"
is not accepted), then run your fio tests. After you are done benchmarking,
refer to the settings saved in ~/myvol-settings-save.txt to set
primarycache and secondarycache back to their original values.

--Jeff

On Thu, Apr 4, 2024 at 10:41 PM Felix Rubio <felix@xxxxxxxxx> wrote:
>
> Hi Jeff,
>
> I have done the test you suggested, and this is what came up:
>
> read_throughput_rpool: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
> fio-3.33
> Starting 1 process
> read_throughput_rpool: Laying out IO file (1 file / 1024MiB)
> Jobs: 1 (f=1): [R(1)][100.0%][r=904MiB/s][r=903 IOPS][eta 00m:00s]
> read_throughput_rpool: (groupid=0, jobs=1): err= 0: pid=3284411: Fri Apr 5 07:31:18 2024
>   read: IOPS=1092, BW=1092MiB/s (1145MB/s)(64.0GiB/60001msec)
>     clat (usec): min=733, max=6031, avg=903.71, stdev=166.68
>      lat (usec): min=734, max=6032, avg=904.88, stdev=166.84
>     clat percentiles (usec):
>      |  1.00th=[  742],  5.00th=[  750], 10.00th=[  750], 20.00th=[  758],
>      | 30.00th=[  766], 40.00th=[  791], 50.00th=[  824], 60.00th=[  873],
>      | 70.00th=[ 1037], 80.00th=[ 1106], 90.00th=[ 1139], 95.00th=[ 1188],
>      | 99.00th=[ 1254], 99.50th=[ 1287], 99.90th=[ 1401], 99.95th=[ 1500],
>      | 99.99th=[ 2474]
>    bw (  MiB/s): min=  871, max= 1270, per=100.00%, avg=1092.97, stdev=166.88, samples=120
>    iops        : min=  871, max= 1270, avg=1092.78, stdev=166.95, samples=120
>   lat (usec)   : 750=6.17%, 1000=60.30%
>   lat (msec)   : 2=33.51%, 4=0.02%, 10=0.01%
>   cpu          : usr=2.04%, sys=97.52%, ctx=2457, majf=0, minf=37
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=65543,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>    READ: bw=1092MiB/s (1145MB/s), 1092MiB/s-1092MiB/s (1145MB/s-1145MB/s), io=64.0GiB (68.7GB), run=60001-60001msec
>
> I take it that the file was being cached (because 1145 MB/s on a single
> SSD disk is hard to believe)?
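A quick note inline, since this is where the read numbers are: a drive that
reads at roughly 400 MB/s raw (your /dev/sda test further down) cannot
deliver 1145 MB/s, so yes, that read was almost certainly served from ARC.
Once primarycache is set to none as described above, a run along these
lines should show what the pool itself can do. The job name, test file
name and sizes here are only placeholders, adjust them to your layout:

# sequential read against a file on the mounted dataset, not the raw device
fio --name=seqread_zfs --ioengine=libaio --iodepth=16 --numjobs=1 \
    --rw=read --bs=1M --size=8G --time_based --runtime=60s \
    --group_reporting=1 --filename=/myvol/fio-testfile

Note there is deliberately no --direct=1, for the reason in my earlier mail
quoted below. When you are done, put the caches back the way
~/myvol-settings-save.txt shows them; on a dataset that was still at the
defaults that is:

zfs set primarycache=all mypool/myvol
zfs set secondarycache=all mypool/myvol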
> For the write test, these are the results:
>
> write_throughput_rpool: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
> fio-3.33
> Starting 1 process
> write_throughput_rpool: Laying out IO file (1 file / 1024MiB)
> Jobs: 1 (f=1): [W(1)][100.0%][w=4100KiB/s][w=4 IOPS][eta 00m:00s]
> write_throughput_rpool: (groupid=0, jobs=1): err= 0: pid=3292023: Fri Apr 5 07:36:11 2024
>   write: IOPS=5, BW=5394KiB/s (5524kB/s)(317MiB/60174msec); 0 zone resets
>     clat (msec): min=16, max=1063, avg=189.59, stdev=137.19
>      lat (msec): min=16, max=1063, avg=189.81, stdev=137.19
>     clat percentiles (msec):
>      |  1.00th=[   18],  5.00th=[   22], 10.00th=[   41], 20.00th=[   62],
>      | 30.00th=[  129], 40.00th=[  144], 50.00th=[  163], 60.00th=[  190],
>      | 70.00th=[  228], 80.00th=[  268], 90.00th=[  363], 95.00th=[  489],
>      | 99.00th=[  567], 99.50th=[  592], 99.90th=[ 1062], 99.95th=[ 1062],
>      | 99.99th=[ 1062]
>    bw (  KiB/s): min= 2048, max=41042, per=100.00%, avg=5534.14, stdev=5269.61, samples=117
>    iops        : min=    2, max=   40, avg= 5.40, stdev= 5.14, samples=117
>   lat (msec)   : 20=3.47%, 50=11.99%, 100=8.83%, 250=51.42%, 500=20.19%
>   lat (msec)   : 750=3.79%, 2000=0.32%
>   cpu          : usr=0.11%, sys=0.52%, ctx=2673, majf=0, minf=38
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,317,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   WRITE: bw=5394KiB/s (5524kB/s), 5394KiB/s-5394KiB/s (5524kB/s-5524kB/s), io=317MiB (332MB), run=60174-60174msec
>
> Still at about 6 MB/s. Should I move this discussion to the OpenZFS team,
> or is there anything else still worth testing?
>
> Thank you very much!
>
> ---
> Felix Rubio
> "Don't believe what you're told. Double check."
>
> On 2024-04-04 21:07, Jeff Johnson wrote:
> > Felix,
> >
> > Based on your previous emails about the drive, it would appear that the
> > hardware (SSD, cables, port) is fine and the drive performs.
> >
> > Go back and run your original ZFS test on your mounted ZFS volume
> > directory and remove the "--direct=1" from your command: ZFS does not
> > yet support direct I/O, and disabling buffered I/O on the ZFS directory
> > will have very negative impacts. This is a ZFS thing, not your kernel
> > or hardware.
> >
> > --Jeff
> >
> > On Thu, Apr 4, 2024 at 12:00 PM Felix Rubio <felix@xxxxxxxxx> wrote:
> >>
> >> hey Jeff,
> >>
> >> Good catch!
> >> I have run the following command:
> >>
> >> fio --name=seqread --numjobs=1 --time_based --runtime=60s --ramp_time=2s \
> >>     --iodepth=8 --ioengine=libaio --direct=1 --verify=0 --group_reporting=1 \
> >>     --bs=1M --rw=read --size=1G --filename=/dev/sda
> >>
> >> (/dev/sda and /dev/sdd are the drives I have in one of the pools), and
> >> this is what I get:
> >>
> >> seqread: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=8
> >> fio-3.33
> >> Starting 1 process
> >> Jobs: 1 (f=1): [R(1)][100.0%][r=388MiB/s][r=388 IOPS][eta 00m:00s]
> >> seqread: (groupid=0, jobs=1): err= 0: pid=2368687: Thu Apr 4 20:56:06 2024
> >>   read: IOPS=382, BW=383MiB/s (401MB/s)(22.4GiB/60020msec)
> >>     slat (usec): min=17, max=3098, avg=68.94, stdev=46.04
> >>     clat (msec): min=14, max=367, avg=20.84, stdev= 6.61
> >>      lat (msec): min=15, max=367, avg=20.91, stdev= 6.61
> >>     clat percentiles (msec):
> >>      |  1.00th=[   21],  5.00th=[   21], 10.00th=[   21], 20.00th=[   21],
> >>      | 30.00th=[   21], 40.00th=[   21], 50.00th=[   21], 60.00th=[   21],
> >>      | 70.00th=[   21], 80.00th=[   21], 90.00th=[   21], 95.00th=[   21],
> >>      | 99.00th=[   25], 99.50th=[   31], 99.90th=[   48], 99.95th=[   50],
> >>      | 99.99th=[  368]
> >>    bw (  KiB/s): min=215040, max=399360, per=100.00%, avg=392047.06, stdev=19902.89, samples=120
> >>    iops        : min=  210, max=  390, avg=382.55, stdev=19.43, samples=120
> >>   lat (msec)   : 20=0.19%, 50=99.80%, 100=0.01%, 500=0.03%
> >>   cpu          : usr=0.39%, sys=1.93%, ctx=45947, majf=0, minf=37
> >>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      issued rwts: total=22954,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> >>      latency   : target=0, window=0, percentile=100.00%, depth=8
> >>
> >> Run status group 0 (all jobs):
> >>    READ: bw=383MiB/s (401MB/s), 383MiB/s-383MiB/s (401MB/s-401MB/s), io=22.4GiB (24.1GB), run=60020-60020msec
> >>
> >> Disk stats (read/write):
> >>   sda: ios=23817/315, merge=0/0, ticks=549704/132687, in_queue=683613, util=99.93%
> >>
> >> 400 MB/s!!! This is a number I have never seen before. I understand this
> >> means I need to go back to the OpenZFS chat/forum?
> >>
> >> ---
> >> Felix Rubio
> >> "Don't believe what you're told. Double check."

--
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.johnson@xxxxxxxxxxxxxxxxx
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage