Hi Dave!

On 24/02/29 01:15PM, Dave Chinner wrote:
> On Mon, Feb 26, 2024 at 08:05:58PM -0600, John Groves wrote:
> > On 24/02/26 04:58PM, Luis Chamberlain wrote:
> > > On Mon, Feb 26, 2024 at 1:16 PM John Groves <John@xxxxxxxxxx> wrote:
> > > >
> > > > On 24/02/26 07:53AM, Luis Chamberlain wrote:
> > > > > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> > > > > > Run status group 0 (all jobs):
> > > > > >   WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
> > > > >
> > > > > > This is run on an xfs file system on a SATA ssd.
> > > > >
> > > > > To compare more closer apples to apples, wouldn't it make more sense
> > > > > to try this with XFS on pmem (with fio -direct=1)?
> > > > >
> > > > >   Luis
> > > >
> > > > Makes sense. Here is the same command line I used with xfs before, but
> > > > now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem
> > > > because xfs requires that.
> > > >
> > > > fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
> > >
> > > Could you try with mkfs.xfs -d agcount=1024
>
> Won't change anything for the better, may make things worse.

I dropped that arg, though performance looked about the same either way.

> >    bw (  MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719
> >    iops        : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719
> >   lat (usec)   : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02%
> >   lat (usec)   : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75%
> >   lat (msec)   : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25%
>
> Most of the IO latencies are up round the 4-20ms marks. That seems
> kinda high for a 2MB IO. With a memcpy speed of 10GB/s, the 2MB
> should only take a couple of hundred microseconds. For Famfs, the
> latencies appear to be around 1-4ms.
>
> So where's all that extra time coming from?

Below, you will see two runs with performance and latency distribution
about the same as famfs (the answer for that was --fallocate=native).

> >   lat (msec)   : 100=0.08%
> >   cpu          : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511
>
> And why is system time reporting at almost zero instead of almost
> all the remaining cpu time (i.e. up at 80-90%)?

Something weird is going on with the cpu reporting. Sometimes sys=~0, but
other times it's about what you would expect. I suspect some sort of
measurement error, like maybe the method doesn't work with my cpu model?
(I'm grasping, but with a somewhat rational basis...) I pasted two xfs
runs below. The first has the wonky cpu sys value, and the second looks
about like what one would expect.

> Can you run call-graph kernel profiles for XFS and famfs whilst
> running this workload so we have some insight into what is behaving
> differently here?

Can you point me to an example of how to do that?

> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

I'd been thinking about the ~2x gap for a few days, and the most obvious
difference is that famfs files must be preallocated (like fallocate, but it
works a bit differently since allocation happens in user space). I just
checked one of the xfs files, and it had maybe 80 extents (whereas the
famfs files always have 1 extent here).

FWIW I ran xfs with and without io_uring, and there was no apparent
difference (which makes sense to me because it's not block I/O).
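(For what it's worth, extent counts like the ones above can be checked with
either of the commands below. The path is only illustrative of fio's default
jobname.jobnum.filenum naming, not necessarily the exact file I looked at.)

    filefrag -v /mnt/xfs/ten-256m-per-thread.0.0   # lists each extent (FIEMAP)
    xfs_bmap -v /mnt/xfs/ten-256m-per-thread.0.0   # same extent map, from the XFS side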
The prior ~2x gap still seems like a lot of overhead for extent list mapping
to memory, but adding --fallocate=native to the xfs test brought it into line
with famfs:

+ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=native --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1
...
fio-3.33
Starting 48 processes
Jobs: 38 (f=380): [W(5),_(1),W(12),_(1),W(3),_(1),W(2),_(1),W(2),_(1),W(1),_(1),W(1),_(1),W(6),_(1),W(6),_(2)][57.1%][w=28.0GiB/s][w=14.3k IOPS][eta 00m:03s]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=1452590: Thu Feb 29 07:46:06 2024
  write: IOPS=15.3k, BW=29.8GiB/s (32.0GB/s)(114GiB/3838msec); 0 zone resets
    slat (usec): min=17, max=55364, avg=668.20, stdev=1120.41
    clat (nsec): min=1368, max=99619k, avg=1982477.32, stdev=2198309.32
     lat (usec): min=179, max=99813, avg=2650.68, stdev=2485.15
    clat percentiles (usec):
     |  1.00th=[     4],  5.00th=[    14], 10.00th=[   172], 20.00th=[   420],
     | 30.00th=[   644], 40.00th=[  1057], 50.00th=[  1582], 60.00th=[  2008],
     | 70.00th=[  2343], 80.00th=[  3097], 90.00th=[  4555], 95.00th=[  5473],
     | 99.00th=[  8717], 99.50th=[ 11863], 99.90th=[ 20055], 99.95th=[ 27657],
     | 99.99th=[ 49546]
   bw (  MiB/s): min=20095, max=59216, per=100.00%, avg=35985.47, stdev=318.61, samples=280
   iops        : min=10031, max=29587, avg=17970.76, stdev=159.29, samples=280
  lat (usec)   : 2=0.06%, 4=1.02%, 10=2.33%, 20=4.29%, 50=1.85%
  lat (usec)   : 100=0.20%, 250=3.26%, 500=11.23%, 750=8.87%, 1000=5.82%
  lat (msec)   : 2=20.95%, 4=26.74%, 10=12.60%, 20=0.66%, 50=0.09%
  lat (msec)   : 100=0.01%
  cpu          : usr=15.48%, sys=1.17%, ctx=62654, majf=0, minf=22801
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=29.8GiB/s (32.0GB/s), 29.8GiB/s-29.8GiB/s (32.0GB/s-32.0GB/s), io=114GiB (123GB), run=3838-3838msec

Disk stats (read/write):
  pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
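(A possible cross-check on the wonky cpu numbers -- I haven't run this, it's
just a sketch: wrap the same fio command with GNU time so the user/sys split
comes from the kernel's process accounting rather than fio's own reporting.)

    /usr/bin/time -v fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M \
        --group_reporting=1 --alloc-size=1048576 --filesize=256MiB \
        --readwrite=write --fallocate=native --numjobs=48 --create_on_open=0 \
        --ioengine=io_uring --direct=1 --directory=/mnt/xfs

Since fio waits for its job processes, their cpu time should roll up into GNU
time's "User time" / "System time" lines for the whole run.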
## Here is a run where the cpu looks "normal"
+ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=native --numjobs=48 --create_on_open=0 --direct=1 --directory=/mnt/xfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=psync, iodepth=1
...
fio-3.33
Starting 48 processes
Jobs: 19 (f=190): [W(2),_(1),W(2),_(8),W(1),_(3),W(1),_(1),W(2),_(2),W(1),_(1),W(3),_(2),W(1),_(1),W(1),_(2),W(2),_(7),W(3),_(1)][55.6%][w=26.7GiB/s][w=13.6k IOPS][eta 00m:04s]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=1463615: Thu Feb 29 08:19:53 2024
  write: IOPS=12.4k, BW=24.1GiB/s (25.9GB/s)(114GiB/4736msec); 0 zone resets
    clat (usec): min=138, max=117903, avg=2581.99, stdev=2704.61
     lat (usec): min=152, max=120405, avg=3019.04, stdev=2964.47
    clat percentiles (usec):
     |  1.00th=[   161],  5.00th=[   249], 10.00th=[   627], 20.00th=[  1270],
     | 30.00th=[  1631], 40.00th=[  1942], 50.00th=[  2089], 60.00th=[  2212],
     | 70.00th=[  2343], 80.00th=[  2704], 90.00th=[  5866], 95.00th=[  6849],
     | 99.00th=[ 12387], 99.50th=[ 14353], 99.90th=[ 26084], 99.95th=[ 38536],
     | 99.99th=[ 78119]
   bw (  MiB/s): min=21204, max=47040, per=100.00%, avg=29005.40, stdev=237.31, samples=329
   iops        : min=10577, max=23497, avg=14479.74, stdev=118.65, samples=329
  lat (usec)   : 250=5.04%, 500=4.03%, 750=2.37%, 1000=3.13%
  lat (msec)   : 2=29.39%, 4=41.05%, 10=13.37%, 20=1.45%, 50=0.15%
  lat (msec)   : 100=0.03%, 250=0.01%
  cpu          : usr=14.43%, sys=78.18%, ctx=5272, majf=0, minf=15708
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=24.1GiB/s (25.9GB/s), 24.1GiB/s-24.1GiB/s (25.9GB/s-25.9GB/s), io=114GiB (123GB), run=4736-4736msec

Disk stats (read/write):
  pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

Cheers,
John
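P.S. On the call-graph profiles: if something along these lines is what you
have in mind (just my guess -- corrections welcome), I can capture it for
both XFS and famfs:

    perf record -g -a -o xfs.perf.data -- fio <same fio command as above>
    perf report -i xfs.perf.data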