On 24/02/23 04:07PM, Luis Chamberlain wrote:
> On Fri, Feb 23, 2024 at 11:41:44AM -0600, John Groves wrote:
> > This patch set introduces famfs[1] - a special-purpose fs-dax file system
> > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> > CXL-specific in any way.
> >
> > * Famfs creates a simple access method for storing and sharing data in
> >   sharable memory. The memory is exposed and accessed as memory-mappable
> >   dax files.
> > * Famfs supports multiple hosts mounting the same file system from the
> >   same memory (something existing fs-dax file systems don't do).
> > * A famfs file system can be created on either a /dev/pmem device in fs-dax
> >   mode, or a /dev/dax device in devdax mode (the latter depending on
> >   patches 2-6 of this series).
> >
> > The famfs kernel file system is part of the famfs framework; additional
> > components in user space[2] handle metadata and direct the famfs kernel
> > module to instantiate files that map to specific memory. The famfs user
> > space has documentation and a reasonably thorough test suite.
> >
> > The famfs kernel module never accesses the shared memory directly (either
> > data or metadata). Because of this, shared memory managed by the famfs
> > framework does not create a RAS "blast radius" problem that should be able
> > to crash or de-stabilize the kernel. Poison or timeouts in famfs memory
> > can be expected to kill apps via SIGBUS and cause mounts to be disabled
> > due to memory failure notifications.
> >
> > Famfs does not attempt to solve concurrency or coherency problems for apps,
> > although it does solve these problems in regard to its own data structures.
> > Apps may encounter hard concurrency problems, but there are use cases that
> > are eminently useful and uncomplicated from a concurrency perspective:
> > serial sharing is one (only one host at a time has access), and read-only
> > concurrent sharing is another (all hosts can read-cache without worry).
>
> Can you do me a favor, curious if you can run a test like this:
>
> fio -name=ten-1g-per-thread --nrfiles=10 -bs=2M -ioengine=io_uring
> -direct=1
> --group_reporting=1 --alloc-size=1048576 --filesize=1GiB
> --readwrite=write --fallocate=none --numjobs=$(nproc) --create_on_open=1
> --directory=/mnt
>
> What do you get for throughput?
>
> The larger the system and capacity, the better.
>
> Luis

Luis,

First, thanks for paying attention. I think I need to clarify a few things about famfs and then consider how that modifies your ask; apologies if some of this is obvious. Tell me whether this is still interesting given these clarifications and limitations, or whether there is something else you'd like to see tested instead. But read on - I have run the closest tests I can.

Famfs files just map to dax memory; they don't have a backing store, so the io_uring and direct=1 options don't work. The coolness is that the files and memory can be shared, and that apps can deal with files rather than having to learn new abstractions.

Famfs files are never allocate-on-write, so --fallocate=none is fine, but an actual fallocate doesn't work, and neither does --create_on_open. fio is happy, though, if I preallocate the files for the test (sketched just below).

I don't currently have custody of a really beefy system (I can get one, I just need to plan ahead). My primary dev system is a 48 HT core E5-2690 v3 @ 2.60GHz (around 10 years old). I have a 128GB dax device that is backed by ddr4 via efi_fake_mem. So I can't do 48 x 10 x 1G, but I can do 48 x 10 x 256M.
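To be concrete about the preallocation step: since famfs has no allocate-on-write, every file has to exist at full size before fio opens it. What I scripted amounts to roughly the following (a sketch rather than the literal script; it assumes fio's default $jobname.$jobnum.$filenum naming, and the 48 x 10 x 256M numbers match the run further down):

  # Rough sketch: pre-create one 256M famfs file per fio job/file slot.
  # File names are illustrative - they just need to match what fio opens
  # under --directory=/mnt/famfs.
  for job in $(seq 0 47); do
      for f in $(seq 0 9); do
          sudo famfs creat -s 256M /mnt/famfs/ten-256m-per-thread.$job.$f
      done
  done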
I ran this on ddr4-backed famfs, and on xfs backed by a sata ssd. Probably not a fair comparison, but it's what I have on a Sunday evening. I can get access to a beefy system with real cxl memory, though I can't promise I'll be able to report performance numbers from it - I will check into that. But think about what you're looking for in light of the fact that famfs is just a shared-memory file system, so no O_DIRECT or io_uring - basically just (hopefully efficient) vma fault handling and metadata distribution.

### Here is famfs. I had to drop the io_uring option and script up allocation/creation of the files (sudo famfs creat -s 256M /mnt/famfs/foo).

$ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=100MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --directory=/mnt/famfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=psync, iodepth=1
...
fio-3.33
Starting 48 processes
Jobs: 40 (f=400)
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=201738: Mon Feb 26 06:48:21 2024
  write: IOPS=15.2k, BW=29.6GiB/s (31.8GB/s)(44.7GiB/1511msec); 0 zone resets
    clat (usec): min=156, max=54645, avg=2077.40, stdev=1730.77
     lat (usec): min=171, max=54686, avg=2404.87, stdev=2056.50
    clat percentiles (usec):
     |  1.00th=[  196],  5.00th=[  243], 10.00th=[  367], 20.00th=[  644],
     | 30.00th=[  857], 40.00th=[ 1352], 50.00th=[ 1876], 60.00th=[ 2442],
     | 70.00th=[ 2868], 80.00th=[ 3228], 90.00th=[ 3884], 95.00th=[ 4555],
     | 99.00th=[ 6390], 99.50th=[ 7439], 99.90th=[16450], 99.95th=[23987],
     | 99.99th=[46924]
   bw (  MiB/s): min=21544, max=28034, per=81.80%, avg=24789.35, stdev=130.16, samples=81
   iops        : min=10756, max=14000, avg=12378.00, stdev=65.06, samples=81
  lat (usec)   : 250=5.42%, 500=9.67%, 750=8.07%, 1000=11.77%
  lat (msec)   : 2=16.87%, 4=39.59%, 10=8.37%, 20=0.17%, 50=0.07%
  lat (msec)   : 100=0.01%
  cpu          : usr=13.26%, sys=81.62%, ctx=2075, majf=0, minf=18159
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,22896,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec

$ sudo famfs fsck -h /mnt/famfs
Famfs Superblock:
  Filesystem UUID:   591f3f62-0a79-4543-9ab5-e02dc807c76c
  System UUID:       00000000-0000-0000-0000-0cc47aaaa734
  sizeof superblock: 168
  num_daxdevs:       1
  primary:           /dev/dax1.0   137438953472

Log stats:
  # of log entries in use: 480 of 25575
  Log size in use:         157488
  No allocation errors found

Capacity:
  Device capacity:     128.00G
  Bitmap capacity:     127.99G
  Sum of file sizes:   120.00G
  Allocated space:     120.00G
  Free space:          7.99G
  Space amplification: 1.00
  Percent used:        93.8%

Famfs log:
  480 of 25575 entries used
  480 files
  0 directories

### Here is the same fio command, plus --ioengine=io_uring and --direct=1. It's apples and oranges, since famfs is a memory interface and not a storage interface. This is run on an xfs file system on a SATA ssd. Note that the units here are msec, vs usec above.
fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/home/jmg/t1
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1
...
fio-3.33
Starting 48 processes
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
[ ... per-job "Laying out IO files" lines and fio progress output snipped ... ]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=210709: Mon Feb 26 07:20:51 2024
  write: IOPS=228, BW=458MiB/s (480MB/s)(114GiB/255942msec); 0 zone resets
    slat (usec): min=39, max=776, avg=186.65, stdev=49.13
    clat (msec): min=4, max=6718, avg=199.27, stdev=324.82
     lat (msec): min=4, max=6718, avg=199.45, stdev=324.82
    clat percentiles (msec):
     |  1.00th=[   30],  5.00th=[   47], 10.00th=[   60], 20.00th=[   69],
     | 30.00th=[   78], 40.00th=[   85], 50.00th=[   95], 60.00th=[  114],
     | 70.00th=[  142], 80.00th=[  194], 90.00th=[  409], 95.00th=[  810],
     | 99.00th=[ 1703], 99.50th=[ 2140], 99.90th=[ 3037], 99.95th=[ 3440],
     | 99.99th=[ 4665]
   bw (  KiB/s): min=195570, max=2422953, per=100.00%, avg=653513.53, stdev=8137.30, samples=17556
   iops        : min=   60, max= 1180, avg=314.22, stdev= 3.98, samples=17556
  lat (msec)   : 10=0.11%, 20=0.37%, 50=5.35%, 100=47.30%, 250=32.22%
  lat (msec)   : 500=6.11%, 750=2.98%, 1000=1.98%, 2000=2.97%, >=2000=0.60%
  cpu          : usr=0.10%, sys=0.01%, ctx=58709, majf=0, minf=669
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=458MiB/s (480MB/s), 458MiB/s-458MiB/s (480MB/s-480MB/s), io=114GiB (123GB), run=255942-255942msec

Disk stats (read/write):
    dm-2: ios=11/82263, merge=0/0, ticks=270/13403617, in_queue=13403887, util=97.10%, aggrios=11/152359, aggrmerge=0/5087, aggrticks=271/11493029, aggrin_queue=11494994, aggrutil=100.00%
  sdb: ios=11/152359, merge=0/5087, ticks=271/11493029, in_queue=11494994, util=100.00%

###
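One variant I have not run yet, but which is probably closer to the interesting famfs use cases above (read-only concurrent sharing across hosts), is the read side against the same preallocated famfs files. It would be something like the following - the famfs job above with only --readwrite changed; a sketch only, no numbers to report:

  fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=100MiB --readwrite=read --fallocate=none --numjobs=48 --create_on_open=0 --directory=/mnt/famfs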
Let me know what else you'd like to see tried.

Regards,
John