On 24/02/23 04:07PM, Luis Chamberlain wrote:
> On Fri, Feb 23, 2024 at 11:41:44AM -0600, John Groves wrote:
> > This patch set introduces famfs[1] - a special-purpose fs-dax file system
> > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> > CXL-specific in any way.
> >
> > * Famfs creates a simple access method for storing and sharing data in
> >   sharable memory. The memory is exposed and accessed as memory-mappable
> >   dax files.
> > * Famfs supports multiple hosts mounting the same file system from the
> >   same memory (something existing fs-dax file systems don't do).
> > * A famfs file system can be created on either a /dev/pmem device in fs-dax
> >   mode, or a /dev/dax device in devdax mode (the latter depending on
> >   patches 2-6 of this series).
> >
> > The famfs kernel file system is part of the famfs framework; additional
> > components in user space[2] handle metadata and direct the famfs kernel
> > module to instantiate files that map to specific memory. The famfs user
> > space has documentation and a reasonably thorough test suite.
> >
> > The famfs kernel module never accesses the shared memory directly (either
> > data or metadata). Because of this, shared memory managed by the famfs
> > framework does not create a RAS "blast radius" problem that should be able
> > to crash or de-stabilize the kernel. Poison or timeouts in famfs memory
> > can be expected to kill apps via SIGBUS and cause mounts to be disabled
> > due to memory failure notifications.
> >
> > Famfs does not attempt to solve concurrency or coherency problems for apps,
> > although it does solve these problems in regard to its own data structures.
> > Apps may encounter hard concurrency problems, but there are use cases that
> > are eminently useful and uncomplicated from a concurrency perspective:
> > serial sharing is one (only one host at a time has access), and read-only
> > concurrent sharing is another (all hosts can read-cache without worry).
>
> Can you do me a favor, curious if you can run a test like this:
>
> fio -name=ten-1g-per-thread --nrfiles=10 -bs=2M -ioengine=io_uring
> -direct=1
> --group_reporting=1 --alloc-size=1048576 --filesize=1GiB
> --readwrite=write --fallocate=none --numjobs=$(nproc) --create_on_open=1
> --directory=/mnt
>
> What do you get for throughput?
>
> The larger the system and capacity, the better.
>
> Luis

Luis,

First, thanks for paying attention. I think I need to clarify a few things about famfs and then consider how that modifies your ask; apologies if some of this is obvious. Tell me whether this is still interesting given these clarifications and limitations, or whether there is something else you'd like to see tested instead. But read on - I have run the closest tests I can.

Famfs files just map to dax memory; they don't have a backing store, so the io_uring and direct=1 options don't work. The coolness is that the files and memory can be shared, and that apps can deal with files rather than having to learn new abstractions.

Famfs files are never allocate-on-write, so --fallocate=none is fine, but an actual fallocate doesn't work, and neither does --create_on_open. fio is happy, though, if I preallocate the files for the test (sketched just below).

I don't currently have custody of a really beefy system (I can get one, I just need to plan ahead). My primary dev system is a 48 HT core E5-2690 v3 @ 2.60GHz (around 10 years old). I have a 128GB dax device that is backed by ddr4 via efi_fake_mem. So I can't do 48 x 10 x 1G, but I can do 48 x 10 x 256M.
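To be concrete about the preallocation step: since famfs has no allocate-on-write, every file has to exist at full size before fio opens it. What I scripted amounts to roughly the following (a sketch rather than the literal script; it assumes fio's default $jobname.$jobnum.$filenum naming, and the 48 x 10 x 256M numbers match the run further down):

  # Rough sketch: pre-create one 256M famfs file per fio job/file slot.
  # File names are illustrative - they just need to match what fio opens
  # under --directory=/mnt/famfs.
  for job in $(seq 0 47); do
      for f in $(seq 0 9); do
          sudo famfs creat -s 256M /mnt/famfs/ten-256m-per-thread.$job.$f
      done
  done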
I ran this on ddr4-backed famfs, and on xfs backed by a sata ssd. Probably not a fair comparison, but it's what I have on a Sunday evening. I can get access to a beefy system with real cxl memory, though I can't promise I'll be able to report performance numbers from it - I will check into that. But think about what you're looking for in light of the fact that famfs is just a shared-memory file system, so no O_DIRECT or io_uring - basically just (hopefully efficient) vma fault handling and metadata distribution.

### Here is famfs. I had to drop the io_uring option and script up allocation/creation of the files (sudo famfs creat -s 256M /mnt/famfs/foo).

$ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=100MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --directory=/mnt/famfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=psync, iodepth=1
...
fio-3.33
Starting 48 processes
Jobs: 40 (f=400)
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=201738: Mon Feb 26 06:48:21 2024
  write: IOPS=15.2k, BW=29.6GiB/s (31.8GB/s)(44.7GiB/1511msec); 0 zone resets
    clat (usec): min=156, max=54645, avg=2077.40, stdev=1730.77
     lat (usec): min=171, max=54686, avg=2404.87, stdev=2056.50
    clat percentiles (usec):
     |  1.00th=[  196],  5.00th=[  243], 10.00th=[  367], 20.00th=[  644],
     | 30.00th=[  857], 40.00th=[ 1352], 50.00th=[ 1876], 60.00th=[ 2442],
     | 70.00th=[ 2868], 80.00th=[ 3228], 90.00th=[ 3884], 95.00th=[ 4555],
     | 99.00th=[ 6390], 99.50th=[ 7439], 99.90th=[16450], 99.95th=[23987],
     | 99.99th=[46924]
   bw (  MiB/s): min=21544, max=28034, per=81.80%, avg=24789.35, stdev=130.16, samples=81
   iops        : min=10756, max=14000, avg=12378.00, stdev=65.06, samples=81
  lat (usec)   : 250=5.42%, 500=9.67%, 750=8.07%, 1000=11.77%
  lat (msec)   : 2=16.87%, 4=39.59%, 10=8.37%, 20=0.17%, 50=0.07%
  lat (msec)   : 100=0.01%
  cpu          : usr=13.26%, sys=81.62%, ctx=2075, majf=0, minf=18159
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,22896,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec

$ sudo famfs fsck -h /mnt/famfs
Famfs Superblock:
  Filesystem UUID:   591f3f62-0a79-4543-9ab5-e02dc807c76c
  System UUID:       00000000-0000-0000-0000-0cc47aaaa734
  sizeof superblock: 168
  num_daxdevs:       1
  primary:           /dev/dax1.0   137438953472

Log stats:
  # of log entries in use: 480 of 25575
  Log size in use:         157488
  No allocation errors found

Capacity:
  Device capacity:     128.00G
  Bitmap capacity:     127.99G
  Sum of file sizes:   120.00G
  Allocated space:     120.00G
  Free space:          7.99G
  Space amplification: 1.00
  Percent used:        93.8%

Famfs log:
  480 of 25575 entries used
  480 files
  0 directories

### Here is the same fio command, plus --ioengine=io_uring and --direct=1. It's apples and oranges, since famfs is a memory interface and not a storage interface. This is run on an xfs file system on a SATA ssd. Note that the units here are msec, vs usec above.
fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/home/jmg/t1
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1
...
fio-3.33
Starting 48 processes
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
[ ... per-job "Laying out IO files" lines and fio progress output snipped ... ]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=210709: Mon Feb 26 07:20:51 2024
  write: IOPS=228, BW=458MiB/s (480MB/s)(114GiB/255942msec); 0 zone resets
    slat (usec): min=39, max=776, avg=186.65, stdev=49.13
    clat (msec): min=4, max=6718, avg=199.27, stdev=324.82
     lat (msec): min=4, max=6718, avg=199.45, stdev=324.82
    clat percentiles (msec):
     |  1.00th=[   30],  5.00th=[   47], 10.00th=[   60], 20.00th=[   69],
     | 30.00th=[   78], 40.00th=[   85], 50.00th=[   95], 60.00th=[  114],
     | 70.00th=[  142], 80.00th=[  194], 90.00th=[  409], 95.00th=[  810],
     | 99.00th=[ 1703], 99.50th=[ 2140], 99.90th=[ 3037], 99.95th=[ 3440],
     | 99.99th=[ 4665]
   bw (  KiB/s): min=195570, max=2422953, per=100.00%, avg=653513.53, stdev=8137.30, samples=17556
   iops        : min=   60, max= 1180, avg=314.22, stdev= 3.98, samples=17556
  lat (msec)   : 10=0.11%, 20=0.37%, 50=5.35%, 100=47.30%, 250=32.22%
  lat (msec)   : 500=6.11%, 750=2.98%, 1000=1.98%, 2000=2.97%, >=2000=0.60%
  cpu          : usr=0.10%, sys=0.01%, ctx=58709, majf=0, minf=669
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=458MiB/s (480MB/s), 458MiB/s-458MiB/s (480MB/s-480MB/s), io=114GiB (123GB), run=255942-255942msec

Disk stats (read/write):
    dm-2: ios=11/82263, merge=0/0, ticks=270/13403617, in_queue=13403887, util=97.10%, aggrios=11/152359, aggrmerge=0/5087, aggrticks=271/11493029, aggrin_queue=11494994, aggrutil=100.00%
  sdb: ios=11/152359, merge=0/5087, ticks=271/11493029, in_queue=11494994, util=100.00%

###
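One variant I have not run yet, but which is probably closer to the interesting famfs use cases above (read-only concurrent sharing across hosts), is the read side against the same preallocated famfs files. It would be something like the following - the famfs job above with only --readwrite changed; a sketch only, no numbers to report:

  fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=100MiB --readwrite=read --fallocate=none --numjobs=48 --create_on_open=0 --directory=/mnt/famfs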
Let me know what else you'd like to see tried.

Regards,
John