From: Boaz Harrosh <boazh@xxxxxxxxxx>

I would like to present the ZUFS file system, and the Kernel part of its
code, in this patchset.

The Kernel code presented here can be found at:
	https://github.com/NetApp/zufs-zuf

And the User-mode Server + example FSs here:
	https://github.com/NetApp/zufs-zus

ZUFS stands for Zero-copy User-mode FS
* It is geared towards true zero copy, end to end, of both data and
  metadata.
* It is geared towards very *low latency*, very high CPU locality, and
  lock-less parallelism.
* Synchronous operations (for low latency)
* NUMA awareness

Short description:
  ZUFS is a from-scratch implementation of a filesystem-in-user-space
which tries to address the above goals. From the get-go it is aimed at
pmem-based FSs, but it can easily support other types of FSs that can
utilize x10 latency and parallelism improvements.
The novelty of this project is that the interface is designed with a
modern multi-core NUMA machine in mind, down to the ABI, so as to reach
these goals.

Please see the first patch for the License of this project.

Current status:
  There are a couple of trivial open-source filesystem implementations
and a full-blown proprietary implementation from NetApp.

Together with the Kernel module submitted here, the User-mode Server and
the zusFSs User-mode plugins, this code passes NetApp QA, including
xfstests + internal QA tests, and was released to customers as
Maxdata 1.2. So it is very stable.

In the git repository above there is also a backport for RHEL 7.6,
including rpm packages for the Kernel and Server components.
(Evaluation licenses of Maxdata 1.2 are also available for developers.
 Please contact Amit Golander <Amit.Golander@xxxxxxxxxx> if you need one.)

Just to get some points across: as I said, this project is all about
performance and low latency.
Below are some results I have measured:

[fuse]
threads  wr_iops  wr_bw   wr_lat
1        33606    134424  26.53226
2        57056    228224  30.38476
3        73142    292571  35.75727
4        88667    354668  40.12783
5        102280   409122  42.13261
6        110122   440488  48.29697
7        116561   466245  53.98572
8        129134   516539  55.6134

[fuse-splice]
threads  wr_iops  wr_bw   wr_lat
1        39670    158682  21.8399
2        51100    204400  34.63294
3        62385    249542  39.28847
4        75220    300882  47.42344
5        84522    338088  52.97299
6        93042    372168  57.40804
7        97706    390825  63.04435
8        98034    392137  73.24263

[xfs-dax]
threads  wr_iops  wr_bw   wr_lat
1        19449    77799   48.03282
2        37704    150819  37.2343
3        55415    221663  30.59375
4        72285    289142  26.08636
5        90348    361392  23.89037
6        103696   414787  22.38045
7        120638   482552  21.38869
8        134157   536630  21.1426

[Maxdata-1.2-zufs]
threads  wr_iops  wr_bw   wr_lat
1        57506    230026  14.387113
2        98624    394498  16.790232
3        142276   569106  17.344622
4        187984   751936  17.527123
5        190304   761219  19.504314
6        221407   885628  20.862000
7        211579   846316  23.262040
8        246029   984116  24.630604

[*1] These good results are obtained when an mm patch is applied which
  introduces a VM_LOCAL_CPU flag that eliminates vm_zap_ptes from
  scheduling on all CPUs when creating a per-cpu VMA.
  This patch was not accepted by the Linux Kernel community and is not
  presented in this patchset. (Patch available for review on demand.)
  But a few weeks from now I will submit some incremental changes to the
  code which will return the numbers to the above, and even better for
  some benchmarks (without the mm patch).

I have used an 8-way KVM-qemu with 2 NUMA nodes, running fio with 4k
random writes, O_DIRECT | O_SYNC, to a DRAM-simulated pmem (memmap=! at
grub). The Fuse-fs was a null-FS that does a memcpy of the same 4k.
fio was then run with more and more threads (see the threads column) to
test for scalability.

We are still more than x2 slower than I would like.
(Compared to an in-kernel pmem-based FS)
But I believe I can shave off another 1-2 us by further optimizing the
app-to-server thread switch, by developing a new scheduler-object so as
to avoid going through the scheduler altogether (and its locks) when
switching VMs.
(Currently using a couple of wait_queue_head_t with wait_event() calls.
 See relay.h in the patches.)

Please review and ask any question, big or trivial. I would love to
iron out this code and submit it upstream.

Thank you for reading
Boaz

~~~~~~~~~~~~~~~~~~
Boaz Harrosh (17):
  fs: Add the ZUF filesystem to the build + License
  zuf: Preliminary Documentation
  zuf: zuf-rootfs
  zuf: zuf-core The ZTs
  zuf: Multy Devices
  zuf: mounting
  zuf: Namei and directory operations
  zuf: readdir operation
  zuf: symlink
  zuf: More file operation
  zuf: Write/Read implementation
  zuf: mmap & sync
  zuf: ioctl implementation
  zuf: xattr implementation
  zuf: ACL support
  zuf: Special IOCTL fadvise (TODO)
  zuf: Support for dynamic-debug of zusFSs

 Documentation/filesystems/zufs.txt |  351 ++++++++
 fs/Kconfig                         |    1 +
 fs/Makefile                        |    1 +
 fs/zuf/Kconfig                     |   23 +
 fs/zuf/Makefile                    |   23 +
 fs/zuf/_extern.h                   |  166 ++++
 fs/zuf/_pr.h                       |   62 ++
 fs/zuf/acl.c                       |  281 +++++++
 fs/zuf/directory.c                 |  167 ++++
 fs/zuf/file.c                      |  527 ++++++++++++
 fs/zuf/inode.c                     |  648 ++++++++++++++
 fs/zuf/ioctl.c                     |  306 +++++++
 fs/zuf/md.c                        |  761 +++++++++++++++++
 fs/zuf/md.h                        |  318 +++++++
 fs/zuf/md_def.h                    |  145 ++++
 fs/zuf/mmap.c                      |  336 ++++++++
 fs/zuf/module.c                    |   28 +
 fs/zuf/namei.c                     |  435 ++++++++++
 fs/zuf/relay.h                     |   88 ++
 fs/zuf/rw.c                        |  705 ++++++++++++++++
 fs/zuf/super.c                     |  771 +++++++++++++++++
 fs/zuf/symlink.c                   |   74 ++
 fs/zuf/t1.c                        |  138 +++
 fs/zuf/t2.c                        |  375 +++++++++
 fs/zuf/t2.h                        |   68 ++
 fs/zuf/xattr.c                     |  310 +++++++
 fs/zuf/zuf-core.c                  | 1257 ++++++++++++++++++++++++++++
 fs/zuf/zuf-root.c                  |  431 ++++++++++
 fs/zuf/zuf.h                       |  414 +++++++++
 fs/zuf/zus_api.h                   |  869 +++++++++++++++++++
 30 files changed, 10079 insertions(+)
 create mode 100644 Documentation/filesystems/zufs.txt
 create mode 100644 fs/zuf/Kconfig
 create mode 100644 fs/zuf/Makefile
 create mode 100644 fs/zuf/_extern.h
 create mode 100644 fs/zuf/_pr.h
 create mode 100644 fs/zuf/acl.c
 create mode 100644 fs/zuf/directory.c
 create mode 100644 fs/zuf/file.c
 create mode 100644 fs/zuf/inode.c
 create mode 100644 fs/zuf/ioctl.c
 create mode 100644 fs/zuf/md.c
 create mode 100644 fs/zuf/md.h
 create mode 100644 fs/zuf/md_def.h
 create mode 100644 fs/zuf/mmap.c
 create mode 100644 fs/zuf/module.c
 create mode 100644 fs/zuf/namei.c
 create mode 100644 fs/zuf/relay.h
 create mode 100644 fs/zuf/rw.c
 create mode 100644 fs/zuf/super.c
 create mode 100644 fs/zuf/symlink.c
 create mode 100644 fs/zuf/t1.c
 create mode 100644 fs/zuf/t2.c
 create mode 100644 fs/zuf/t2.h
 create mode 100644 fs/zuf/xattr.c
 create mode 100644 fs/zuf/zuf-core.c
 create mode 100644 fs/zuf/zuf-root.c
 create mode 100644 fs/zuf/zuf.h
 create mode 100644 fs/zuf/zus_api.h

-- 
2.20.1
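
For anyone who wants to compare the benchmark tables above without
eyeballing them, here is a minimal Python sketch that computes relative
throughput and thread scaling from the wr_iops columns. It is only an
illustration and not part of the patchset: the wr_iops values are copied
from the results above, while the names WR_IOPS, speedup() and scaling()
are hypothetical helpers made up for this example.

```python
# wr_iops for 1..8 threads, copied from the benchmark tables in this letter.
# WR_IOPS, speedup() and scaling() are illustrative names, not part of ZUFS.
WR_IOPS = {
    "fuse":             [33606, 57056, 73142, 88667, 102280, 110122, 116561, 129134],
    "fuse-splice":      [39670, 51100, 62385, 75220, 84522, 93042, 97706, 98034],
    "xfs-dax":          [19449, 37704, 55415, 72285, 90348, 103696, 120638, 134157],
    "Maxdata-1.2-zufs": [57506, 98624, 142276, 187984, 190304, 221407, 211579, 246029],
}


def speedup(fs, baseline, threads):
    """wr_iops of `fs` divided by wr_iops of `baseline` at `threads` threads."""
    return WR_IOPS[fs][threads - 1] / WR_IOPS[baseline][threads - 1]


def scaling(fs, threads):
    """wr_iops of `fs` at `threads` threads relative to its 1-thread result."""
    return WR_IOPS[fs][threads - 1] / WR_IOPS[fs][0]


if __name__ == "__main__":
    # How well each FS scales from 1 to 8 threads.
    for fs in WR_IOPS:
        print(f"{fs:18s} 1 -> 8 threads: {scaling(fs, 8):.2f}x")
    # ZUFS throughput relative to each of the other three setups.
    for base in ("fuse", "fuse-splice", "xfs-dax"):
        one = speedup("Maxdata-1.2-zufs", base, 1)
        eight = speedup("Maxdata-1.2-zufs", base, 8)
        print(f"zufs vs {base:12s} 1 thread: {one:.2f}x   8 threads: {eight:.2f}x")
```

The same two helpers can be pointed at the wr_lat columns by swapping in
those values, if latency ratios are of more interest than IOPS.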