On Tue, Sep 15, 2020 at 5:35 AM Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
>
> Hi
>
> I am developing a new filesystem suitable for persistent memory - nvfs.

Nice!

> The goal is to have a small and fast filesystem that can be used on
> DAX-based devices. Nvfs maps the whole device into linear address space
> and it completely bypasses the overhead of the block layer and buffer
> cache.

So does device-dax, but device-dax lacks read(2)/write(2).

> In the past, there was nova filesystem for pmem, but it was abandoned a
> year ago (the last version is for the kernel 5.1 -
> https://github.com/NVSL/linux-nova ). Nvfs is smaller and performs better.
>
> The design of nvfs is similar to ext2/ext4, so that it fits into the VFS
> layer naturally, without too much glue code.
>
> I'd like to ask you to review it.
>
>
> tarballs:
>         http://people.redhat.com/~mpatocka/nvfs/
> git:
>         git://leontynka.twibright.com/nvfs.git
> the description of filesystem internals:
>         http://people.redhat.com/~mpatocka/nvfs/INTERNALS
> benchmarks:
>         http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
>
>
> TODO:
>
> - programs run approximately 4% slower when running from Optane-based
> persistent memory. Therefore, programs and libraries should use page cache
> and not DAX mapping.

This needs to be based on platform firmware data (ACPI HMAT) for the
relative performance of a PMEM range vs DRAM. For example, this
tradeoff should not exist with battery backed DRAM, or virtio-pmem.

> - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses
> buffer cache for the mapping. The buffer cache slows down fsck by a factor
> of 5 to 10. Could it be possible to change the kernel so that it maps DAX
> based block devices directly?

We've been down this path before.

5a023cdba50c block: enable dax for raw block devices
9f4736fe7ca8 block: revert runtime dax control of the raw block device
acc93d30d7d4 Revert "block: enable dax for raw block devices"

EXT2/4 metadata buffer management depends on the page cache and we
eliminated a class of bugs by removing that support. The problems are
likely tractable, but there was not a straightforward fix visible at
the time.

> - __copy_from_user_inatomic_nocache doesn't flush cache for leading and
> trailing bytes.

You want copy_user_flushcache(). See how fs/dax.c arranges for
dax_copy_from_iter() to route to pmem_copy_from_iter().
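
To make the leading/trailing-bytes issue concrete, here is a minimal
userspace sketch, not kernel code. It assumes a copy routine that uses
non-temporal stores only for naturally aligned 8-byte words and ordinary
cached stores for whatever is left at either end, and it assumes 64-byte
cache lines. plan_flush() and line_of() are hypothetical names made up
for this illustration. The sketch only computes which cache lines would
still need an explicit write-back after such a copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE 64u
/* Assumption: only naturally aligned 8-byte words are written with
 * non-temporal stores; all other bytes go through the CPU cache. */
#define NT_WORD 8u

/* Base address of the cache line that contains addr. */
static inline uintptr_t line_of(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE - 1);
}

/*
 * plan_flush() is a hypothetical helper, not a kernel API.
 * For a copy of `size` bytes to `dst`, store the base addresses of the
 * cache lines that hold cached (not non-temporal) stores into lines[]
 * and return how many entries were written (0, 1 or 2).
 */
int plan_flush(uintptr_t dst, size_t size, uintptr_t lines[2])
{
    int n = 0;

    assert(lines != NULL);
    if (size == 0)
        return 0;

    /* Leading bytes before the first aligned word used cached stores. */
    if (dst % NT_WORD != 0)
        lines[n++] = line_of(dst);

    /* Trailing bytes after the last aligned word used cached stores. */
    if ((dst + size) % NT_WORD != 0) {
        uintptr_t tail = line_of(dst + size - 1);

        /* Skip the tail line if it is the same as the head line. */
        if (n == 0 || lines[0] != tail)
            lines[n++] = tail;
    }
    return n;
}
```

Under these assumptions an aligned 64-byte copy needs no extra flush,
while the same copy shifted by 3 bytes leaves dirty data in both the
first and the last cache line it touches. A flushing copy routine has
to write those lines back itself before the data can be considered
persistent.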