On Wed, 16 Sep 2020, Mikulas Patocka wrote:

> On Wed, 16 Sep 2020, Dan Williams wrote:
>
> > On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> > >
> > > > My first question about nvfs is how it compares to a daxfs with
> > > > executables and other binaries configured to use page cache with the
> > > > new per-file dax facility?
> > >
> > > nvfs is faster than dax-based filesystems on metadata-heavy operations
> > > because it doesn't have the overhead of the buffer cache and bios. See
> > > this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
> >
> > ...and that metadata problem is intractable upstream? Christoph poked
> > at bypassing the block layer for xfs metadata operations [1], I just
> > have not had time to carry that further.
> >
> > [1]: "xfs: use dax_direct_access for log writes", although it seems
> > he's dropped that branch from his xfs.git

XFS is very big. I wanted to create something small.

Another difference is that XFS metadata is optimized for disks and SSDs. On a disk or SSD, reading one byte is as costly as reading a full block, so you must pack as much information into a block as possible. XFS uses b+trees for file block mapping and for directories - a reasonable decision, because b+trees minimize the number of disk accesses.

On persistent memory, each access has its own cost, so NVFS uses metadata structures that minimize the number of cache lines accessed (rather than the number of blocks accessed). For block mapping, NVFS uses the classic unix direct/indirect blocks - if a file block is mapped by a 3rd-level indirect block, we do just three memory accesses and we are done. If we used b+trees, the number of accesses would be much larger than 3, because we would have to do a binary search in each b+tree node.

The same applies to directories - NVFS hashes the file name and uses a radix-tree to locate the directory page where the directory entry is located.
XFS b+trees would result in many more accesses than the radix-tree.

Regarding journaling - NVFS doesn't do it, because persistent memory is so fast that we can simply check the filesystem in the case of a crash. NVFS has a multithreaded fsck that can process 3 million inodes per second. XFS does journaling (a reasonable decision for disks, where fsck took hours), and it adds overhead to all filesystem operations.

Mikulas