Hi,

This patch series adds DAX support to the virtiofs filesystem. This
allows bypassing the guest page cache and mapping the host page cache
directly into the guest address space.

When a page of a file is needed, the guest sends a request to map that
page (in the host page cache) into the qemu address space. Inside the
guest this is a physical memory range controlled by the virtiofs device.
The guest directly maps this physical address range using DAX and hence
gets access to file data on the host.

This can speed things up considerably in many situations. It can also
result in substantial memory savings, as file data does not have to be
copied into the guest and is accessed directly from the host page cache.

Most of the changes are limited to fuse/virtiofs. There are a couple of
changes needed in generic dax infrastructure and a couple of changes in
virtio to be able to access the shared memory region.

These patches apply on top of 5.6-rc4 and are also available here.

https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020

Any review or feedback is welcome.

Performance
===========
I have run a bunch of fio jobs to get a sense of the speed of various
operations. I wrote a simple wrapper script that runs each fio job 3
times and reports the average. These scripts and fio jobs are available
here.

https://github.com/rhvgoyal/virtiofs-tests

I set up a directory on ramfs on the host, exported that directory into
the guest using virtio-fs and ran the tests inside the guest. I ran the
tests with cache=none, both with dax enabled and disabled. The
cache=none option ensures that no caching happens in the guest, for
either data or metadata.

Test Setup
-----------
- A fedora 29 host with 376Gi RAM, 2 sockets (20 cores per socket, 2
  threads per core)

- Using ramfs on host as backing store. 4 fio files of 8G each.

- Created a VM with 64 VCPUS and 64GB memory. A 64GB cache window (for
  dax mmap).

Test Results
------------
- Results in two configurations have been reported: virtio-fs
  (cache=none) and virtio-fs (cache=none + dax).
  There are other caching modes as well, but to me cache=none seemed
  the most interesting for now because it does not cache anything in
  the guest and provides strong coherence. Other modes, which provide
  weaker coherence and hence are faster, are yet to be benchmarked.

- Three fio ioengines, psync, libaio and mmap, have been used.

- I/O workloads of randread, randwrite, seqread and seqwrite have been
  run.

- Each file size is 8G. Block size 4K. iodepth=16

- "multi" means the same operation was done with 4 jobs, each job
  operating on a file of size 8G.

- Some results are "0 (KiB/s)". That means that particular operation is
  not supported in that configuration.

NAME                    I/O Operation           BW(Read/Write)

virtiofs-cache-none     seqread-psync           35(MiB/s)
virtiofs-cache-none-dax seqread-psync           643(MiB/s)

virtiofs-cache-none     seqread-psync-multi     219(MiB/s)
virtiofs-cache-none-dax seqread-psync-multi     2132(MiB/s)

virtiofs-cache-none     seqread-mmap            0(KiB/s)
virtiofs-cache-none-dax seqread-mmap            741(MiB/s)

virtiofs-cache-none     seqread-mmap-multi      0(KiB/s)
virtiofs-cache-none-dax seqread-mmap-multi      2530(MiB/s)

virtiofs-cache-none     seqread-libaio          293(MiB/s)
virtiofs-cache-none-dax seqread-libaio          425(MiB/s)

virtiofs-cache-none     seqread-libaio-multi    207(MiB/s)
virtiofs-cache-none-dax seqread-libaio-multi    1543(MiB/s)

virtiofs-cache-none     randread-psync          36(MiB/s)
virtiofs-cache-none-dax randread-psync          572(MiB/s)

virtiofs-cache-none     randread-psync-multi    211(MiB/s)
virtiofs-cache-none-dax randread-psync-multi    1764(MiB/s)

virtiofs-cache-none     randread-mmap           0(KiB/s)
virtiofs-cache-none-dax randread-mmap           719(MiB/s)

virtiofs-cache-none     randread-mmap-multi     0(KiB/s)
virtiofs-cache-none-dax randread-mmap-multi     2005(MiB/s)

virtiofs-cache-none     randread-libaio         300(MiB/s)
virtiofs-cache-none-dax randread-libaio         413(MiB/s)

virtiofs-cache-none     randread-libaio-multi   327(MiB/s)
virtiofs-cache-none-dax randread-libaio-multi   1326(MiB/s)

virtiofs-cache-none     seqwrite-psync          34(MiB/s)
virtiofs-cache-none-dax seqwrite-psync          494(MiB/s)

virtiofs-cache-none     seqwrite-psync-multi    223(MiB/s)
virtiofs-cache-none-dax seqwrite-psync-multi    1680(MiB/s)

virtiofs-cache-none     seqwrite-mmap           0(KiB/s)
virtiofs-cache-none-dax seqwrite-mmap           1217(MiB/s)

virtiofs-cache-none     seqwrite-mmap-multi     0(KiB/s)
virtiofs-cache-none-dax seqwrite-mmap-multi     2359(MiB/s)

virtiofs-cache-none     seqwrite-libaio         282(MiB/s)
virtiofs-cache-none-dax seqwrite-libaio         348(MiB/s)

virtiofs-cache-none     seqwrite-libaio-multi   320(MiB/s)
virtiofs-cache-none-dax seqwrite-libaio-multi   1255(MiB/s)

virtiofs-cache-none     randwrite-psync         32(MiB/s)
virtiofs-cache-none-dax randwrite-psync         458(MiB/s)

virtiofs-cache-none     randwrite-psync-multi   213(MiB/s)
virtiofs-cache-none-dax randwrite-psync-multi   1343(MiB/s)

virtiofs-cache-none     randwrite-mmap          0(KiB/s)
virtiofs-cache-none-dax randwrite-mmap          663(MiB/s)

virtiofs-cache-none     randwrite-mmap-multi    0(KiB/s)
virtiofs-cache-none-dax randwrite-mmap-multi    1820(MiB/s)

virtiofs-cache-none     randwrite-libaio        292(MiB/s)
virtiofs-cache-none-dax randwrite-libaio        341(MiB/s)

virtiofs-cache-none     randwrite-libaio-multi  322(MiB/s)
virtiofs-cache-none-dax randwrite-libaio-multi  1094(MiB/s)

Conclusion
===========
- virtio-fs with dax enabled is significantly faster and more memory
  efficient than non-dax operation.

Note:
  Right now the dax window is 64G and the total fio file size is at most
  32G (4 files of 8G each). That means everything fits into the dax
  window and no reclaim is needed. Dax window reclaim logic is slower,
  and if the file size is bigger than the dax window size, performance
  slows down.
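
To put the table in perspective, here is a minimal sketch that computes
the dax speedup for a handful of rows. It is not part of the test
scripts linked above. The bandwidth values are copied from the table;
the names results, speedup and summarize are made up for illustration.

```python
# Bandwidth in MiB/s, copied from the results table:
# operation -> (cache=none, cache=none + dax)
# Rows reported as 0(KiB/s) (mmap without dax) are left out because a
# ratio against an unsupported operation is meaningless.
results = {
    "seqread-psync": (35, 643),
    "seqread-psync-multi": (219, 2132),
    "seqread-libaio": (293, 425),
    "randread-psync": (36, 572),
    "seqwrite-psync": (34, 494),
    "randwrite-psync": (32, 458),
    "randwrite-libaio-multi": (322, 1094),
}


def speedup(no_dax, dax):
    """Return dax bandwidth as a multiple of non-dax bandwidth."""
    return dax / no_dax


def summarize(table):
    """Map each operation name to its dax/non-dax bandwidth ratio."""
    return {name: speedup(*bw) for name, bw in table.items()}


if __name__ == "__main__":
    for name, ratio in summarize(results).items():
        print(f"{name:<24} {ratio:5.1f}x")
```

For the rows chosen here the single-job psync cases show the largest
ratios, while single-job libaio shows a much smaller one.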

Thanks
Vivek

Sebastien Boeuf (3):
  virtio: Add get_shm_region method
  virtio: Implement get_shm_region for PCI transport
  virtio: Implement get_shm_region for MMIO transport

Stefan Hajnoczi (2):
  virtio_fs, dax: Set up virtio_fs dax_device
  fuse,dax: add DAX mmap support

Vivek Goyal (15):
  dax: Modify bdev_dax_pgoff() to handle NULL bdev
  dax: Create a range version of dax_layout_busy_page()
  virtiofs: Provide a helper function for virtqueue initialization
  fuse: Get rid of no_mount_options
  fuse,virtiofs: Add a mount option to enable dax
  fuse,virtiofs: Keep a list of free dax memory ranges
  fuse: implement FUSE_INIT map_alignment field
  fuse: Introduce setupmapping/removemapping commands
  fuse, dax: Implement dax read/write operations
  fuse, dax: Take ->i_mmap_sem lock during dax page fault
  fuse,virtiofs: Define dax address space operations
  fuse,virtiofs: Maintain a list of busy elements
  fuse: Release file in process context
  fuse: Take inode lock for dax inode truncation
  fuse,virtiofs: Add logic to free up a memory range

 drivers/dax/super.c                |    3 +-
 drivers/virtio/virtio_mmio.c       |   32 +
 drivers/virtio/virtio_pci_modern.c |  107 +++
 fs/dax.c                           |   66 +-
 fs/fuse/dir.c                      |    2 +
 fs/fuse/file.c                     | 1162 +++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h                   |  109 ++-
 fs/fuse/inode.c                    |  148 +++-
 fs/fuse/virtio_fs.c                |  250 +++++-
 include/linux/dax.h                |    6 +
 include/linux/virtio_config.h      |   17 +
 include/uapi/linux/fuse.h          |   42 +-
 include/uapi/linux/virtio_fs.h     |    3 +
 include/uapi/linux/virtio_mmio.h   |   11 +
 include/uapi/linux/virtio_pci.h    |   11 +-
 15 files changed, 1888 insertions(+), 81 deletions(-)

--
2.20.1