> On Wed, May 10, 2017 at 09:26:00PM +0530, Pankaj Gupta wrote:
> > We are sharing initial project proposal for
> > 'KVM "fake DAX" device flushing' project for feedback.
> > Got the idea during discussion with 'Rik van Riel'.
>
> CCing NVDIMM folks.
>
> >
> > Also, request answers to 'Questions' section.
> >
> > Abstract :
> > ----------
> > Project idea is to use fake persistent memory with direct
> > access(DAX) in virtual machines. Overall goal of project
> > is to increase the number of virtual machines that can be
> > run on a physical machine, in order to increase the density
> > of customer virtual machines.
> >
> > The idea is to avoid the guest page cache, and minimize the
> > memory footprint of virtual machines. By presenting a disk
> > image as a nvdimm direct access (DAX) memory region in a
> > virtual machine, the guest OS can avoid using page cache
> > memory for most file accesses.
> >
> > Problem Statement :
> > ------------------
> > * Guest uses page cache in memory to process fast requests
> >   for disk read/write. This results in big memory footprint
> >   of guests without host knowing much details of the guest
> >   memory.
> >
> > * If guests use direct access(DAX) with fake persistent
> >   storage, the host manages the page cache for guests,
> >   allowing the host to easily reclaim/evict less frequently
> >   used page cache pages without requiring guest cooperation,
> >   like ballooning would.
> >
> > * Host manages guest cache as 'mmapped' disk image area in
> >   qemu address space. This region is passed to guest as fake
> >   persistent memory range. We need a new flushing interface
> >   to flush this cache to secondary storage to persist guest
> >   writes.
> >
> > * New asynchronous flushing interface will allow guests to
> >   cause the host to flush the dirty data to the backing storage
> >   file. Systems with pmem storage make use of CLFLUSH instruction
> >   to flush single cache line to persistent storage and it
> >   takes care of flushing.
> >   With fake persistent storage in guest we cannot depend on
> >   CLFLUSH instruction to flush entire dirty cache to backing
> >   storage. Even if we trap and emulate CLFLUSH instruction,
> >   guest vCPU has to wait till we flush all the dirty memory.
> >   Instead of this we need to implement a new asynchronous guest
> >   flushing interface, which allows the guest to specify a larger
> >   range to be flushed at once, and allows the vCPU to run
> >   something else while the data is being synced to disk.
> >
> > * New flushing interface will consist of a para virt driver to
> >   new fake nvdimm like device which will process guest flushing
> >   requests like fsync/msync etc instead of pmem library calls
> >   like clflush. The corresponding device at host side will be
> >   responsible for flushing requests for guest dirty pages.
> >   Guest can put current task in sleep and vCPU can run any other
> >   task while host side flushing of guest pages is in progress.
> >
> > Host controlled fake nvdimm DAX to avoid guest page cache :
> > -------------------------------------------------------------
> > * Bypass guest page cache by using a fake persistent storage
> >   like nvdimm & DAX. Guest Read/Write is directly done on
> >   fake persistent storage without involving guest kernel for
> >   caching data.
> >
> > * Fake nvdimm device passed to guest is backed by a regular
> >   file in host stored in secondary storage.
> >
> > * Qemu has implementation of fake NVDIMM/DAX device. Use this
> >   capability of passing regular host file(disk) as nvdimm device
> >   to guest.
> >
> > * Nvdimm with DAX works for ext4/xfs filesystem. Supported
> >   filesystem should be DAX compatible.
> >
> > * As we are using guest disk as fake DAX/NVDIMM device, we
> >   need a mechanism for persistence of data backed on regular
> >   host storage file.
> >
> > * For live migration use case, if host side backing file is
> >   shared storage, we need to flush the page cache for the disk
> >   image at the destination (new fadvise interface, FADV_INVALIDATE_CACHE?)
> >   before starting execution of the guest on the destination host.
>
> Good point. QEMU currently only supports live migration with O_DIRECT.
> I think the problem was that userspace cannot guarantee consistency in
> the general case. If you find a solution to this problem for fake
> NVDIMM then maybe the QEMU block layer can also begin supporting live
> migration with buffered I/O.
>
> >
> > Design :
> > ---------
> > * In order to not have page cache inside the guest, qemu would:
> >
> >   1) mmap the guest's disk image and present that disk image to
> >      the guest as a persistent memory range.
> >
> >   2) Present information to the guest telling it that the persistent
> >      memory range is not physical persistent memory.
>
> Steps 1 & 2 are already supported by QEMU NVDIMM emulation today.

Yes. I have also tested guest 'fake DAX' device using QEMU NVDIMM emulation.

>
> >   3) Present an additional paravirt device alongside the persistent
> >      memory range, that can be used to sync (ranges of) data to disk.
> >
> > * Guest would use the disk image mostly like a persistent memory
> >   device, with two exceptions:
> >
> >   1) It would not tell userspace that the files on that device are
> >      persistent memory. This is done so userspace knows to call
> >      fsync/msync, instead of the pmem clflush library call.
>
> Not sure I agree with hiding the nvdimm nature of the device. Instead I
> think you need to build this capability into the Linux nvdimm code.
> libpmem will detect these types of devices and issue fsync/msync when
> the application wants to flush.
>
> >   2) When userspace calls fsync/msync on files on the fake persistent
> >      memory device, issue a request through the paravirt device that
> >      causes the host to flush the device back end.
> >
> > * When the guest uses fake persistent storage, data updates can
> >   still be in qemu memory. We need a way to flush cached data in
> >   host to the backing
>
> s/qemu memory/host memory/
>
> I guess you mean that host userspace needs a way to reliably flush an
> address range to the underlying storage.

Right.

>
> >   secondary storage.
> >
> > * Once the guest receives a completion event from the host, it will
> >   allow userspace programs that were waiting on the fsync/msync to
> >   continue running.
> >
> > * Host is responsible for paging in pages in host backing area for
> >   guest persistent memory as they are accessed by the guest, and
> >   for evicting pages as host memory fills up.
> >
> > Questions :
> > -----------
> > * What should the flushing interface between guest and host look
> >   like?
>
> A simple hack for prototyping is to instantiate a virtio-blk-pci for
> the mmapped host file. The guest can send flush commands on the
> virtio-blk-pci device but will otherwise use the mapped memory directly.

Okay, I will check this.

>
> > * Any suggestions to hook the IO caching code with KVM/Qemu or
> >   thoughts on how we should do it?
> >
> > * Thinking of implementing a guest para virt driver which will send
> >   guest requests to Qemu to flush data to disk. Not sure at this
> >   point how to tell userspace to work on this device as any regular
> >   device without considering it as persistent device. Any suggestions
> >   on this?
> >
> > * Not thought yet about ballooning impact. But feel this solution
> >   could be better than ballooning in long term? As we will be
> >   managing all guests cache from host side.
> >
> > * Not sure this solution works for ARM and other architectures and
> >   Windows?
>
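
To make the "host userspace needs a way to reliably flush an address
range to the underlying storage" point a bit more concrete, here is a
rough userspace sketch of what the host side could do for one guest
flush request. It is only an illustration in Python, not the proposed
QEMU code: the names page_align(), flush_range() and demo() are made up
for this mail, the temporary file stands in for the guest disk image,
and the real device model would do the equivalent in C. On Linux,
mmap.flush() ends up calling msync() on the given range, which needs a
page-aligned start offset, so the guest's byte range is widened to page
boundaries first.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def page_align(offset, length):
    """Widen [offset, offset + length) to whole pages (msync needs this)."""
    start = (offset // PAGE) * PAGE
    end = -(-(offset + length) // PAGE) * PAGE  # round up to next page
    return start, end - start


def flush_range(mm, offset, length):
    """Hypothetical host-side handler for a single guest flush request."""
    start, size = page_align(offset, length)
    size = min(size, len(mm) - start)  # do not run past the mapping
    mm.flush(start, size)              # msync() on the shared mapping
    return start, size


def demo():
    # The temporary file plays the role of the host backing file.
    with tempfile.TemporaryFile() as backing:
        backing.truncate(4 * PAGE)
        mm = mmap.mmap(backing.fileno(), 4 * PAGE)  # shared, read/write
        try:
            # "Guest" store straight into the mapped region.
            mm[PAGE + 10:PAGE + 15] = b"guest"
            # Guest asks for just the bytes it dirtied to be flushed.
            flushed = flush_range(mm, PAGE + 10, 5)
            # Read back through the file descriptor, not the mapping.
            data = os.pread(backing.fileno(), 5, PAGE + 10)
        finally:
            mm.close()
    return flushed, data


if __name__ == "__main__":
    print(demo())
```

This only covers the synchronous part. In the proposal the request would
arrive through the paravirt device and the completion would be signalled
back asynchronously, so the vCPU is not blocked while msync() runs.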