> > > > Hello,
> > > >
> > > > We shared a proposal for 'KVM fake DAX flushing interface'.
> > > >
> > > > https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg02478.html
> > > >
> > > > In above link,
>
> "Overall goal of project
> is to increase the number of virtual machines that can be
> run on a physical machine, in order to *increase the density*
> of customer virtual machines"
>
> Is the fake persistent memory used as normal RAM in the guest? If not,
> how is it expected to be used in the guest?

Yes, the guest will have an nvdimm DAX device and will not use the page
cache for most operations. The host will manage the memory requirements
of all the guests.

> > We did an initial POC in which we used a 'virtio-blk' device to
> > perform a device flush on pmem fsync on an ext4 filesystem. There are
> > a few hacks to make things work. We need suggestions on the points
> > below before we start the actual implementation.
> >
> > A] Problems to solve:
> > ---------------------
> >
> > 1] We are considering two approaches for the 'fake DAX flushing
> >    interface'.
> >
> > 1.1] fake dax with NVDIMM flush hints & KVM async page fault
> >
> >    - Existing interface.
> >
> >    - The approach of using the flush hint address has already been
> >      nacked upstream.
> >
> >    - The flush hint is not a queued interface for flushing;
> >      applications might avoid using it.
> >
> >    - A flush hint address traps from guest to host and does an entire
> >      fsync on the backing file, which is itself costly.
> >
> >    - Can be used to flush specific pages on the host backing disk. We
> >      can send data (page information) up to the cache-line size
> >      (a limitation) and tell the host to sync the corresponding pages
> >      instead of syncing the entire disk.
> >
> >    - This will be an asynchronous operation and vCPU control is
> >      returned quickly.
> >
> > 1.2] Using an additional paravirt device in addition to the pmem
> >      device (fake dax with device flush)
> >
> >    - New interface.
> >
> >    - The guest maintains information about DAX dirty pages as
> >      exceptional entries in the radix tree.
> >
> >    - If we want to flush specific pages from guest to host, we need
> >      to send a list of the dirty pages corresponding to the file on
> >      which we are doing fsync.
> >
> >    - This will require the implementation of a new interface: a new
> >      paravirt device for sending flush requests.
> >
> >    - The host side will perform fsync/fdatasync on the list of dirty
> >      pages or on the entire file backing the block device.
> >
> > 2] Questions:
> > -------------
> >
> > 2.1] Not sure why the WPQ flush is not a queued interface? Can we
> >      force applications to call it? Device DAX calls neither fsync
> >      nor msync?
> >
> > 2.2] Depending upon the interface we decide on, we need an optimal
> >      solution to sync a range of pages:
> >
> >      - Send a range of pages from guest to host to sync
> >        asynchronously instead of syncing the entire block device?
> >        e.g. a new virtio device to deliver sync requests to the host?
> >
> >      - The other option is to sync the entire disk backing file to
> >        make sure all the writes are persistent. In our case, the
> >        backing file is a regular file on a non-NVDIMM device, so the
> >        host page cache has the list of dirty pages, which can be used
> >        either with fsync or a similar interface.
>
> As the amount of dirty pages can vary, the latency of each host fsync
> is likely to vary over a large range.

> > 2.3] If we do a host fsync on the entire disk, we will be flushing
> >      all the dirty data to the backend file. Just thinking about
> >      which would be the better approach: flushing pages on the
> >      corresponding guest file's fsync, or flushing the entire block
> >      device?
> >
> > 2.4] If we decide to choose one of the above approaches, we need to
> >      consider all DAX-supporting filesystems (ext4/xfs). Does hooking
> >      code into the corresponding fsync code of the filesystem seem
> >      reasonable? Just thinking about the flush hint address use-case,
> >      or how flush hint addresses would be invoked with fsync or a
> >      similar API.
> >
> > 2.5] Also, with filesystem journalling and other mount options like
> >      barriers, ordered etc., how do we decide whether to use the page
> >      flush hint or a regular fsync on the file?
> >
> > 2.6] If at the guest side we have the PFNs of all the dirty pages in
> >      the radix tree, and we send these to the host, would the host
> >      side be able to find the corresponding pages and flush them all?
>
> That may require that the host file system provide an API to flush
> specified blocks/extents and their metadata in the file system. I'm
> not familiar with this part and don't know whether such an API exists.
>
> Haozhong