> Subject: Re: KVM "fake DAX" flushing interface - discussion
>
> On Mon 24-07-17 08:06:07, Pankaj Gupta wrote:
> >
> > > On Sun 23-07-17 13:10:34, Dan Williams wrote:
> > > > On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@xxxxxxxxxx> wrote:
> > > > > On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
> > > > >> [ adding Ross and Jan ]
> > > > >>
> > > > >> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@xxxxxxxxxx> wrote:
> > > > >> >
> > > > >> > The goal is to increase density of guests, by moving page
> > > > >> > cache into the host (where it can be easily reclaimed).
> > > > >> >
> > > > >> > If we assume the guests will be backed by relatively fast
> > > > >> > SSDs, a "whole device flush" from filesystem journaling
> > > > >> > code (issued where the filesystem issues a barrier or
> > > > >> > disk cache flush today) may be just what we need to make
> > > > >> > that work.
> > > > >>
> > > > >> Ok, apologies, I indeed had some pieces of the proposal confused.
> > > > >>
> > > > >> However, it still seems like the storage interface is not capable of
> > > > >> expressing what is needed, because the operation that is needed is a
> > > > >> range flush. In the guest you want the DAX page dirty tracking to
> > > > >> communicate range flush information to the host, but there's no
> > > > >> readily available block i/o semantic that software running on top of
> > > > >> the fake pmem device can use to communicate with the host. Instead
> > > > >> you want to intercept the dax_flush() operation and turn it into a
> > > > >> queued request on the host.
> > > > >>
> > > > >> In 4.13 we have turned this dax_flush() operation into an explicit
> > > > >> driver call. That seems a better interface to modify than trying to
> > > > >> map block-storage flush-cache / force-unit-access commands to this
> > > > >> host request.
> > > > >>
> > > > >> The additional piece you would need to consider is whether to track
> > > > >> all writes in addition to mmap writes in the guest as DAX-page-cache
> > > > >> dirtying events, or arrange for every dax_copy_from_iter() operation()
> > > > >> to also queue a sync on the host, but that essentially turns the
> > > > >> host page cache into a pseudo write-through mode.
> > > > >
> > > > > I suspect initially it will be fine to not offer DAX
> > > > > semantics to applications using these "fake DAX" devices
> > > > > from a virtual machine, because the DAX APIs are designed
> > > > > for a much higher performance device than these fake DAX
> > > > > setups could ever give.
> > > >
> > > > Right, we don't need DAX, per se, in the guest.
> > > > >
> > > > > Having userspace call fsync/msync like done normally, and
> > > > > having those coarser calls be turned into somewhat efficient
> > > > > backend flushes would be perfectly acceptable.
> > > > >
> > > > > The big question is, what should that kind of interface look
> > > > > like?
> > > >
> > > > To me, this looks much like the dirty cache tracking that is done in
> > > > the address_space radix for the DAX case, but modified to coordinate
> > > > queued / page-based flushing when the guest wants to persist data.
> > > > The similarity to DAX is not storing guest allocated pages in the
> > > > radix but entries that track dirty guest physical addresses.
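
[ To make the radix-style tracking above a bit more concrete, here is a
  rough userspace sketch of what the host side could do with dirty
  guest-physical ranges reported by the guest. All helper names and the
  fixed-size table are hypothetical, not existing QEMU/KVM or kernel code;
  only sync_file_range()/fdatasync() are real APIs. ]

    /* Hypothetical host-side helper: remember dirty guest-physical ranges
     * and write back only those ranges of the backing image file when the
     * guest asks for persistence. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    struct dirty_range {
        off_t offset;   /* offset into the backing image file */
        off_t len;
    };

    #define MAX_RANGES 1024

    static struct dirty_range ranges[MAX_RANGES];
    static int nr_ranges;

    /* Guest reported a dirtied range (e.g. from its DAX dirty tracking). */
    static void record_dirty(off_t offset, off_t len)
    {
        if (nr_ranges < MAX_RANGES) {
            ranges[nr_ranges].offset = offset;
            ranges[nr_ranges].len = len;
            nr_ranges++;
        }
        /* a real implementation would merge overlaps and fall back to a
         * full-file sync when the table overflows */
    }

    /* Guest asked for persistence (its fsync/msync/flush). */
    static int flush_dirty(int img_fd)
    {
        int i;

        /* start writeback of just the dirty ranges... */
        for (i = 0; i < nr_ranges; i++)
            if (sync_file_range(img_fd, ranges[i].offset, ranges[i].len,
                                SYNC_FILE_RANGE_WRITE) < 0)
                return -1;
        nr_ranges = 0;

        /* ...then make it durable; sync_file_range() alone gives no
         * durability guarantee */
        return fdatasync(img_fd);
    }

    int main(int argc, char **argv)
    {
        int fd = open(argc > 1 ? argv[1] : "guest-image.raw", O_RDWR);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        record_dirty(0, 4096);  /* pretend the guest dirtied its first page */
        if (flush_dirty(fd) < 0)
            perror("flush_dirty");
        close(fd);
        return 0;
    }

[ The real interface would of course still have to get these ranges out of
  the guest somehow (a virtio ring, flush hint writes, ...), which is exactly
  the open question discussed below. ]
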
> > >
> > > Let me check whether I understand the problem correctly. So we want to
> > > export a block device (essentially the page cache of this block device)
> > > to a guest as PMEM and use DAX in the guest to save the guest's page
> > > cache. The
> >
> > that's correct.
> >
> > > natural way to make the persistence work would be to make the ->flush
> > > callback of the PMEM device do an upcall to the host, which could then
> > > fdatasync() the appropriate image file range; however, the performance
> > > would suck in such a case since ->flush gets called for at most one-page
> > > ranges from DAX.
> >
> > The discussion is: sync a range using a paravirt device or flush hint
> > addresses, vs a block device flush.
> >
> > >
> > > So what you could do instead is to completely ignore ->flush calls for
> > > the PMEM device and instead catch the bio with the REQ_PREFLUSH flag set
> > > on the PMEM device (generated by blkdev_issue_flush() or the journalling
> > > machinery) and fdatasync() the whole image file at that moment - in fact
> > > you must do that for metadata IO to hit persistent storage anyway in your
> > > setting. This would very closely follow how exporting block devices with
> > > a volatile cache works with KVM these days AFAIU, and the performance
> > > will be the same.
> >
> > Yes, 'blkdev_issue_flush' does set the 'REQ_OP_WRITE | REQ_PREFLUSH' flags.
> > Going by the suggestions, the block device flush looks like the way ahead.
> >
> > Would an asynchronous block flush on the guest side (putting the current
> > task on a wait queue until the host-side fdatasync completes) solve the
> > purpose? Or do we need another paravirt device for this?
>
> Well, even currently, if you have a PMEM device you still also have a
> block device and a request queue associated with it, and metadata IO goes
> through that path. So in your case you will have the same in the guest as
> a result of exposing a virtual PMEM device to the guest, and you just need
> to make sure this virtual block device behaves the same way as traditional
> virtualized block devices in KVM in response to
> 'REQ_OP_WRITE | REQ_PREFLUSH' requests.

It looks like the only way to send a flush (for the block device) from the
guest to the host with an NVDIMM is via flush hint addresses. Is this the
correct interface I am looking at?

blkdev_issue_flush
  submit_bio_wait
    submit_bio
      generic_make_request
        pmem_make_request
          ...
          if (bio->bi_opf & REQ_FLUSH)
                  nvdimm_flush(nd_region);
          ...

Thanks,
Pankaj

>
> 								Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR
>
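
[ Appendix to the question above about an asynchronous guest-side flush: a
  toy userspace model of the handshake, not a real virtio/KVM interface.
  The "guest" thread queues a flush request and sleeps until the "host"
  thread has finished fdatasync() on the backing image file; everything
  except fdatasync() and pthreads is made up for illustration. ]

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static int flush_requested;   /* single-slot stand-in for a virtqueue */
    static int flush_done;
    static int img_fd;            /* backing image file on the host */

    /* Guest side: what the virtual block device would do for REQ_PREFLUSH. */
    static void guest_issue_flush(void)
    {
        pthread_mutex_lock(&lock);
        flush_requested = 1;
        flush_done = 0;
        pthread_cond_signal(&cond);       /* "kick" the host */
        while (!flush_done)               /* sleep until completion */
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
    }

    /* Host side: service one flush request, then report completion. */
    static void *host_thread(void *arg)
    {
        pthread_mutex_lock(&lock);
        while (!flush_requested)
            pthread_cond_wait(&cond, &lock);
        flush_requested = 0;
        pthread_mutex_unlock(&lock);

        fdatasync(img_fd);                /* flush the whole image file */

        pthread_mutex_lock(&lock);
        flush_done = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        pthread_t host;

        img_fd = open(argc > 1 ? argv[1] : "guest-image.raw", O_RDWR);
        if (img_fd < 0) {
            perror("open");
            return 1;
        }
        pthread_create(&host, NULL, host_thread, NULL);
        guest_issue_flush();   /* returns only after fdatasync() finished */
        pthread_join(host, NULL);
        close(img_fd);
        return 0;
    }

[ In a real implementation the "kick" would presumably be a virtqueue
  notification or a write to a flush hint address, and the completion would
  come back to the guest as an interrupt rather than a condition variable. ]
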