On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@xxxxxxxxxx> wrote:
> On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
>> [ adding Ross and Jan ]
>>
>> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@xxxxxxxxxx> wrote:
>> >
>> > The goal is to increase density of guests, by moving page
>> > cache into the host (where it can be easily reclaimed).
>> >
>> > If we assume the guests will be backed by relatively fast
>> > SSDs, a "whole device flush" from filesystem journaling
>> > code (issued where the filesystem issues a barrier or
>> > disk cache flush today) may be just what we need to make
>> > that work.
>>
>> Ok, apologies, I indeed had some pieces of the proposal confused.
>>
>> However, it still seems like the storage interface is not capable of
>> expressing what is needed, because the operation that is needed is a
>> range flush. In the guest you want the DAX page dirty tracking to
>> communicate range flush information to the host, but there's no
>> readily available block i/o semantic that software running on top of
>> the fake pmem device can use to communicate with the host. Instead
>> you want to intercept the dax_flush() operation and turn it into a
>> queued request on the host.
>>
>> In 4.13 we have turned this dax_flush() operation into an explicit
>> driver call. That seems a better interface to modify than trying to
>> map block-storage flush-cache / force-unit-access commands to this
>> host request.
>>
>> The additional piece you would need to consider is whether to track
>> all writes, in addition to mmap writes, in the guest as DAX-page-cache
>> dirtying events, or arrange for every dax_copy_from_iter() operation
>> to also queue a sync on the host, but that essentially turns the host
>> page cache into a pseudo write-through mode.
>
> I suspect initially it will be fine to not offer DAX
> semantics to applications using these "fake DAX" devices
> from a virtual machine, because the DAX APIs are designed
> for a much higher performance device than these fake DAX
> setups could ever give.

Right, we don't need DAX, per se, in the guest.

> Having userspace call fsync/msync as they do normally, and
> having those coarser calls be turned into somewhat efficient
> backend flushes, would be perfectly acceptable.
>
> The big question is, what should that kind of interface look
> like?

To me, this looks much like the dirty cache tracking that is done in
the address_space radix for the DAX case, but modified to coordinate
queued / page-based flushing when the guest wants to persist data.
The similarity to DAX would be storing in the radix not guest-allocated
pages, but entries that track dirty guest physical addresses.
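
To make that a bit more concrete, here is a toy userspace sketch of the
shape I have in mind. Every name in it (track_dirty(), host_queue_flush(),
flush_dirty_ranges()) is purely illustrative, not an existing kernel or
virtio API, and a flat array stands in for the per-inode radix tree:

/*
 * Toy model: the guest tracks dirty guest physical addresses at page
 * granularity instead of dirty page cache pages, and the guest-side
 * fsync()/msync() path drains that set as queued flush requests to
 * the host.  Illustrative only; not a real interface.
 */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define MAX_DIRTY	1024

static uint64_t dirty_pfn[MAX_DIRTY];	/* stand-in for the radix tree */
static int nr_dirty;

/* would be called from the guest's write / mmap-dirty paths */
static void track_dirty(uint64_t gpa)
{
	uint64_t pfn = gpa >> PAGE_SHIFT;
	int i;

	for (i = 0; i < nr_dirty; i++)
		if (dirty_pfn[i] == pfn)
			return;			/* page already tracked */
	if (nr_dirty < MAX_DIRTY)
		dirty_pfn[nr_dirty++] = pfn;
}

/* stand-in for a queued request asking the host to sync that range */
static void host_queue_flush(uint64_t gpa, uint64_t len)
{
	printf("flush request: gpa 0x%llx len 0x%llx\n",
	       (unsigned long long)gpa, (unsigned long long)len);
}

/* what the guest-side fsync()/msync() would roughly boil down to */
static void flush_dirty_ranges(void)
{
	int i;

	for (i = 0; i < nr_dirty; i++)
		host_queue_flush(dirty_pfn[i] << PAGE_SHIFT, PAGE_SIZE);
	nr_dirty = 0;		/* clear once the host acknowledges */
}

int main(void)
{
	track_dirty(0x10000);
	track_dirty(0x10800);	/* same page, tracked only once */
	track_dirty(0x42000);
	flush_dirty_ranges();
	return 0;
}

The point of the sketch is only the data flow: dirty tracking keyed by
guest physical address in the guest, with the coarse fsync/msync call
turned into per-range requests the host can service asynchronously,
rather than a whole-device flush.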