On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel <riel@xxxxxxxxxx> wrote:
> On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
>> [ adding Ross and Jan ]
>>
>> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel <riel@xxxxxxxxxx> wrote:
>> >
>> > The goal is to increase density of guests, by moving page
>> > cache into the host (where it can be easily reclaimed).
>> >
>> > If we assume the guests will be backed by relatively fast
>> > SSDs, a "whole device flush" from filesystem journaling
>> > code (issued where the filesystem issues a barrier or
>> > disk cache flush today) may be just what we need to make
>> > that work.
>>
>> Ok, apologies, I indeed had some pieces of the proposal confused.
>>
>> However, it still seems like the storage interface is not capable of
>> expressing what is needed, because the operation that is needed is a
>> range flush. In the guest you want the DAX page dirty tracking to
>> communicate range flush information to the host, but there's no
>> readily available block i/o semantic that software running on top of
>> the fake pmem device can use to communicate with the host. Instead
>> you want to intercept the dax_flush() operation and turn it into a
>> queued request on the host.
>>
>> In 4.13 we have turned this dax_flush() operation into an explicit
>> driver call. That seems a better interface to modify than trying to
>> map block-storage flush-cache / force-unit-access commands to this
>> host request.
>>
>> The additional piece you would need to consider is whether to track
>> all writes, in addition to mmap writes, in the guest as DAX-page-cache
>> dirtying events, or arrange for every dax_copy_from_iter() operation
>> to also queue a sync on the host, but that essentially turns the host
>> page cache into a pseudo write-through mode.
>
> I suspect initially it will be fine to not offer DAX
> semantics to applications using these "fake DAX" devices
> from a virtual machine, because the DAX APIs are designed
> for a much higher performance device than these fake DAX
> setups could ever give.

Right, we don't need DAX, per se, in the guest.

> Having userspace call fsync/msync as they do normally, and
> having those coarser calls be turned into somewhat efficient
> backend flushes, would be perfectly acceptable.
>
> The big question is, what should that kind of interface look
> like?

To me, this looks much like the dirty cache tracking that is done in
the address_space radix for the DAX case, but modified to coordinate
queued / page-based flushing when the guest wants to persist data.
The similarity to DAX would be storing in the radix not guest-allocated
pages, but entries that track dirty guest physical addresses.
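
To make that a bit more concrete, here is a toy userspace sketch of the
shape I have in mind. Every name in it (track_dirty(), host_queue_flush(),
flush_dirty_ranges()) is purely illustrative, not an existing kernel or
virtio API, and a flat array stands in for the per-inode radix tree:

/*
 * Toy model: the guest tracks dirty guest physical addresses at page
 * granularity instead of dirty page cache pages, and the guest-side
 * fsync()/msync() path drains that set as queued flush requests to
 * the host.  Illustrative only; not a real interface.
 */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define MAX_DIRTY	1024

static uint64_t dirty_pfn[MAX_DIRTY];	/* stand-in for the radix tree */
static int nr_dirty;

/* would be called from the guest's write / mmap-dirty paths */
static void track_dirty(uint64_t gpa)
{
	uint64_t pfn = gpa >> PAGE_SHIFT;
	int i;

	for (i = 0; i < nr_dirty; i++)
		if (dirty_pfn[i] == pfn)
			return;			/* page already tracked */
	if (nr_dirty < MAX_DIRTY)
		dirty_pfn[nr_dirty++] = pfn;
}

/* stand-in for a queued request asking the host to sync that range */
static void host_queue_flush(uint64_t gpa, uint64_t len)
{
	printf("flush request: gpa 0x%llx len 0x%llx\n",
	       (unsigned long long)gpa, (unsigned long long)len);
}

/* what the guest-side fsync()/msync() would roughly boil down to */
static void flush_dirty_ranges(void)
{
	int i;

	for (i = 0; i < nr_dirty; i++)
		host_queue_flush(dirty_pfn[i] << PAGE_SHIFT, PAGE_SIZE);
	nr_dirty = 0;		/* clear once the host acknowledges */
}

int main(void)
{
	track_dirty(0x10000);
	track_dirty(0x10800);	/* same page, tracked only once */
	track_dirty(0x42000);
	flush_dirty_ranges();
	return 0;
}

The point of the sketch is only the data flow: dirty tracking keyed by
guest physical address in the guest, with the coarse fsync/msync call
turned into per-range requests the host can service asynchronously,
rather than a whole-device flush.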