Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

Marc Zyngier <maz@xxxxxxxxxx> · Wed, 24 Aug 2022 21:57:13 +0100

On Wed, 24 Aug 2022 17:21:50 +0100,
Peter Xu <peterx@xxxxxxxxxx> wrote:
> 
> On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
> > On Wed, 24 Aug 2022 00:19:04 +0100,
> > Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > 
> > > On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
> > > > Atomicity doesn't guarantee ordering, unfortunately.
> > > 
> > > Right, sorry to be misleading.  The "atomicity" part I was trying to say
> > > the kernel will always see consistent update on the fields.
> > >
> > > The ordering should also be guaranteed, because things must happen with
> > > below sequence:
> > > 
> > >   (1) kernel publish dirty GFN data (slot, offset)
> > >   (2) kernel publish dirty GFN flag (set to DIRTY)
> > >   (3) user sees DIRTY, collects (slots, offset)
> > >   (4) user sets it to RESET
> > >   (5) kernel reads RESET
> > 
> > Maybe. Maybe not. The reset could well be sitting in the CPU write
> > buffer for as long as it wants and not be seen by the kernel if the
> > read occurs on another CPU. And that's the crucial bit: single-CPU is
> > fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
> > on collection, and global on reset (this seems like a bad decision,
> > but it is too late to fix this).
> 
> Regarding the last statement, that's something I had question too and
> discussed with Paolo, even though at that time it's not a outcome of
> discussing memory ordering issues.
> 
> IIUC the initial design was trying to avoid tlb flush flood when vcpu
> number is large (each RESET per ring even for one page will need all vcpus
> to flush, so O(N^2) flushing needed). With global RESET it's O(N).  So
> it's kind of a trade-off, and indeed until now I'm not sure which one is
> better.  E.g., with per-ring reset, we can have locality too in userspace,
> e.g. the vcpu thread might be able to recycle without holding global locks.

I don't get that. On x86, each CPU must perform the TLB invalidation
(there is an IPI for that). So whether you do a per-CPU scan of the
ring or a global scan is irrelevant: each entry you find in any of the
rings must result in a global invalidation, since you've updated the
PTE to make the page RO.

The same is true on ARM, except that the broadcast is done in HW
instead of being tracked in SW.

Buy anyway, this is all moot. The API is what it is, and it isn't
going to change any time soon. All we can do is add some
clarifications to the API for the more relaxed architectures, and make
sure the kernel behaves accordingly.

[...]

> > It may be safe, but it isn't what the userspace API promises.
> 
> The document says:
> 
>   After processing one or more entries in the ring buffer, userspace calls
>   the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that
>   the kernel will reprotect those collected GFNs.  Therefore, the ioctl
>   must be called *before* reading the content of the dirty pages.
> 
> I'd say it's not an explicit promise, but I think I agree with you that at
> least it's unclear on the behavior.

This is the least problematic part of the documentation. The bit I
literally choke on is this:

<quote>
It's not necessary for userspace to harvest the all dirty GFNs at once.
However it must collect the dirty GFNs in sequence, i.e., the userspace
program cannot skip one dirty GFN to collect the one next to it.
</quote>

This is the core of the issue. Without ordering rules, the consumer on
the other side cannot observe the updates correctly, even if userspace
follows the above to the letter. Of course, the kernel itself must do
the right thing (I guess it amounts to the kernel doing a
load-acquire, and userspace doing a store-release -- effectively
emulating x86...).

> Since we have a global recycle mechanism, most likely the app (e.g. current
> qemu impl) will use the same thread to collect/reset dirty GFNs, and
> trigger the RESET ioctl().  In that case it's safe, IIUC, because no
> cross-core ops.
> 
> QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):
> 
>     if (total) {
>         ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
>         assert(ret == total);
>     }
> 
> I think the assert() should never trigger as mentioned above.  But ideally
> maybe it should just be a loop until cleared gfns match total.

Right. If userspace calls the ioctl on every vcpu, then things should
work correctly. It is only that the overhead is higher than what it
should be if multiple vcpus perform a reset at the same time.

>
> > In other words, without further straightening of the API, this doesn't
> > work as expected on relaxed memory architectures. So before this gets
> > enabled on arm64, this whole ordering issue must be addressed.
> 
> How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the
> possibility of recycling partial of the pages, especially when collection
> and the ioctl() aren't done from the same thread?

I'd rather tell people about the ordering rules. That will come at
zero cost on x86.

> Any suggestions will be greatly welcomed.

I'll write a couple of patch when I get the time, most likely next
week. Gavin will hopefully be able to take them as part of his series.

	M.

-- 
Without deviation from the norm, progress is not possible.