Re: Reference count on pages held in secondary MMUs

On Sat, Jun 22, 2019 at 03:11:36PM -0400, Andrea Arcangeli wrote:
> Hello Christoffer,
> 
> On Tue, Jun 11, 2019 at 01:51:32PM +0200, Christoffer Dall wrote:
> > Sorry, what does this mean?  Do you mean that we can either do:
> > 
> >   on_vm_s2_fault() {
> >       page = gup();
> >       map_page_in_s2_mmu();
> >       put_page();
> >   }
> > 
> >   mmu_notifier_invalidate() {
> >       unmap_page_in_s2_mmu();
> >   }
> > 
> > or
> > 
> >   on_vm_s2_fault() {
> >       page = gup();
> >       map_page_in_s2_mmu();
> >   }
> > 
> >   mmu_notifier_invalidate() {
> >       unmap_page_in_s2_mmu();
> >       put_page();
> >   }
> 
> Yes, both work; refcounting always works.
> 
> > > and in fact Jerome also thinks
> > > like me that we should eventually optimize away the FOLL_GET and not
> > > take the refcount in the first place, 
> > 
> > So if I understood the above correctly, the next point is that there are
> > advantages to avoiding keeping the extra reference on that page, because
> > we have problematic race conditions related to set_page_dirty(), and we
> > can reduce the problem of race conditions further by not getting a
> > reference on the page at all when doing GUP as part of a KVM fault?
> 
> You could still keep the extra reference until the
> invalidate.
> 
> However, if you do the set_page_dirty in the context of the secondary
> MMU fault (i.e. atomically with the mapping of the page in the
> secondary MMU, with respect to MMU notifier invalidates), it solves
> the whole problem with ->page_mkwrite/page_mkclean and then you can
> already keep a long term GUP pin fully safely. That is a solution that
> always works and is guaranteed by design (by the MMU notifier) not to
> interfere with the _current_ writeback code in the filesystem. It also
> already provides stable pages.
> 

Ok.
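
Just to spell out for myself the second variant from above (dropping the
reference in the invalidate): the reference simply travels with the stage
2 mapping. A rough sketch only; the kvm_s2_* helpers below are made-up
placeholders, not existing symbols:

#include <linux/kvm_host.h>
#include <linux/mm.h>

/* Placeholders for whatever installs/tears down the S2 mapping. */
void kvm_s2_map_page(struct kvm *kvm, gfn_t gfn, struct page *page);
struct page *kvm_s2_unmap_page(struct kvm *kvm, gfn_t gfn);

/* Fault side: take the reference with gup and hand it to the S2 mapping. */
static void kvm_s2_fault_example(struct kvm *kvm, gfn_t gfn, unsigned long hva)
{
        struct page *page;

        if (get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE) != 1)
                return;

        spin_lock(&kvm->mmu_lock);
        kvm_s2_map_page(kvm, gfn, page); /* reference now owned by the mapping */
        spin_unlock(&kvm->mmu_lock);
}

/* Invalidate side: tear down the mapping, drop the reference taken at fault time. */
static void kvm_s2_invalidate_example(struct kvm *kvm, gfn_t gfn)
{
        struct page *page;

        spin_lock(&kvm->mmu_lock);
        page = kvm_s2_unmap_page(kvm, gfn);
        spin_unlock(&kvm->mmu_lock);

        if (page)
                put_page(page);
}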

> > Can you explain, or provide a pointer to, the root cause of the
> > problem with holding a reference on the page and setting it dirty?
> 
> The filesystem/VM cannot possibly expect set_page_dirty to be called
> again after it called page_mkclean. Supposedly a wrprotect fault
> should have been generated if somebody tried to write to the page
> under writeback, so page_mkwrite should have run again before you
> could have called set_page_dirty.
> 
> Instead page_mkclean failed to get rid of the long term GUP pin obtained
> with FOLL_WRITE because it simply can't ask the device to release it
> without an MMU notifier, so the device can later still call
> set_page_dirty despite page_mkclean having already run.
> 

I see, I'm now able to link this to recent articles on LWN.
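
To check my understanding, the problematic pattern would be roughly the
one below (a made-up driver, not real code): the device keeps a long term
writable GUP pin, writeback runs page_mkclean() in the meantime, and the
eventual set_page_dirty_lock() at unpin time hits a page the filesystem
already believes is clean:

#include <linux/mm.h>

/*
 * Made-up long term pin user. The device DMA-writes into the page for an
 * arbitrary amount of time after this returns; there is no MMU notifier,
 * so page_mkclean() during writeback cannot make the device stop.
 */
static long example_pin_buffer(unsigned long addr, struct page **page)
{
        return get_user_pages_unlocked(addr, 1, page, FOLL_WRITE);
}

/*
 * Called much later, possibly long after writeback already ran
 * page_mkclean() on the page. No wrprotect fault / ->page_mkwrite ran in
 * between (the device writes bypass the CPU page tables), so this
 * set_page_dirty_lock() is exactly the call the filesystem doesn't expect.
 */
static void example_unpin_buffer(struct page *page)
{
        set_page_dirty_lock(page);
        put_page(page);
}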

> > > but a whole different chapter is
> > > dedicated to the set_page_dirty_lock crash on MAP_SHARED mappings
> > > after long term GUP pins. So since you're looking into how to handle
> > > the page struct in the MMU notifier it's worth mentioning the issues
> > > related to set_page_dirty too.
> > 
> > Is there some background info on the "set_page_dirty_lock crash on
> > MAP_SHARED" ?  I'm having trouble following this without the background.
> 
> Jan Kara led the topic and explained all the details of this filesystem
> issue at LSF/MM this year and also last year.
> 
> Which is what makes me think there can't be too many use cases that
> require writeback to work while a long term GUP pin allows some device to
> write to the pages at any given time, if nobody requires this to be
> urgently fixed.
> 
> You can find coverage on lwn and on linux-mm.
> 
> > 
> > > 
> > > To achieve the cleanest writeback fix to avoid crashes in
> > > set_page_dirty_lock on long term secondary MMU mappings that support
> > > MMU notifiers like the KVM shadow MMU, the ideal is to mark the page
> > > dirty before establishing a writable mapping in the secondary MMU like
> > > in the model below.
> > > 
> > > The below solution also works for those secondary MMUs that are like a
> > > TLB: if there are two concurrent invalidates on the same page
> > > invoked at the same time (a potential problem Jerome noticed), you
> > > don't know which comes out first and you would risk calling
> > > set_page_dirty twice, which would still be potentially kernel crashing
> > > (even if only a theoretical issue, as with O_DIRECT).
> > 
> > Why is it problematic to call set_page_dirty() twice?  I thought that at
> > worst it would only lead to writing out data to disk unnecessarily ?
> 
> According to Jerome, after the first set_page_dirty returns, writeback
> could start before the second set_page_dirty has been called. So if
> there are additional random later invalidates the next ones shouldn't
> call set_page_dirty again.
> 
> The problem is that if you call set_page_dirty in the invalidate, you
> also have to make sure set_page_dirty is called only once.
> 
> There can be concurrent invalidates for the same page running at the
> same time, while there is only one page fault, and it runs atomically
> with respect to the MMU notifier invalidates (under whatever lock
> serializes the MMU notifier invalidates vs the secondary MMU page fault).
> 
> If you call set_page_dirty twice in a row, you again open the window
> for the writeback to have already called ->page_mkclean on the page
> after the first set_page_dirty, so the second set_page_dirty will
> then crash.
> 
> You can enforce calling it only once if you have sptes (shadow
> pagetables) like KVM has, so this is not an issue for KVM.
> 

Makes sense.  The key for my understanding was that an atomic
relationship between the page fault and the mmu notifier has to be
enforced.
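
In other words, something like the sketch below, where KVM's
mmu_notifier_seq / mmu_notifier_retry() machinery is the concrete form of
the "race with invalidate" check in your model (map_page_in_s2_mmu() is a
placeholder, and the gup call assumes the current signatures):

#include <linux/kvm_host.h>
#include <linux/mm.h>

/* Placeholder for whatever installs the stage 2 / shadow mapping. */
void map_page_in_s2_mmu(struct kvm *kvm, gfn_t gfn, struct page *page,
                        bool writable);

static void s2_fault_example(struct kvm *kvm, gfn_t gfn, unsigned long hva,
                             bool write_fault)
{
        struct page *page;
        unsigned long mmu_seq;

retry:
        mmu_seq = kvm->mmu_notifier_seq;
        smp_rmb();      /* read the sequence count before doing the gup */

        if (get_user_pages_unlocked(hva, 1, &page,
                                    write_fault ? FOLL_WRITE : 0) != 1)
                return;

        spin_lock(&kvm->mmu_lock);
        if (mmu_notifier_retry(kvm, mmu_seq)) {
                /* An invalidate ran since the gup: drop the page and retry. */
                spin_unlock(&kvm->mmu_lock);
                put_page(page);
                goto retry;
        }

        if (write_fault)
                set_page_dirty(page);   /* dirty before the writable mapping exists */
        map_page_in_s2_mmu(kvm, gfn, page, write_fault);
        spin_unlock(&kvm->mmu_lock);

        /* The MMU notifier logically keeps the mapping, so the ref can go here. */
        put_page(page);
}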

> > I am also not familiar with a problem related to KVM and O_DIRECT, so
> > I'm having trouble keeping up here as well :(
> 
> There's no problem in KVM and O_DIRECT.
> 
> There's a problem in O_DIRECT itself, regardless of whether it's qemu or
> any other app using it: it's just that the time window is too small to be
> noticeable. It's still a corollary of why we can't run
> set_page_dirty twice per page if there are concurrent MMU notifier
> invalidates.
> 
> > > So the below model
> > > will solve that and it's also valid for KVM/vhost acceleration,
> > > even though KVM can figure out how to issue a single set_page_dirty call
> > > for each spte that gets invalidated by concurrent invalidates on the
> > > same page, because it has shadow pagetables and is not just a TLB.
> > > 
> > >   access = FOLL_WRITE|FOLL_GET
> > > 
> > > repeat:
> > >   page = gup(access)
> > >   put_page(page)
> > > 
> > >   spin_lock(mmu_notifier_lock);
> > >   if (race with invalidate) {
> > >     spin_unlock..
> > >     goto repeat;
> > >   }
> > >   if (access & FOLL_WRITE)
> > >     set_page_dirty(page)
> > >   establish writable mapping in secondary MMU on page
> > >   spin_unlock
> > > 
> > > The above solves the crash in set_page_dirty_lock without having to
> > > modify any filesystem; it should be theoretically even safer than the
> > > O_DIRECT short term GUP pin.
> > 
> > That is not exactly how we do things today on the arm64 side.  We do
> > something that looks like:
> 
> The above is the model that solves all problems with writeback
> page_mkclean/page_mkwrite and provides stable pages to current
> filesystems, regardless of the low-level implementation details of the
> MMU notifier methods.
> 
> For KVM all models work, not only the above one, because we have sptes
> to disambiguate which is the first invalidate that has to run
> set_page_dirty.
> 
> > 
> >   /*
> >    * user_mem_abort is our function for a secondary MMU fault that
> >    * resolves to a memslot.
> >    */
> >   user_mem_abort() {
> >       page = gup(access, &writable);
> >       spin_lock(&kvm->mmu_lock);
> >       if (mmu_notifier_retry(kvm, mmu_seq))
> >           goto out; /* run the VM again and see what happens */
> > 
> >       if (writable)
> >           kvm_set_pfn_dirty(page_to_pfn(page));
> >       stage2_set_pte(); /* establish writable mapping in secondary MMU on page */
> > 
> >   out:
> >       spin_unlock(&kvm->mmu_lock);
> >       put_page(page);
> >   }
> > 
> > Should we rework this to address the race you are referring to, and are
> > other architectures already safe against this?
> 
> Actually it seems you mark the page dirty exactly where I suggested
> above: i.e. atomically with the secondary MMU mapping establishment
> with respect to the mmu notifier invalidates.
> 
> I don't see any problem with the above (well, you need to have a way to
> track whether you ran stage2_set_pte or took the "goto out", but the
> above is pseudocode).
> 
> There's a problem, however, in the kvm_set_pfn_dirty common code: it
> should call set_page_dirty, not SetPageDirty, or it won't do anything in
> the MAP_SHARED file-backed case. The current code is perfectly ok for the
> anon and MAP_PRIVATE write=1 cases.
> 
> However FOLL_TOUCH in gup already either calls set_page_dirty or it
> marks the Linux pte as dirty, so that's working around the lack of
> set_page_dirty... I wonder if we could just rely on the set_page_dirty
> in gup with FOLL_TOUCH and in fact drop SetPageDirty in KVM altogether.
> 

I may have misled you with my use of 'gup()' as a function above.  In
reality we use the gfn_to_pfn_prot() wrapper, which means there are
several things going on:

First, it appears we only do pte_mkdirty with FOLL_TOUCH if we also set
FOLL_WRITE.  This seems to be confirmed by the commentary on
get_user_pages, which says that set_page_dirty() must be called if a
page is written to and the page was found without FOLL_WRITE.

Second, KVM first attempts a __get_user_pages_fast(), and if that fails
does get_user_pages_unlocked(), but only sets FOLL_WRITE as part of a
write fault.  If the page is nevertheless mapped writable, I think we
still need the subsequent set_page_dirty() when a write fault happens on
the secondary MMU.

So I take it you'll send a patch addressing the SetPageDirty to
set_page_dirty problem?
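
For reference, I imagine the change would be roughly along the lines
below (written from memory, not a tested patch, and leaving aside whether
plain set_page_dirty() or set_page_dirty_lock() is the right variant for
the calling context):

/* virt/kvm/kvm_main.c -- rough idea only */
void kvm_set_pfn_dirty(kvm_pfn_t pfn)
{
        if (!kvm_is_reserved_pfn(pfn)) {
                struct page *page = pfn_to_page(pfn);

                /*
                 * SetPageDirty() only flips the page flag; set_page_dirty()
                 * also goes through the address_space, so the MAP_SHARED
                 * file-backed case gets the filesystem bookkeeping it needs.
                 */
                set_page_dirty(page);
        }
}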


> > > With regard to KVM this should be enough, but we also look for a crash
> > > avoidance solution for those devices that cannot support the MMU
> > > notifier for short and long term GUP pins.
> > 
> > Sorry, can you define short and long term GUP pins, and do we have
> > current examples of both?
> 
> Long term, as in mm/gup.c:FOLL_LONGTERM, means you expect to call
> get_user_pages with FOLL_GET and not release the refcount immediately
> after I/O completion; the page could remain pinned indefinitely. A
> virt example: vfio device assignment.
> 
> The most common example of a short term GUP pin (i.e. the default
> behavior of GUP) is O_DIRECT. I/O completion takes a short time (depending
> on the buffer size of course.. it could be 1TB of buffer I/O) but the pin
> is still going to be released ASAP.
> 
> > > There's lots of work going on on linux-mm to try to let those devices
> > > support writeback in a safe way (also with stable pages, so all fs
> > > integrity checks will pass) using a bounce buffer if a long term GUP pin
> > > is detected by the filesystem. In addition there's other work to make
> > > the short term GUP pin theoretically safe by delaying the writeback
> > > for the short window the GUP pin is taken by O_DIRECT, so it becomes
> > > theoretically safe too (currently it's only practically safe).
> > > 
> > > However I'm not sure if the long term GUP pins really need to support
> > > writeback.
> > > 
> > 
> > I don't think I understand the distinction between a long term GUP pin
> > that supports writeback vs. a short term GUP pin.  My question was
> > whether or not the pin could be dropped at the time the mapping was torn
> > down, or if it has to be done at the same time the mapping is
> > established, for things to work properly wrt. the semantics of memory
> > behavior of the rest of the kernel.
> 
> Yes, sorry, the question about the refcounting was just trivial to
> answer: it always works, you can drop the refcount anywhere.
> 
> I just thought that, since there was any doubt at all on the refcounting
> issue (which had an immediate black and white answer), it was safer to
> also raise awareness about the much more troubling and subtle issues with
> set_page_dirty caused by GUP refcounts.
> 

I understand, and thanks for doing that.  I just had to put the pieces
together on my end.
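
(Going back to the long term vs. short term definitions above, this is
how I now read them in gup terms; illustrative sketches only, not taken
from vfio or the block layer:)

#include <linux/mm.h>

/*
 * Long term pin (vfio-style): the pages may stay pinned for the lifetime
 * of the device mapping, so FOLL_LONGTERM is passed and gup can reject or
 * handle specially pages (e.g. fs-dax) that must not stay pinned forever.
 */
static long pin_long_term_example(unsigned long addr, unsigned long n,
                                  struct page **pages)
{
        return get_user_pages(addr, n, FOLL_WRITE | FOLL_LONGTERM, pages, NULL);
}

/*
 * Short term pin (O_DIRECT-style): the reference is only held for the
 * duration of the I/O and released with put_page() at completion.
 */
static long pin_for_io_example(unsigned long addr, unsigned long n,
                               struct page **pages)
{
        return get_user_pages(addr, n, FOLL_WRITE, pages, NULL);
}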

> > I'm not sure if we're talking about the same thing here, or if you're
> > explaining a different scenario?
> 
> Simply that the KVM secondary MMU fault has to mark the page dirty somehow
> (either in gup itself, or in the fault, or in the invalidates) in
> addition to dealing with the refcount. That's the connection.
> 
> However this set_page_dirty issue needs a solution that works for
> short term GUP pins that can't use the MMU notifier (O_DIRECT), long term
> GUP pins that can't use the MMU notifier (vfio) and the MMU notifier
> mappings (KVM page fault, whose refcount can be implicitly held by
> the MMU notifier itself, so in turn the put_page can go anywhere).
> 
> The solution to the O_DIRECT GUP pin is also highly connected with the
> GUP refcounting, because the solution is to alter the refcount so that
> the filesystem learns that there's a special refcount and
> ->page_mkclean can then be deferred as long as the special refcount is
> held by O_DIRECT. I argue the same special refcount can also be held
> by long term GUP pins and by the MMU notifier KVM page fault case (which
> will ideally drop FOLL_GET), so we can defer the page_mkwrite
> indefinitely for long term GUP pins too (there will be no deferred
> write in the MMU notifier case because ->page_mkclean invokes the MMU
> notifier invalidates).
> 
> If one wants to periodically flush the dirty MAP_SHARED memory to disk,
> the device needs to be told to drop the refcounts and stop writing to the
> memory. If the device isn't told to stop writing to the memory, what
> comes out is only coherent in 512 byte units, or maybe less, anyway.
> 

Thanks for the clarification.
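
For my own notes, I read the "special refcount" idea roughly as below;
nothing here exists today, and the names and the bias encoding are made
up purely to picture the direction being discussed on linux-mm:

#include <linux/mm.h>
#include <linux/page_ref.h>

/* Hypothetical: GUP pins add a large bias to page->_refcount. */
#define GUP_PIN_BIAS    1024

static inline bool page_has_gup_pin(struct page *page)
{
        return page_ref_count(page) >= GUP_PIN_BIAS;
}

/*
 * Hypothetical check on the writeback/page_mkclean() path: if a device may
 * still be writing through a GUP pin, defer the clean/writeback (or bounce
 * buffer the page) instead of pretending the page is stable.
 */
static bool page_safe_to_make_clean(struct page *page)
{
        return !page_has_gup_pin(page);
}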

    Christoffer


