Re: [PATCH RFC 10/24] userfaultfd: wp: add WP pagetable tracking to x86

Jerome Glisse <jglisse@xxxxxxxxxx> · Thu, 24 Jan 2019 10:40:50 -0500

On Thu, Jan 24, 2019 at 01:16:16PM +0800, Peter Xu wrote:
> On Mon, Jan 21, 2019 at 10:09:38AM -0500, Jerome Glisse wrote:
> > On Mon, Jan 21, 2019 at 03:57:08PM +0800, Peter Xu wrote:
> > > From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> > > 
> > > Accurate userfaultfd WP tracking is possible by tracking exactly which
> > > virtual memory ranges were writeprotected by userland. We can't rely
> > > only on the RW bit of the mapped pagetable because that information is
> > > destroyed by fork() or KSM or swap. If we were to rely on that, we'd
> > > need to stay on the safe side and generate false positive wp faults
> > > for every swapped out page.
> 
> (I'm trying to leave comments with my own understanding here; they
>  might not be the original purposes when Andrea proposed the idea.
>  Andrea, please feel free to chime in anytime especially if I am
>  wrong... :-)
> 
> > 
> > So you want to forward write fault (of a protected range) to user space
> > only if page is not write protected because of fork(), KSM or swap.
> > 
> > This write protection feature is only for anonymous pages, right? Otherwise
> > how would you protect a shared page (i.e. anyone can look it up,
> > call page_mkwrite on it and start writing to it)?
> 
> AFAIU we want to support shared memory too in the future.  One example
> I can think of is current QEMU usage with DPDK: we have two processes
> sharing the guest memory range.  So indeed this might not work if
> there are unknown/malicious users of the shared memory, however in
> many use cases the users are all known and AFAIU we should just write
> protect all these users then we'll still get notified when any of them
> write to a page.
> 
> > 
> > So for anonymous page for fork() the mapcount will tell you if page is
> > write protected for COW. For KSM it is easy check the page flag.
> 
> Yes I agree that KSM should be easy.  But for COW, please consider
> when we write protect a page that was shared and RW removed due to
> COW.  Then when we page fault on this page should we report to the
> monitor?  IMHO we can't know without a specific bit in the PTE.
> 
> > 
> > For swap you can use the page lock to synchronize. A page that is
> > write protected because of swap is write protected because it is being
> > written to disk, thus either under page lock, or with PageWriteback()
> > returning true while the write is ongoing.
> 
> For swap I think the major problem is when the page was swapped out of
> main memory and then we write to the page (which was already a swap
> entry now).  Then we'll first swap in the page into main memory again,
> but then IMHO we will face the same issue as with COW above - we can't
> judge whether this page was write protected by uffd-wp at all.  Of
> course here we can check the VMA flags and assume it's write
> protected if the UFFD_WP flag is set on the VMA, however we'll
> also mark those pages which were not write protected at all, hence
> it'll generate false positives of write protection messages.  This
> idea applies to the COW use case above too.  In conclusion, in these
> use cases we cannot identify write protection at page granularity
> without a specific _PAGE_UFFD_WP bit in the PTE entries.

So I need to think a bit more on this, probably not right now,
but just so I get the chain of events properly:
  1 - user space issues an ioctl to UFD to write protect a range
  2 - UFD sets a flag on the vma and updates the CPU page table
  3 - pages can be individually write faulted; each fault sends a
      signal to the UFD listener, which handles the fault
  4 - the UFD kernel side updates the page table once userspace has
      handled the fault and sent the result to UFD. At this point
      the vma still has the UFD write protect flag set.
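
To pin down the state I have in mind for steps 1-4, here is a toy
model. It is only a sketch: every name in it (toy_vma, toy_writeprotect,
toy_write_fault, toy_resolve) is made up for illustration and none of
it is a kernel structure or the real UFD interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4

/* Hypothetical stand-in for a vma plus its slice of the page table. */
struct toy_vma {
	bool uffd_wp;            /* step 2: range-wide flag on the vma */
	bool pte_write[NPAGES];  /* per-page write permission */
	int  pending_msgs;       /* faults forwarded but not yet resolved */
};

static void toy_init(struct toy_vma *v)
{
	v->uffd_wp = false;
	v->pending_msgs = 0;
	for (size_t i = 0; i < NPAGES; i++)
		v->pte_write[i] = true;
}

/* Steps 1 and 2: flag the vma and drop write permission on every pte. */
static void toy_writeprotect(struct toy_vma *v)
{
	v->uffd_wp = true;
	for (size_t i = 0; i < NPAGES; i++)
		v->pte_write[i] = false;
}

/* Step 3: a write to a read-only pte is forwarded to the listener.
 * Returns true if a message was queued, false if the write just succeeds. */
static bool toy_write_fault(struct toy_vma *v, size_t page)
{
	assert(page < NPAGES);
	if (v->pte_write[page])
		return false;
	v->pending_msgs++;
	return true;
}

/* Step 4: the listener is done, make that one pte writeable again.
 * The vma flag is deliberately left set. */
static void toy_resolve(struct toy_vma *v, size_t page)
{
	assert(page < NPAGES && v->pending_msgs > 0);
	v->pte_write[page] = true;
	v->pending_msgs--;
}
```

After toy_writeprotect(), one toy_write_fault() and one toy_resolve()
on the same page, the vma flag is still set while that single pte is
writeable and its neighbours are not.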

So at any point in time a range might contain writeable ptes
that correspond to already handled UFD write faults. Now
if COW, KSM or swap happens on those, then on the next write
fault you do not want to send a signal to userspace but handle
the fault just as usual?
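
Put as code, my reading of the question is the difference between the
two decision functions below. This is a hedged sketch, not the patch:
toy_pte, fault_vma_only, fault_with_pte_bit and lose_write_bit are
hypothetical names, and it assumes the per-pte marker (the
_PAGE_UFFD_WP bit mentioned above) survives whatever removed the RW
bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pte: the RW bit plus a separate uffd-wp marker. */
struct toy_pte {
	bool write;
	bool uffd_wp;
};

enum fault_action {
	FAULT_NONE,          /* pte is writeable, no fault happens */
	FAULT_KERNEL,        /* handle as usual (COW, swap-in, ...) */
	FAULT_TO_USERSPACE,  /* forward to the UFD listener */
};

/* Decide from the vma flag alone: every read-only pte in a flagged
 * vma is reported, including ones userspace already resolved. */
static enum fault_action fault_vma_only(bool vma_uffd_wp, struct toy_pte pte)
{
	if (pte.write)
		return FAULT_NONE;
	return vma_uffd_wp ? FAULT_TO_USERSPACE : FAULT_KERNEL;
}

/* Decide from the per-pte marker: only ptes still marked are reported. */
static enum fault_action fault_with_pte_bit(struct toy_pte pte)
{
	if (pte.write)
		return FAULT_NONE;
	return pte.uffd_wp ? FAULT_TO_USERSPACE : FAULT_KERNEL;
}

/* Stand-in for COW, KSM or swap: RW is lost, the marker is assumed kept. */
static struct toy_pte lose_write_bit(struct toy_pte pte)
{
	pte.write = false;
	return pte;
}
```

For a pte that was already resolved (write set, marker clear) and then
lost RW, fault_vma_only() reports to userspace, which is the false
positive, while fault_with_pte_bit() lets the kernel handle it as usual.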

I believe this is the event flow, so I will ponder on this some
more :)

Cheers,
Jérôme