On Tue, Feb 09, 2021 at 11:29:56AM -0800, Mike Kravetz wrote:
> On 2/5/21 6:36 PM, Peter Xu wrote:
> > On Fri, Feb 05, 2021 at 01:53:34PM -0800, Mike Kravetz wrote:
> >> On 1/29/21 2:49 PM, Peter Xu wrote:
> >>> On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
> >>>> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
> >> ...
> >>> Huge & Mike,
> >>>
> >>> Would any of you have comment/concerns on the high-level design of
> >>> this series?
> >>>
> >>> It would be great to know it, especially major objection, before move
> >>> on to an non-rfc version.
> >>
> >> My apologies for not looking at this sooner.  Even now, I have only
> >> taken a very brief look at the hugetlbfs patches.
> >>
> >> Coincidentally, I am working on the 'BUG' that soft dirty does not work
> >> for hugetlbfs.  As you can imagine, there is some overlap in handling of
> >> wp ptes set for soft dirty.  In addition, pmd sharing must be disabled
> >> for soft dirty as here and in Axel's uffd minor fault code.
> >
> > Interesting to know that we'll reach and need something common from
> > different directions, especially when they all mostly happen at the
> > same time. :)
> >
> > Is there a real "BUG" that you mentioned?  I'd be glad to read about it
> > if there is a link or something.
>
> Sorry, I was referring to a bugzilla bug not a BUG().  Bottom line is that
> hugetlb was mostly overlooked when soft dirty support was added.  A thread
> mostly from me is at:
> lore.kernel.org/r/999775bf-4204-2bec-7c3d-72d81b4fce30@xxxxxxxxxx
> I am close to sending out a RFC, but keep getting distracted.

Thanks.  Indeed I see no reason not to support hugetlb for soft dirty.
Tracking 1G huge pages could be too coarse and heavy, but 2M at least still
seems reasonable.

> >> No objections to the overall approach based on my quick look.
> >
> > Thanks for having a look.
> >
> > So for hugetlb one major thing is indeed about the pmd sharing part,
> > which seems that we've got very good consensus on.
>
> Yes
>
> > The other thing that I'd love to get some comment would be a shared
> > topic with shmem in that: for a file-backed memory type, uffd-wp needs
> > a consolidated way to record wr-protect information even if the pgtable
> > entries were flushed.  That comes from a fundamental difference between
> > anonymous and file-backed memory in that anonymous pages keep all info
> > in the pgtable entry, but file-backed memory is not, e.g., pgtable
> > entries can be dropped at any time as long as page cache is there.
>
> Sorry, but I can not recall this difference for hugetlb pages.  What
> operations lead to flushing of pagetable entries?  It would need to be
> something other than unmap as it seems we want to lose the information in
> unmap IIUC.

For hugetlbfs I know of two cases.

One is exactly huge pmd sharing as mentioned above, where we'll drop the
pgtable entries for a specific process while the page cache will still
exist.

The other one is hugetlbfs_punch_hole(), where hugetlb_vmdelete_list() is
called before remove_inode_hugepages().  For uffd-wp, that leaves a very
small window in which a wr-protected huge page can be written after the
pgtable entry is flushed but before the page is finally dropped in
remove_inode_hugepages().  In some apps that could cause data loss.
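
To make that window concrete, here is a simplified sketch of the
interleaving I'm worried about (function names are from
fs/hugetlbfs/inode.c; the steps are condensed and illustrative):

  hole punching task                    concurrent writer
  ------------------                    -----------------
  hugetlbfs_punch_hole()
    hugetlb_vmdelete_list()
      zaps pgtable entries; any
      uffd-wp bit in them is lost
                                        write to the wr-protected page
                                        no wp bit found, so the write
                                        succeeds silently (missed wp event)
    remove_inode_hugepages()
      page cache finally dropped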

> > I goes to look at soft-dirty then regarding this issue, and there's
> > actually a paragraph about it:
> >
> > While in most cases tracking memory changes by #PF-s is more than
> > enough there is still a scenario when we can lose soft dirty bits -- a
> > task unmaps a previously mapped memory region and then maps a new one
> > at exactly the same place.  When unmap is called, the kernel internally
> > clears PTE values including soft dirty bits.  To notify user space
> > application about such memory region renewal the kernel always marks
> > new memory regions (and expanded regions) as soft dirty.
> >
> > I feel like it just means soft-dirty currently allows false positives:
> > we could have set the soft dirty bit even if the page is clean.  And
> > that's what this series wanted to avoid: it used the new concept called
> > "swap special pte" to persistent that information even for file-backed
> > memory.  That all goes for avoiding those false positives.
>
> Yes, I have seen this with soft dirty.  It really does not seem right.
> When you first create a mapping, even before faulting in anything the vma
> is marked VM_SOFTDIRTY and from the user's perspective all addresses/pages
> appear dirty.

Right, that seems suboptimal.  It is understandable, though: dirty info is
tolerant of false positives, so soft-dirty could simply sidestep the issue
that uffd-wp has to solve in this series.  It would be great to know
whether the current approach in this series would work for removing those
false positives.

> To be honest, I am not sure you want to try and carry
> per-process/per-mapping wp information in the file.

What this series does is try to persist that information in the pgtable
entries, rather than in the file (or page cache).  Frankly, I can't say
whether that's optimal either, so I'm always open to comments.  So far I
think it's a valid solution, but it's always possible that I missed
something important.

> In the comment about soft dirty above, it seems reasonable that unmapping
> would clear all soft dirty information.  Also, unmapping would clear any
> uffd state/information.

Right, unmap should always mean "drop all information in the ptes".  It's
the patch below where we tried to treat things differently:

https://github.com/xzpeter/linux/commit/e958e9ee8d33e9a6602f93cdbe24a0c3614ab5e2

A quick summary of the patch: only when we unmap or truncate the hugetlbfs
file do we call hugetlb_vmdelete_list() with ZAP_FLAG_DROP_FILE_UFFD_WP
(which means we drop all the information, including the uffd-wp bit).
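
For completeness, below is a minimal userspace sketch of the uffd-wp usage
that these zap paths must not break.  The UFFDIO_* ioctls exist today for
anonymous memory (since 5.7); applying them to a hugetlbfs mapping assumes
this series, and "/dev/hugepages/test" assumes a default hugetlbfs mount.
Error handling and the thread that reads wp fault messages are omitted:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

#define LEN (2UL << 20)			/* one 2M huge page */

int main(void)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	int fd = open("/dev/hugepages/test", O_CREAT | O_RDWR, 0600);

	ftruncate(fd, LEN);
	char *addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd, 0);

	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	/* Register the mapping for write-protect tracking */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = LEN },
		.mode  = UFFDIO_REGISTER_MODE_WP,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	memset(addr, 1, LEN);		/* fault the page in and dirty it */

	/* Wr-protect the range: from now on a write should deliver a
	 * uffd-wp message instead of silently succeeding.  A pgtable zap
	 * that loses the wp bit (pmd unshare, hole punch) breaks exactly
	 * this expectation. */
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = LEN },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	return 0;
}

Thanks,

-- 
Peter Xu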