On Tue, Feb 09, 2021 at 11:29:56AM -0800, Mike Kravetz wrote:
> On 2/5/21 6:36 PM, Peter Xu wrote:
> > On Fri, Feb 05, 2021 at 01:53:34PM -0800, Mike Kravetz wrote:
> >> On 1/29/21 2:49 PM, Peter Xu wrote:
> >>> On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
> >>>> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
> >> ...
> >>> Huge & Mike,
> >>>
> >>> Would any of you have comment/concerns on the high-level design of
> >>> this series?
> >>>
> >>> It would be great to know it, especially major objection, before move
> >>> on to an non-rfc version.
> >>
> >> My apologies for not looking at this sooner.  Even now, I have only
> >> taken a very brief look at the hugetlbfs patches.
> >>
> >> Coincidentally, I am working on the 'BUG' that soft dirty does not work
> >> for hugetlbfs.  As you can imagine, there is some overlap in handling of
> >> wp ptes set for soft dirty.  In addition, pmd sharing must be disabled
> >> for soft dirty as here and in Axel's uffd minor fault code.
> >
> > Interesting to know that we'll reach and need something common from
> > different directions, especially when they all mostly happen at the
> > same time. :)
> >
> > Is there a real "BUG" that you mentioned?  I'd be glad to read about it
> > if there is a link or something.
>
> Sorry, I was referring to a bugzilla bug not a BUG().  Bottom line is that
> hugetlb was mostly overlooked when soft dirty support was added.  A thread
> mostly from me is at:
> lore.kernel.org/r/999775bf-4204-2bec-7c3d-72d81b4fce30@xxxxxxxxxx
> I am close to sending out a RFC, but keep getting distracted.

Thanks.  Indeed I see no reason not to support hugetlb for soft dirty.
Tracking 1G huge pages could be too coarse and heavy, but 2M at least still
seems reasonable.

> >> No objections to the overall approach based on my quick look.
> >
> > Thanks for having a look.
> >
> > So for hugetlb one major thing is indeed about the pmd sharing part,
> > which seems that we've got very good consensus on.
>
> Yes
>
> > The other thing that I'd love to get some comment would be a shared
> > topic with shmem in that: for a file-backed memory type, uffd-wp needs
> > a consolidated way to record wr-protect information even if the pgtable
> > entries were flushed.  That comes from a fundamental difference between
> > anonymous and file-backed memory in that anonymous pages keep all info
> > in the pgtable entry, but file-backed memory is not, e.g., pgtable
> > entries can be dropped at any time as long as page cache is there.
>
> Sorry, but I can not recall this difference for hugetlb pages.  What
> operations lead to flushing of pagetable entries?  It would need to be
> something other than unmap as it seems we want to lose the information in
> unmap IIUC.

For hugetlbfs I know of two cases.

One is exactly huge pmd sharing as mentioned above, where we'll drop the
pgtable entries for a specific process while the page cache will still
exist.

The other one is hugetlbfs_punch_hole(), where hugetlb_vmdelete_list() is
called before remove_inode_hugepages().  For uffd-wp, that leaves a very
small window in which a wr-protected huge page can be written after the
pgtable entry is flushed but before the page is finally dropped in
remove_inode_hugepages().  In some apps that could cause data loss.
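
To make that window concrete, here is a simplified sketch of the
interleaving I'm worried about (function names are from
fs/hugetlbfs/inode.c; the steps are condensed and illustrative):

  hole punching task                    concurrent writer
  ------------------                    -----------------
  hugetlbfs_punch_hole()
    hugetlb_vmdelete_list()
      zaps pgtable entries; any
      uffd-wp bit in them is lost
                                        write to the wr-protected page
                                        no wp bit found, so the write
                                        succeeds silently (missed wp event)
    remove_inode_hugepages()
      page cache finally dropped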

> > I goes to look at soft-dirty then regarding this issue, and there's
> > actually a paragraph about it:
> >
> > While in most cases tracking memory changes by #PF-s is more than
> > enough there is still a scenario when we can lose soft dirty bits -- a
> > task unmaps a previously mapped memory region and then maps a new one
> > at exactly the same place.  When unmap is called, the kernel internally
> > clears PTE values including soft dirty bits.  To notify user space
> > application about such memory region renewal the kernel always marks
> > new memory regions (and expanded regions) as soft dirty.
> >
> > I feel like it just means soft-dirty currently allows false positives:
> > we could have set the soft dirty bit even if the page is clean.  And
> > that's what this series wanted to avoid: it used the new concept called
> > "swap special pte" to persistent that information even for file-backed
> > memory.  That all goes for avoiding those false positives.
>
> Yes, I have seen this with soft dirty.  It really does not seem right.
> When you first create a mapping, even before faulting in anything the vma
> is marked VM_SOFTDIRTY and from the user's perspective all addresses/pages
> appear dirty.

Right, that seems suboptimal.  It is understandable, though: dirty info is
tolerant of false positives, so soft-dirty could simply sidestep the issue
that uffd-wp has to solve in this series.  It would be great to know
whether the current approach in this series would work for removing those
false positives.

> To be honest, I am not sure you want to try and carry
> per-process/per-mapping wp information in the file.

What this series does is try to persist that information in the pgtable
entries, rather than in the file (or page cache).  Frankly, I can't say
whether that's optimal either, so I'm always open to comments.  So far I
think it's a valid solution, but it's always possible that I missed
something important.

> In the comment about soft dirty above, it seems reasonable that unmapping
> would clear all soft dirty information.  Also, unmapping would clear any
> uffd state/information.

Right, unmap should always mean "drop all information in the ptes".  It's
the patch below where we tried to treat things differently:

https://github.com/xzpeter/linux/commit/e958e9ee8d33e9a6602f93cdbe24a0c3614ab5e2

A quick summary of the patch: only when we unmap or truncate the hugetlbfs
file do we call hugetlb_vmdelete_list() with ZAP_FLAG_DROP_FILE_UFFD_WP
(which means we drop all the information, including the uffd-wp bit).
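
For completeness, below is a minimal userspace sketch of the uffd-wp usage
that these zap paths must not break.  The UFFDIO_* ioctls exist today for
anonymous memory (since 5.7); applying them to a hugetlbfs mapping assumes
this series, and "/dev/hugepages/test" assumes a default hugetlbfs mount.
Error handling and the thread that reads wp fault messages are omitted:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

#define LEN (2UL << 20)			/* one 2M huge page */

int main(void)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	int fd = open("/dev/hugepages/test", O_CREAT | O_RDWR, 0600);

	ftruncate(fd, LEN);
	char *addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd, 0);

	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	/* Register the mapping for write-protect tracking */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = LEN },
		.mode  = UFFDIO_REGISTER_MODE_WP,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	memset(addr, 1, LEN);		/* fault the page in and dirty it */

	/* Wr-protect the range: from now on a write should deliver a
	 * uffd-wp message instead of silently succeeding.  A pgtable zap
	 * that loses the wp bit (pmd unshare, hole punch) breaks exactly
	 * this expectation. */
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = LEN },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	return 0;
}

Thanks,

-- 
Peter Xu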