On Thu, 17 Feb 2022, Matthew Wilcox wrote:
> On Wed, Feb 16, 2022 at 05:25:17PM -0800, Hugh Dickins wrote:
> > On Wed, 16 Feb 2022, Mike Kravetz wrote:
> > > On 2/14/22 23:37, cgel.zte@xxxxxxxxx wrote:
...
> > > > @@ -39,8 +40,12 @@ static void memfd_tag_pins(struct xa_state *xas)
> > > >  	xas_for_each(xas, page, ULONG_MAX) {
> > > >  		if (xa_is_value(page))
> > > >  			continue;
> > > > +
> > > >  		page = find_subpage(page, xas->xa_index);
> > > > -		if (page_count(page) - page_mapcount(page) > 1)
> > > > +		count = page_count(page);
> > > > +		if (PageTransCompound(page))
> > >
> > > PageTransCompound() is true for hugetlb pages as well as THP.  And, hugetlb
> > > pages will not have a ref per subpage as THP does.  So, I believe this will
> > > break hugetlb seal usage.
> >
> > Yes, I think so too; and that is not the only issue with the patch
> > (I don't think page_mapcount is enough, I had to use total_mapcount).

Mike, we had the same instinctive reaction to seeing a PageTransCompound
check in code also exposed to PageHuge pages; but in fact that seems to
have worked correctly - those hugetlbfs pages are hard to predict!
But it was not working on pte maps of THPs.

> > It's a good find, and thank you WangYong for the report.
> > I found the same issue when testing my MFD_HUGEPAGE patch last year,
> > and devised a patch to fix it (and keep MFD_HUGETLB working) then; but
> > never sent that in because there wasn't time to re-present MFD_HUGEPAGE.
> >
> > I'm currently retesting my patch: just found something failing which
> > I thought should pass; but maybe I'm confused, or maybe the xarray is
> > working differently now.  I'm rushing to reply now because I don't want
> > others to waste their own time on it.
>
> I did change how the XArray works for THP recently.
>
> Kirill's original patch stored:
>
> 512: p
> 513: p+1
> 514: p+2
> ...
> 1023: p+511
>
> A couple of years ago, I changed it to store:
>
> 512: p
> 513: p
> 514: p
> ...
> 1023: p
>
> And in January, Linus merged the commit which changes it to:
>
> 512-575: p
> 576-639: (sibling of 512)
> 640-703: (sibling of 512)
> ...
> 960-1023: (sibling of 512)
>
> That is, I removed a level of the tree and store sibling entries
> rather than duplicate entries.  That wasn't for fun; I needed to do
> that in order to make msync() work with large folios.  Commit
> 6b24ca4a1a8d has more detail and hopefully can inspire whatever
> changes you need to make to your patch.

Matthew, thanks for the very detailed info, you shouldn't have taken
so much trouble over it: I knew you had done something of that kind,
and yes, that's where my suspicion lay at the time of writing.  But
you'll be relieved to know that the patch I wrote before your changes
turned out to be unaffected, and just as valid after your changes.

"just found something failing which I thought should pass" was me
forgetting, again and again, just how limited are the allowed
possibilities for F_SEAL_WRITE when mmaps are outstanding.  One thinks
that a PROT_READ, MAP_SHARED mapping would be allowed; but of course
all the memfds are automatically O_RDWR, so mprotect (which has no
sealing hook) allows it to be changed to PROT_READ|PROT_WRITE, so
F_SEAL_WRITE is forbidden on any MAP_SHARED mapping: only allowed
on MAP_PRIVATEs.

I'll now re-read the commit message I wrote before, update if
necessary, and then send to Andrew, asking him to replace the one
in this thread.

Hugh
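
P.S. For anyone wanting to see that MAP_SHARED restriction in action,
here is a minimal userspace sketch (a hypothetical demo, not part of
the patch under discussion; it assumes the EBUSY return that
memfd_create(2) documents for sealing against existing shared
writable mappings):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 4096;
	/* memfds are always opened O_RDWR, whether or not we ever write */
	int fd = memfd_create("seal-demo", MFD_ALLOW_SEALING);

	ftruncate(fd, len);

	/*
	 * PROT_READ looks harmless, but MAP_SHARED on an O_RDWR fd
	 * could later be mprotect'ed to PROT_READ|PROT_WRITE (mprotect
	 * has no sealing hook), so F_SEAL_WRITE must be refused while
	 * this mapping exists.
	 */
	void *shr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

	if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) < 0)
		printf("MAP_SHARED:  sealing refused: %s\n", strerror(errno));
	munmap(shr, len);

	/* a MAP_PRIVATE mapping cannot write back, so sealing succeeds */
	void *prv = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);

	if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) == 0)
		printf("MAP_PRIVATE: sealed OK\n");
	munmap(prv, len);

	close(fd);
	return 0;
}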