On Fri, Jul 07, 2023 at 08:54:33PM +0200, David Hildenbrand wrote:
> On 07.07.23 19:26, Matthew Wilcox wrote:
> > On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote:
> > > This series identified the large folio for mlock to two types:
> > >    - The large folio is in VM_LOCKED VMA range
> > >    - The large folio cross VM_LOCKED VMA boundary
> >
> > This is somewhere that I think our fixation on MUST USE PMD ENTRIES
> > has led us astray.  Today when the arguments to mlock() cross a folio
> > boundary, we split the PMD entry but leave the folio intact.  That means
> > that we continue to manage the folio as a single entry on the LRU list.
> > But userspace may have no idea that we're doing this.  It may have made
> > several calls to mmap() 256kB at once, they've all been coalesced into
> > a single VMA and khugepaged has come along behind its back and created
> > a 2MB THP.  Now userspace calls mlock() and instead of treating that as
> > a hint that oops, maybe we shouldn't've done that, we do our utmost to
> > preserve the 2MB folio.
> >
> > I think this whole approach needs rethinking.  IMO, anonymous folios
> > should not cross VMA boundaries.  Tell me why I'm wrong.
>
> I think we touched upon that a couple of times already, and the main issue
> is that while it sounds nice in theory, it's impossible in practice.
>
> THP are supposed to be transparent, that is, we should not let arbitrary
> operations fail.
>
> But nothing stops user space from
>
> (a) mmap'ing a 2 MiB region
> (b) GUP-pinning the whole range
> (c) GUP-pinning the first half
> (d) unpinning the whole range from (a)
> (e) munmap'ing the second half
>
>
> And that's just one out of many examples I can think of, not even
> considering temporary/speculative references that can prevent a split at
> random points in time -- especially when splitting a VMA.
>
> Sure, any time we PTE-map a THP we might just say "let's put that on the
> deferred split queue" and cross fingers that we can eventually split it
> later. (I was recently thinking about that in the context of the mapcount
> ...)
>
> It's all a big mess ...

Oh, I agree, there are always going to be circumstances where we realise
we've made a bad decision and can't (easily) undo it.  Unless we have a
per-page pincount, and I Would Rather Not Do That.  But we should _try_
to do that because it's the right model -- that's what I meant by "Tell
me why I'm wrong"; what scenarios do we have where a user temporarily
mlocks (or mprotects or ...) a range of memory, but wants that memory to
be aged in the LRU exactly the same way as the adjacent memory that
wasn't mprotected?

GUP-pinning is different, and I don't think GUP-pinning should split
a folio.  That's a temporary use (not FOLL_LONGTERM), eg, we're doing
tcp zero-copy or it's the source/target of O_DIRECT.  That's not an
instruction that this memory is different from its neighbours.

Maybe we end up deciding to split folios on GUP-pin.  That would be
regrettable.
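
For reference, here is a minimal userspace sketch of the mlock() case discussed above: an anonymous range that may be backed by a 2MB THP, of which only the first part is mlock()ed, so the VM_LOCKED VMA boundary falls inside the range. The function name partial_mlock_demo() and the THP_SIZE constant (a PMD-sized THP of 2 MiB, as on x86-64) are assumptions made for illustration; they do not come from the series. Whether the kernel actually backs the range with a large folio depends on the THP configuration, so this only sets up the conditions and does not demonstrate any particular kernel behaviour.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE (2UL * 1024 * 1024)  /* assumed PMD-sized THP (x86-64) */

/*
 * Hypothetical helper: map an anonymous region, hint that it should be
 * backed by a THP, then mlock() only the first lock_len bytes of it.
 *
 * Returns 0 if the partial mlock() succeeded, a positive errno value if
 * mlock() failed (e.g. because of RLIMIT_MEMLOCK), or -1 if the mapping
 * itself could not be created.
 */
int partial_mlock_demo(size_t lock_len)
{
    /* Over-allocate so that a 2 MiB-aligned window fits in the mapping. */
    size_t map_len = 2 * THP_SIZE;
    char *base = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Round up to the next 2 MiB boundary inside the mapping. */
    char *thp = (char *)(((uintptr_t)base + THP_SIZE - 1) & ~(THP_SIZE - 1));
    assert(thp >= base && thp + THP_SIZE <= base + map_len);

    /* Only a hint; the kernel may or may not use a large folio here. */
    (void)madvise(thp, THP_SIZE, MADV_HUGEPAGE);
    memset(thp, 0, THP_SIZE);  /* fault the whole range in */

    /* Lock less than the full 2 MiB: the locked VMA ends mid-range. */
    int ret = 0;
    if (mlock(thp, lock_len) != 0)
        ret = errno;

    munmap(base, map_len);  /* unmapping also drops the lock */
    return ret;
}
```

Calling partial_mlock_demo(256 * 1024) mirrors the 256kB granularity mentioned above; a smaller length such as one page is less likely to run into RLIMIT_MEMLOCK.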