On Tue, Mar 25, 2014 at 01:16:02PM -0700, Linus Torvalds wrote: > On Mon, Mar 24, 2014 at 8:31 AM, Steven Noonan <steven@xxxxxxxxxxxxxx> wrote: > > Vrabel's comments make me think that revisiting the elimination of the > > _PAGE_NUMA bit implementation would be a good idea... should I CC you > > on this discussion (not sure if you're subscribed to xen-devel, or if > > LKML is a better place for that discussion)? > > I detest the PAGE_NUMA games myself, but: > First of all, sorry for the slow response even by my standards. I was at LSF/MM and Collaboration all last week and it took up all my attention. Today is my first day back properly online and trawling through the inbox mess. > From: David Vrabel <david.vrabel@xxxxxxxxxx>: > > > > I really do not understand how you're supposed to distinguish between a > > PTE for a PROT_NONE page and one with _PAGE_NUMA -- they're identical. > > i.e., pte_numa() will return true for a PROT_NONE protected page which > > just seems wrong to me. > > The way to distinguish PAGE_NUMA from PROTNONE is *supposed* to be by > looking at the vma, and PROTNONE goes together with a vma with > PROT_NONE. That's what the comments in pgtable_types.h say. > This is the expectation. We did not want to even attempt tracking NUMA hints on a per-VMA basis because the fault handler would go to hell with the need to fixup vmas. > However, as far as I can tell, that is pure and utter bullshit. It's > true that generally handle_mm_fault() shouldn't be called for > PROT_NONE pages, since it will fail the protection checks. However, we > have FOLL_FORCE that overrides those protection checks for things like > ptrace etc. So people have tried to convince me that _PAGE_NUMA works, > but I'm really not at all sure they are right. > For FOLL_FORCE, we do not set FOLL_NUMA in this chunk here /* * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault * would be called on PROT_NONE ranges. We must never invoke * handle_mm_fault on PROT_NONE ranges or the NUMA hinting * page faults would unprotect the PROT_NONE ranges if * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd * bitflag. So to avoid that, don't set FOLL_NUMA if * FOLL_FORCE is set. */ if (!(gup_flags & FOLL_FORCE)) gup_flags |= FOLL_NUMA; Without FOLL_NUMA, we do not do "pmd_numa" checks because they cannot distinguish between a prot_none and pmd_numa as they use identical bits on x86. This is in follow_page_mask if ((flags & FOLL_NUMA) && pmd_numa(*pmd)) goto no_page_table; Without the checks FOLL_FORCE would screw up when it encountered a page protected for NUMA hinting faults. I recognise that it further muddies the waters on what _PAGE_NUMA actually means. A potential alternative would have been to have two pte bits -- _PAGE_NONE and an unused PTE bit (if there is one) that we'd call_PAGE_NUMA where a pmd_mknuma sets both. The _PAGE_NONE is what would cause a hinting fault but we'd use the second bit to distinguish between PROT_NONE and a NUMA hinting fault. I doubt the end result would be much cleaner though and it would be a mess. Another alternative is to simply not allow NUMA_BALANCING on Xen. It's not even clear what it means as the Xen NUMA topology may or may not correspond to underlying physical nodes. It's even less clear what happens if both guest and host use automatic balancing. > I fundamentally think that it was a horrible horrible disaster to make > _PAGE_NUMA alias onto _PAGE_PROTNONE. > We did not have much of a choice. We needed something that would trap a fault and _PAGE_PROTNONE is not available on all architectures. ppc64 reused _PAGE_COHERENT for example. > But I'm cc'ing the people who tried to convince me otherwise last time > around, to see if they can articulate this mess better now. > > The argument *seems* to be that if things are truly PROT_NONE, then > the page will never be touched by page faulting code (and as > mentioned, I think that argument is fundamentally broken), and if it's > PROT_NUMA then the page faulting code will magically do the right > thing. > This is essentially the argument with the addendum that follow_page is meant to avoid trying pmd_numa checks on FOLL_FORCE. > FURTHERMORE, the argument was that we can't just call things PROT_NONE > and just say that "those are the semantics", because on other > architectures PROT_NONE might mean/do something else. Or that the equivalent of _PAGE_PROTNONE did not exist and was implemented by some other means. > Which really > makes no sense either, because if the argument was that PROT_NONE > causes faults that can either be handled as faults (for PROT_NONE) or > as NUMA issues (for NUMA), then dammit, that argument should be > completely architecture-independent. > > But I gave up arguing with people. I will state (again) that I think > this is a f*cking mess, and saying that PROTNONE and NUMA are somehow > the exact same thing on x86 but not in general is bogus crap. And > saying that you can determine which it is from the vma is very > debatable too. > Ok, so how do you suggest that _PAGE_NUMA could have been implemented that did *not* use _PAGE_PROTNONE on x86, trapped a fault and was not expensive as hell to handle? -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>