On Tue, Jul 02, 2019 at 09:07:28AM +0530, Anshuman Khandual wrote: > On 06/28/2019 03:50 PM, Catalin Marinas wrote: > > On Thu, Jun 27, 2019 at 06:18:15PM +0530, Anshuman Khandual wrote: > >> pmd_present() and pmd_trans_huge() are expected to behave in the following > >> manner during various phases of a given PMD. It is derived from a previous > >> detailed discussion on this topic [1] and present THP documentation [2]. > >> > >> pmd_present(pmd): > >> > >> - Returns true if pmd refers to system RAM with a valid pmd_page(pmd) > >> - Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd) > >> > >> pmd_trans_huge(pmd): > >> > >> - Returns true if pmd refers to system RAM and is a trans huge mapping [...] > > Before we actually start fixing this, I would strongly suggest that you > > add a boot selftest (see lib/Kconfig.debug for other similar cases) > > which checks the consistency of the page table macros w.r.t. the > > expected mm semantics. Once the mm maintainers agreed with the > > semantics, it will really help architecture maintainers in implementing > > them correctly. > > Sure and it will help all architectures to be in sync wrt semantics. > > > You wouldn't need actual page tables, just things like assertions on > > pmd_trans_huge(pmd_mkhuge(pmd)) == true. You could go further and have > > checks on pmdp_invalidate(&dummy_vma, dummy_addr, &dummy_pmd) with the > > dummy_* variables on the stack. > > Hmm. I guess macros which operate directly on a page table entry will be > okay but the ones which check on specific states for VMA or MM might be > bit tricky. Try to emulate VMA/MM states while on stack ?. But sure, will > explore adding such a test. You can pretend that the page table is on the stack. See the _pmd variable in do_huge_pmd_wp_page_fallback() and __split_huge_zero_page_pmd(). Similarly, the vma and even the mm can be faked on the stack (see the arm64 tlb_flush()). > >> The problem: > >> > >> PMD is first invalidated with pmdp_invalidate() before it's splitting. This > >> invalidation clears PMD_SECT_VALID as below. > >> > >> PMD Split -> pmdp_invalidate() -> pmd_mknotpresent -> Clears PMD_SECT_VALID > >> > >> Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false > >> on the PMD entry. > > > > I think that's an inconsistency in the expected semantics here. Do you > > mean that pmd_present(pmd_mknotpresent(pmd)) should be true? If not, do [...] > pmd_present() and pmd_mknotpresent() are not exact inverse. I find this very confusing (not your fault, just the semantics expected by the core code). I can see that x86 is using _PAGE_PSE to make pmd_present(pmd_mknotpresent()) == true. However, for pud that's not the case (because it's not used for transhuge). I'd rather have this renamed to pmd_mknotvalid(). > In absence of a positive section mapping bit on arm64, PTE_SPECIAL is being set > temporarily to remember that it was a mapped PMD which got invalidated recently > but which still points to memory. Hence pmd_present() must evaluate true. I wonder if we can encode this safely for arm64 in the bottom two bits of a pmd : 0b00 - not valid, not present 0b10 - not valid, present, huge 0b01 - valid, present, huge 0b11 - valid, table (not huge) Do we ever call pmdp_invalidate() on a table entry? I don't think we do. So a pte_mknotvalid would set bit 1 and I think swp_entry_to_pmd() would have to clear it so that pmd_present() actually returns false for a swp pmd entry. > > we need to implement our own pmdp_invalidate() or change the generic one > > to set a "special" bit instead of just a pmd_mknotpresent? > > Though arm64 can subscribe __HAVE_ARCH_PMDP_INVALIDATE and implement it's own > pmdp_invalidate() in order to not call pmd_mknotpresent() and instead operate > on the invalid and special bits directly. But its not going to alter relevant > semantics here. AFAICS it might be bit better as it saves pmd_mknotpresent() > from putting in that special bit in there which it is not supposed do. > > IFAICS there is no compelling reason for generic pmdp_invalidate() to change > either. It calls pmd_mknotpresent() which invalidates the entry through valid > or present bit and platforms which have dedicated huge page bit can still test > positive for pmd_present() after it's invalidation. It works for such platforms. > Platform specific override is required when invalidation via pmd_mknotpresent() > is not enough. I'd really like the mknotpresent to be renamed to mknotvalid and then we can keep pmdp_invalidate unchanged (well, calling mknotvalid instead). -- Catalin