On 27 Apr 2024, at 16:45, Zi Yan wrote:

> On 27 Apr 2024, at 15:11, John Hubbard wrote:
>
>> On 4/27/24 8:14 AM, Zi Yan wrote:
>>> On 27 Apr 2024, at 0:41, John Hubbard wrote:
>>>> On 4/25/24 10:07 AM, Ryan Roberts wrote:
>>>>> __split_huge_pmd_locked() can be called for a present THP, devmap or
>>>>> (non-present) migration entry. It calls pmdp_invalidate()
>>>>> unconditionally on the pmdp and only determines if it is present or not
>>>>> based on the returned old pmd. This is a problem for the migration entry
>>>>> case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be
>>>>> called for a present pmd.
>>>>>
>>>>> On arm64 at least, pmd_mkinvalid() will mark the pmd such that any
>>>>> future call to pmd_present() will return true. And therefore any
>>>>> lockless pgtable walker could see the migration entry pmd in this state
>>>>> and start interpreting the fields as if it were present, leading to
>>>>> BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.
>>>>> I suspect the same is possible on other architectures.
>>>>>
>>>>> Fix this by only calling pmdp_invalidate() for a present pmd. And for
>>>>
>>>> Yes, this seems like a good design decision (after reading through the
>>>> discussion that you all had in the other threads).
>>>
>>> This will only be good for arm64, and it does not prevent other arch
>>> developers from writing code that breaks arm64, since only arm64's
>>> pmd_mkinvalid() can turn a swap entry into a pmd_present() entry.
>>
>> Well, let's characterize it in a bit more detail, then:
>>
>> 1) This patch will make things better for arm64. That's important!
>>
>> 2) Equally important, this patch does not make anything worse for
>>    other CPU arches.
>>
>> 3) This patch represents a new design constraint on the CPU arch
>>    layer, and thus requires documentation and whatever enforcement
>>    we can provide, in order to keep future code out of trouble.
>>
>> 3.a) See the VM_WARN_ON() hunks below.
>>
>> 3.b) I like the new design constraint, because it is reasonable and
>>      clearly understandable: don't invalidate a non-present page
>>      table entry.
>>
>> I do wonder if there is somewhere else that this should be documented?

In terms of documentation, at least the pmd_mkinvalid() entry in
Documentation/mm/arch_pgtable_helpers.rst needs to gain something like
"do not call on an invalid (non-present) entry, as it breaks arm64".

> The issue is pmd_mkinvalid(), since it turns a swap entry into a
> pmd_present() entry on arm64. This patch only adds a warning in
> pmdp_invalidate(), even though pmdp_invalidate() is merely the current
> sole caller of pmd_mkinvalid(). This means any future user of
> pmd_mkinvalid() could cause the same issue on arm64 without any warning.
>
> I am not against changing the logic in __split_huge_pmd_locked() to fix
> arm64, but I just want to prevent future errors that might only be
> possible on arm64.
>
> BTW, in terms of the patch itself, moving "pmdp_invalidate(vma, haddr, pmd)"
> without moving the big comment above it is not OK, since later no one will
> be able to figure out why that comment is there.
>
>> thanks,
>> --
>> John Hubbard
>> NVIDIA
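To make the hazard under discussion concrete, the toy program below models
the encoding problem with invented bit positions and invented helper bodies;
it is deliberately not the real arm64 code. It shows how applying
pmd_mkinvalid() to a non-present migration entry can flip pmd_present() to
true for a concurrent lockless walker:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t pmd_t;

#define PMD_VALID               (1ULL << 0)   /* hypothetical hardware valid bit */
#define PMD_PRESENT_INVALID     (1ULL << 59)  /* hypothetical sw "present but invalid" bit */

/*
 * pmd_present() must keep returning true for a THP that has been
 * temporarily invalidated for splitting, so it accepts either bit.
 */
static int pmd_present(pmd_t pmd)
{
        return (pmd & (PMD_VALID | PMD_PRESENT_INVALID)) != 0;
}

/* Only correct for a *present* pmd: clear valid, mark present-invalid. */
static pmd_t pmd_mkinvalid(pmd_t pmd)
{
        return (pmd & ~PMD_VALID) | PMD_PRESENT_INVALID;
}

int main(void)
{
        /* A non-present migration entry: valid bit clear, payload in the middle bits. */
        pmd_t migration_entry = 0xabcd00ULL;

        printf("before: pmd_present() == %d\n", pmd_present(migration_entry)); /* 0 */

        /* What the unconditional pmdp_invalidate() call effectively did: */
        pmd_t broken = pmd_mkinvalid(migration_entry);

        /*
         * A lockless walker such as GUP-fast that loads the pmd now sees
         * pmd_present() == 1 and interprets the swap-entry payload bits
         * as if they were a real pfn and permission bits.
         */
        printf("after:  pmd_present() == %d\n", pmd_present(broken)); /* 1 */
        return 0;
}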
>>>>> good measure let's add a warning to the generic implementation of
>>>>> pmdp_invalidate(). I've manually reviewed all other
>>>>> pmdp_invalidate[_ad]() call sites and believe all others to be
>>>>> conformant.
>>>>>
>>>>> This is a theoretical bug found during code review. I don't have any
>>>>> test case to trigger it in practice.
>>>>>
>>>>> Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>>> ---
>>>>>
>>>>> Applies on top of v6.9-rc5. Passes all the mm selftests on arm64.
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>  mm/huge_memory.c     | 5 +++--
>>>>>  mm/pgtable-generic.c | 2 ++
>>>>>  2 files changed, 5 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index 89f58c7603b2..80939ad00718 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -2513,12 +2513,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>  	 * for this pmd), then we flush the SMP TLB and finally we write the
>>>>>  	 * non-huge version of the pmd entry with pmd_populate.
>>>>>  	 */
>>>>> -	old_pmd = pmdp_invalidate(vma, haddr, pmd);
>>>>>
>>>>> -	pmd_migration = is_pmd_migration_entry(old_pmd);
>>>>> +	pmd_migration = is_pmd_migration_entry(*pmd);
>>>>>  	if (unlikely(pmd_migration)) {
>>>>>  		swp_entry_t entry;
>>>>>
>>>>> +		old_pmd = *pmd;
>>>>>  		entry = pmd_to_swp_entry(old_pmd);
>>>>>  		page = pfn_swap_entry_to_page(entry);
>>>>>  		write = is_writable_migration_entry(entry);
>>>>> @@ -2529,6 +2529,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>  		soft_dirty = pmd_swp_soft_dirty(old_pmd);
>>>>>  		uffd_wp = pmd_swp_uffd_wp(old_pmd);
>>>>>  	} else {
>>>>> +		old_pmd = pmdp_invalidate(vma, haddr, pmd);
>>>>
>>>> This looks good, except that now I am deeply confused about the pre-existing
>>>> logic. I thought that migration entries were a subset of swap entries,
>>>> but this code seems to be treating is_pmd_migration_entry() as a
>>>> synonym for "is a swap entry". Can you shed any light on this for me?
>>>
>>> It is likely because the kernel only splits present pmds and migration
>>> pmds, but I could be wrong since the code has changed a lot since the
>>> splitting of migration pmds was added. We either need to check all call
>>> sites, or check pmd_present() instead of is_pmd_migration_entry() and
>>> handle all possible situations.
>>>
>>>>
>>>>>  		page = pmd_page(old_pmd);
>>>>>  		folio = page_folio(page);
>>>>>  		if (pmd_dirty(old_pmd)) {
>>>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>>>> index 4fcd959dcc4d..74e34ea90656 100644
>>>>> --- a/mm/pgtable-generic.c
>>>>> +++ b/mm/pgtable-generic.c
>>>>> @@ -198,6 +198,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>>>>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>>>>  		     pmd_t *pmdp)
>>>>>  {
>>>>> +	VM_WARN_ON(!pmd_present(*pmdp));
>>>>>  	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
>>>>>  	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>>>>>  	return old;
>>>>> @@ -208,6 +209,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>>>>  pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
>>>>>  		     pmd_t *pmdp)
>>>>>  {
>>>>> +	VM_WARN_ON(!pmd_present(*pmdp));
>>>>
>>>> Should these be VM_WARN_ON_ONCE(), instead?
>>>>
>>>> Also, this seems like a good place to put a little comment in, to mark the
>>>> new design constraint. Something like "Only present entries are allowed
>>>> to be invalidated", perhaps.
>>>>
>>>>>  	return pmdp_invalidate(vma, address, pmdp);
>>>>>  }
>>>>>  #endif
>>>>> --
>>>>> 2.25.1
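On the migration-entry question above: migration entries are indeed a subset
of swap entries. The pmd-level helpers look roughly like this in recent
kernels (paraphrased from include/linux/huge_mm.h and
include/linux/swapops.h; check the exact tree for the real definitions):

/* A "swap pmd" is any pmd that is neither none nor present. */
static inline int is_swap_pmd(pmd_t pmd)
{
        return !pmd_none(pmd) && !pmd_present(pmd);
}

/*
 * A migration entry is the swap-entry subtype installed while a
 * page is being migrated, so it is one particular kind of swap pmd.
 */
static inline int is_pmd_migration_entry(pmd_t pmd)
{
        return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
}

So treating is_pmd_migration_entry() as a stand-in for "not present" is safe
only for as long as migration entries are the sole non-present pmds that
__split_huge_pmd_locked() can encounter, which is the point of Zi Yan's
caveat about checking all call sites.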
>>>> thanks,
>>>> --
>>>> John Hubbard
>>>> NVIDIA
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>
> --
> Best Regards,
> Yan, Zi

--
Best Regards,
Yan, Zi
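For reference, folding John's two review suggestions (warn only once, and
document the constraint in a comment) into the generic helper would give
something like the sketch below. This is an illustration of the proposed
constraint only, not the patch as posted, which uses plain VM_WARN_ON() and
no comment:

pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
		      pmd_t *pmdp)
{
	pmd_t old;

	/*
	 * Only present entries are allowed to be invalidated:
	 * pmd_mkinvalid() on a non-present entry (e.g. a migration
	 * entry) can make it appear present to lockless pgtable
	 * walkers such as GUP-fast.
	 */
	VM_WARN_ON_ONCE(!pmd_present(*pmdp));

	old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	return old;
}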