On Wed, Feb 19, 2020 at 01:57:35PM -0800, akpm@xxxxxxxxxxxxxxxxxxxx wrote:
>
> The patch titled
>      Subject: mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
> has been added to the -mm tree.  Its filename is
>      mm-numa-fix-bad-pmd-by-atomically-check-for-pmd_trans_huge-when-marking-page-tables-prot_numa.patch
>
> This patch should soon appear at
>      http://ozlabs.org/~akpm/mmots/broken-out/mm-numa-fix-bad-pmd-by-atomically-check-for-pmd_trans_huge-when-marking-page-tables-prot_numa.patch
> and later at
>      http://ozlabs.org/~akpm/mmotm/broken-out/mm-numa-fix-bad-pmd-by-atomically-check-for-pmd_trans_huge-when-marking-page-tables-prot_numa.patch
>
> Before you just go and hit "reply", please:
>    a) Consider who else should be cc'ed
>    b) Prefer to cc a suitable mailing list as well
>    c) Ideally: find the original patch on the mailing list and do a
>       reply-to-all to that, adding suitable additional cc's
>
> *** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
>
> The -mm tree is included into linux-next and is updated
> there every 3-4 working days
>
> ------------------------------------------------------
> From: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> Subject: mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
>
> : A user reported a bug against a distribution kernel while running a
> : proprietary workload described as "memory intensive that is not
> : swapping" that is expected to apply to mainline kernels.  The workload
> : is read/write/modifying ranges of memory and checking the contents.
> : They reported that within a few hours a bad PMD would be reported,
> : followed by memory corruption where expected data was all zeros.  A
> : partial report of the bad PMD looked like
> :
> :   [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
> :   [ 5195.341184] ------------[ cut here ]------------
> :   [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
> :   ....
> :   [ 5195.410033] Call Trace:
> :   [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
> :   [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
> :   [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
> :   [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
> :   [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
> :   [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
> :   [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
> :
> : Decoding revealed that the PMD was a valid prot_numa PMD and the bad
> : PMD was a false detection.  The bug does not trigger if automatic NUMA
> : balancing or transparent huge pages is disabled.
> :
> : The bug is due to a race in change_pmd_range between a pmd_trans_huge
> : check and a pmd_none_or_clear_bad check done without any locks held.
> : During the pmd_trans_huge check, a parallel protection update under
> : lock can have cleared the PMD and filled it with a prot_numa entry
> : between the transhuge check and the pmd_none_or_clear_bad check.
> :
> : While this could be fixed with heavy locking, it's only necessary to
> : make a copy of the PMD on the stack during change_pmd_range and avoid
> : races.  A new helper is created for this as the check is quite subtle
> : and the existing similar helper is not suitable.  This passed 154
> : hours of testing (the bug usually triggers between 20 minutes and 24
> : hours) without detecting bad PMDs or corruption.  A basic test of an
> : autonuma-intensive workload showed no significant change in behaviour.
>
> Although Mel withdrew the patch in the face of the LKML comment at
> https://lkml.org/lkml/2017/4/10/922, the aforementioned race window is
> still open, and we have reports of the Linpack test reporting bad
> residuals after the bad PMD warning is observed.  In addition to that,
> bad rss-counter and non-zero pgtables assertions are triggered on mm
> teardown for the task hitting the bad PMD.
>
>   host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
>   ....
>   host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
>   host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096
>
> The issue is observed on a v4.18-based distribution kernel, but the
> race window is expected to be applicable to mainline kernels, as well.
>
> Link: http://lkml.kernel.org/r/20200216191800.22423-1-aquini@xxxxxxxxxx
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Rafael Aquini <aquini@xxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: Vlastimil Babka <vbabka@xxxxxxx>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> ---
>
>  mm/mprotect.c |   38 ++++++++++++++++++++++++++++++++++++--
>  1 file changed, 36 insertions(+), 2 deletions(-)
>
> --- a/mm/mprotect.c~mm-numa-fix-bad-pmd-by-atomically-check-for-pmd_trans_huge-when-marking-page-tables-prot_numa
> +++ a/mm/mprotect.c
> @@ -161,6 +161,31 @@ static unsigned long change_pte_range(st
>          return pages;
>  }
>
> +/*
> + * Used when setting automatic NUMA hinting protection where it is
> + * critical that a numa hinting PMD is not confused with a bad PMD.
> + */
> +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> +{
> +        pmd_t pmdval = pmd_read_atomic(pmd);
> +
> +        /* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +        barrier();
> +#endif
> +
> +        if (pmd_none(pmdval))
> +                return 1;
> +        if (pmd_trans_huge(pmdval))
> +                return 0;
> +        if (unlikely(pmd_bad(pmdval))) {
> +                pmd_clear_bad(pmd);
> +                return 1;
> +        }
> +
> +        return 0;
> +}
> +
>  static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>                  pud_t *pud, unsigned long addr, unsigned long end,
>                  pgprot_t newprot, int dirty_accountable, int prot_numa)
> @@ -178,8 +203,17 @@ static inline unsigned long change_pmd_r
>                  unsigned long this_pages;
>
>                  next = pmd_addr_end(addr, end);
> -                if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> -                                && pmd_none_or_clear_bad(pmd))
> +
> +                /*
> +                 * Automatic NUMA balancing walks the tables with mmap_sem
> +                 * held for read. It's possible a parallel update to occur
> +                 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
> +                 * check leading to a false positive and clearing.
> +                 * Hence, it's ecessary to atomically read the PMD value
                                 ^^^^^^^^

Andrew,

Sorry I only noticed now, but there's a minor typo in the comment
above: it should read "necessary" instead. Do you need me to repost?

-- Rafael

> +                 * for all the checks.
> +                 */
> +                if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
> +                    pmd_none_or_clear_bad_unless_trans_huge(pmd))
>                          goto next;
>
>                  /* invoke the mmu notifier if the pmd is populated */
> _
>
> Patches currently in -mm which might be from mgorman@xxxxxxxxxxxxxxxxxxx are
>
> mm-numa-fix-bad-pmd-by-atomically-check-for-pmd_trans_huge-when-marking-page-tables-prot_numa.patch
>
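[Editor's note: to make the race window above concrete, here is a minimal
userspace analogy of the pre-patch check sequence. It is a sketch only,
not kernel code; the file name, flag value, and iteration count are
invented for illustration. One thread clears and refills a shared word
the way a parallel protection update clears and refills the PMD, while a
second thread performs two independent reads the way the old
"!pmd_trans_huge(*pmd) ... pmd_none_or_clear_bad(pmd)" sequence did, and
counts how often the second read observes a state the first read had
already ruled out.]

    /*
     * racy_pmd_demo.c - userspace analogy of the change_pmd_range() race.
     * Hypothetical illustration only; names and flag values are invented
     * and do not correspond to real kernel PMD encodings.
     *
     * Build: cc -O2 -pthread racy_pmd_demo.c -o racy_pmd_demo
     */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define ITERS     (1UL << 22)
    #define HUGE_FLAG 0x1UL          /* stands in for a huge/prot_numa entry */

    static _Atomic unsigned long pmd_slot; /* stands in for the shared *pmd */
    static unsigned long false_bad;        /* spurious "bad pmd" detections */

    /* Parallel protection update: clears the PMD, then refills it with a
     * prot_numa huge entry - exactly the window the bug needs. */
    static void *updater(void *arg)
    {
            (void)arg;
            for (unsigned long i = 0; i < ITERS; i++) {
                    atomic_store(&pmd_slot, 0);
                    atomic_store(&pmd_slot, HUGE_FLAG);
            }
            return NULL;
    }

    /* Pre-patch shape: two independent reads of the same slot. */
    static void *racy_walker(void *arg)
    {
            (void)arg;
            for (unsigned long i = 0; i < ITERS; i++) {
                    /* "pmd_trans_huge(*pmd)": first read. */
                    if (atomic_load(&pmd_slot) & HUGE_FLAG)
                            continue;       /* huge: handled elsewhere */
                    /* "pmd_none_or_clear_bad(pmd)": second, separate read.
                     * A huge entry is not a valid page-table pointer, so
                     * observing one here gets reported as a bad PMD even
                     * though the entry is perfectly valid. */
                    unsigned long v = atomic_load(&pmd_slot);
                    if (v != 0 && (v & HUGE_FLAG))
                            false_bad++;    /* false "bad pmd" */
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t u, w;

            pthread_create(&u, NULL, updater, NULL);
            pthread_create(&w, NULL, racy_walker, NULL);
            pthread_join(u, NULL);
            pthread_join(w, NULL);
            printf("spurious bad-PMD detections: %lu\n", false_bad);
            return 0;
    }

The patch closes the window the same way a fixed version of this sketch
would: take a single snapshot (pmd_read_atomic() in the kernel, one
atomic_load() here) and run every check - none, trans-huge, bad - against
that one value, so no concurrent update can slip in between the checks.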