Re: [patch 019/154] mm: make madvise(MADV_WILLNEED) support swap file prefetch

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Dec 20, 2013 at 03:36:19PM +0200, Kirill A. Shutemov wrote:
> Kirill A. Shutemov wrote:
> > Andrea Arcangeli wrote:
> > > Hi,
> > > 
> > > On Mon, Dec 16, 2013 at 10:18:39AM -0500, Sasha Levin wrote:
> > > > On 12/16/2013 07:47 AM, Kirill A. Shutemov wrote:
> > > > > I probably miss some context here. Do you have crash on some use-case or
> > > > > what? Could you point me to start of discussion.
> > > > 
> > > > Yes, Sorry, here's the crash that started this discussion originally:
> > > > 
> > > > The code points to:
> > > > 
> > > 
> > > At this point pmd_none_or_trans_huge_or_clear_bad guaranteed us the
> > > pmd points to a regular pte.
> > 
> > It took too long, but I finally found a way to reproduce the bug easily:
> > 
> > 	#define _GNU_SOURCE
> > 	#include <sys/mman.h>
> > 
> > 	#define MB (1024 * 1024)
> > 
> > 	int main(int argc, char **argv)
> > 	{
> > 		void *p;
> > 
> > 		p = mmap(0, 10 * MB, PROT_READ,
> > 				MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
> > 				-1, 0);
> > 		mprotect(p, 10 * MB, PROT_NONE);
> > 		madvise(p, 10 * MB, MADV_WILLNEED);
> > 		return 0;
> > 	}
> > 
> > And I track it down to pmd_none_or_trans_huge_or_clear_bad().
> > 
> > It seems it doesn't guarantee to return 1 for pmd_trans_huge() page and I
> > don't know how it suppose to do this for non-bad page.
> > 
> > I've fixed this with patch below.
> > 
> > Andrea, do I miss something important here or
> > pmd_none_or_trans_huge_or_clear_bad() is broken from day 1?
> 
> [ resend with fixed mail headers. ]
> 
> Oh.. It seems it cased by change pmd_bad() behaviour by commit be3a728427a6, so
> pmd_none_or_trans_huge_or_clear_bad() misses THP pmds if they are pmd_numa().
> 
> Other way to get it work is below. I'm not sure which is more correct (if any).

Yes sorry, it was the addition to the pte_numa that caused this
problem and it can only happen if booted on real NUMA hardware as
otherwise the _PAGE_NUMA wouldn't be armed.

This is further confirmed by the:

CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
CONFIG_NUMA_BALANCING=y

This is why it went unnoticed as it cannot happen on non NUMA hardware
no matter what.

Clearly the lockptr was zero beacuse the pmd was actually pointing to
a hugepage, not to a pte, otherwise the pmd_page(*pmd) wouldn't have
been allocated in huge_memory.c cow.

> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index f330d28e4d0e..1f8bc7881bdb 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -558,6 +558,14 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>  }
>  #endif
>  
> +#ifndef pmd_numa
> +static inline int pmd_numa(pmd_t pmd)
> +{
> +	return (pmd_flags(pmd) &
> +		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
> +}
> +#endif
> +
>  /*
>   * This function is meant to be used by sites walking pagetables with
>   * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
> @@ -601,7 +609,7 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>  #endif
>  	if (pmd_none(pmdval))
>  		return 1;
> -	if (unlikely(pmd_bad(pmdval))) {
> +	if (unlikely(pmd_bad(pmdval) || pmd_numa(pmdval))) {
>  		if (!pmd_trans_huge(pmdval))
>  			pmd_clear_bad(pmd);
>  		return 1;
> @@ -650,14 +658,6 @@ static inline int pte_numa(pte_t pte)
>  }
>  #endif
>  
> -#ifndef pmd_numa
> -static inline int pmd_numa(pmd_t pmd)
> -{
> -	return (pmd_flags(pmd) &
> -		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
> -}
> -#endif
> -

If we should do the below I would suggest Mel to decide.

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 3d19994..d70235b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -532,8 +532,10 @@ static inline int pmd_bad(pmd_t pmd)
 {
 #ifdef CONFIG_NUMA_BALANCING
 	/* pmd_numa check */
-	if ((pmd_flags(pmd) & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)
-		return 0;
+	if ((pmd_flags(pmd) & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA) {
+		pmd_clear_flags(pmd, _PAGE_NUMA);
+		pmd_set_flags(pmd, _PAGE_PRESENT);
+	}
 #endif
 	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
 }


The reason for not doing this was that it felt slow to have that kind
of mangling in a fast path bad check. But maybe it's no problem to do
it. The above should also avoid the bug.

However I would suggest the below, it was overkill strict to depend on
pmd_bad to fail, and overall it made it weaker to depend on debug
checks that may change in the future and that can provide false
negatives (they only should never provide false positives). No need to
check pmd_trans_huge inside the branch, the pmdval is stable there.

If in a later patch I'd suggest you to cleanup the barrier with a
ACCESS_ONCE (ACCESS_ONCE conditional only to
CONFIG_TRANSPARENT_HUGEPAGE, no need of it if that's not configured
in) that could give gcc more room for optimizations in this function.

If CONFIG_TRANSPARENT_HUGEPAGE is not set, if pmd_none fails, the pmd
cannot change at all.

Again for the pmd_bad change above it's up to you... it's purely a
performance/reliability tradeoff.

>From 238436f53937ed0a6821aa380b85366e4f6ad166 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Date: Fri, 20 Dec 2013 18:24:49 +0100
Subject: [PATCH] mm: thp: don't depend on pmd_bad to fail on transhuge pmds

pmd_bad stopped failing on transhuge pmds if the pmd_numa is
armed. Don't depend on that invariant anymore as it has been broken.

pmd_bad doesn't actually check if the pmd is bad if the pmd_numa is
set. An alternative could be to teach it to mangle the local pmd to
convert it from pmd_numa to non-pmd_numa and then issue a real check
on it. However that would likely be slower, and keeping the
pmd_trans_huge check in pmd_none_or_trans_huge_or_clear_bad in an
unlikely pmd_bad branch was also not optimal.

Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
---
 include/asm-generic/pgtable.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f330d28..7c2c752 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -599,11 +599,10 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	barrier();
 #endif
-	if (pmd_none(pmdval))
+	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
 		return 1;
 	if (unlikely(pmd_bad(pmdval))) {
-		if (!pmd_trans_huge(pmdval))
-			pmd_clear_bad(pmd);
+		pmd_clear_bad(pmd);
 		return 1;
 	}
 	return 0;


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]