Re: [PATCH 0/2] Move away from non-failing small allocations

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Thu, 2 Apr 2015 20:53:05 +0900

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > We are seeing issues with the fs code now because the test cases which
> > led to the current discussion exercise FS code. The code which does
> > lock(); kmalloc(GFP_KERNEL) is not reduced there though. I am pretty sure
> > we can find other subsystems if we try hard enough.
> 
> I'm expecting for patches which avoids deadlock by lock(); kmalloc(GFP_KERNEL).
> 
> > > static inline void enter_fs_code(struct super_block *sb)
> > > {
> > > 	if (sb->my_small_allocations_can_fail)
> > > 		current->small_allocations_can_fail++;
> > > }
> > > 
> > > that way (or something similar) we can select the behaviour on a per-fs
> > > basis and the rest of the kernel remains unaffected.  Other subsystems
> > > can opt in as well.
> > 
> > This is basically leading to GFP_MAYFAIL which is completely backwards
> > (the hard requirement should be an exception not a default rule).
> > I really do not want to end up with stuffing random may_fail annotations
> > all over the kernel.
> > 
> 
> I wish that GFP_NOFS / GFP_NOIO regions are annotated with
> 
>   static inline void enter_fs_code(void)
>   {
>   #ifdef CONFIG_DEBUG_GFP_FLAGS
>   	current->in_fs_code++;
>   #endif
>   }
> 
>   static inline void leave_fs_code(void)
>   {
>   #ifdef CONFIG_DEBUG_GFP_FLAGS
>   	current->in_fs_code--;
>   #endif
>   }
> 
>   static inline void enter_io_code(void)
>   {
>   #ifdef CONFIG_DEBUG_GFP_FLAGS
>   	current->in_io_code++;
>   #endif
>   }
> 
>   static inline void leave_io_code(void)
>   {
>   #ifdef CONFIG_DEBUG_GFP_FLAGS
>   	current->in_io_code--;
>   #endif
>   }
> 
> so that inappropriate GFP_KERNEL usage inside GFP_NOFS region are catchable
> by doing
> 
>   struct page *
>   __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>                           struct zonelist *zonelist, nodemask_t *nodemask)
>   {
>   	struct zoneref *preferred_zoneref;
>   	struct page *page = NULL;
>   	unsigned int cpuset_mems_cookie;
>   	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
>   	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
>   	struct alloc_context ac = {
>   		.high_zoneidx = gfp_zone(gfp_mask),
>   		.nodemask = nodemask,
>   		.migratetype = gfpflags_to_migratetype(gfp_mask),
>   	};
>   	
>   	gfp_mask &= gfp_allowed_mask;
>  +#ifdef CONFIG_DEBUG_GFP_FLAGS
>  +	WARN_ON(current->in_fs_code & (gfp_mask & __GFP_FS));
>  +	WARN_ON(current->in_io_code & (gfp_mask & __GFP_IO));
>  +#endif
>   
>   	lockdep_trace_alloc(gfp_mask);
>   
> 
> . It is difficult for non-fs developers to determine whether they need to use
> GFP_NOFS than GFP_KERNEL in their code. An example is seen at
> http://marc.info/?l=linux-security-module&m=138556479607024&w=2 .

I didn't receive responses about idea of annotating GFP_NOFS/GFP_NOIO
sections. Therefore, I wrote a patch shown below.

Dave and Michal, can we assume that a thread which initiated a
GFP_NOFS/GFP_NOIO section always terminates that section (as with
rcu_read_lock()/rcu_read_unlock()) ? If we can't assume it,
mask_gfp_flags() in below patch cannot work.
----------------------------------------
>From 0b18f21c9c9aef2353d355afc83b2a2193bbced7 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Date: Thu, 2 Apr 2015 16:47:47 +0900
Subject: [PATCH] Add macros for annotating GFP_NOFS/GFP_NOIO allocations.

While there are rules for when to use GFP_NOFS or GFP_NOIO, there are
locations where GFP_KERNEL is by error used. Such incorrect usage may
cause deadlock under memory pressure. But it is difficult for non-fs
developers to determine whether they need to use GFP_NOFS than GFP_KERNEL
in their code. Also, it is difficult to hit the deadlock caused by such
incorrect usage. Therefore, this patch introduces 4 macros for annotating
the duration where GFP_NOFS or GFP_NOIO needs to be used.

  enter_fs_code() is for annotating that allocations with __GFP_FS are
  not permitted until corresponding leave_fs_code() is called.

  leave_fs_code() is for annotating that allocations with __GFP_FS are
  permitted from now on.

  enter_io_code() is for annotating that allocations with __GFP_IO are
  not permitted until corresponding leave_io_code() is called.

  leave_io_code() is for annotating that allocations with __GFP_IO are
  permitted from now on.

These macros are no-op unless CONFIG_DEBUG_GFP_FLAGS is defined.

Note that this patch does not insert these macros into GFP_NOFS/GFP_NOIO
users. Inserting these macros with appropriate comments are up to
GFP_NOFS/GFP_NOIO users who know the dependency well.

This patch also introduces 1 function which makes use of these macros
when CONFIG_DEBUG_GFP_FLAGS is defined.

  mask_gfp_flags() is for checking and printing warnings when __GFP_FS is
  used between enter_fs_code() and leave_fs_code() or __GFP_IO is used
  between enter_io_code() and leave_io_code().

  If GFP_KERNEL allocation is requested between enter_fs_code() and
  leave_fs_code(), a warning message

    GFP_FS allocation ($gfp_mask) is unsafe for this context

  is reported with a backtrace.

  If GFP_KERNEL or GFP_NOFS allocation is requested between enter_io_code()
  and leave_io_code(), a warning message

    GFP_IO allocation ($gfp_mask) is unsafe for this context

  is reported with a backtrace.

Note that currently mask_gfp_flags() cannot work unless gfp_allowed_mask
is involved. That is, mask_gfp_flags() cannot work if memory allocation
can be handled by e.g. returning partial memory from already allocated
pages. Handling such cases is deferred to future patches.

Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
---
 include/linux/gfp.h   | 17 +++++++++++++++++
 include/linux/sched.h |  4 ++++
 mm/Kconfig.debug      |  7 +++++++
 mm/page_alloc.c       | 17 ++++++++++++++++-
 mm/slab.c             |  4 ++--
 mm/slob.c             |  4 ++--
 mm/slub.c             |  6 +++---
 7 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 51bd1e7..97654ff 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -421,4 +421,21 @@ extern void init_cma_reserved_pageblock(struct page *page);
 
 #endif
 
+#ifdef CONFIG_DEBUG_GFP_FLAGS
+#define enter_fs_code() { current->in_fs_code++; }
+#define leave_fs_code() { current->in_fs_code--; }
+#define enter_io_code() { current->in_io_code++; }
+#define leave_io_code() { current->in_io_code--; }
+extern gfp_t mask_gfp_flags(gfp_t gfp_mask);
+#else
+#define enter_fs_code() do { } while (0)
+#define leave_fs_code() do { } while (0)
+#define enter_io_code() do { } while (0)
+#define leave_io_code() do { } while (0)
+static inline gfp_t mask_gfp_flags(gfp_t gfp_mask)
+{
+	return gfp_mask & gfp_allowed_mask;
+}
+#endif
+
 #endif /* __LINUX_GFP_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a419b65..ebaf817 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1710,6 +1710,10 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long	task_state_change;
 #endif
+#ifdef CONFIG_DEBUG_GFP_FLAGS
+	unsigned char in_fs_code;
+	unsigned char in_io_code;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 957d3da..0f9ad6f 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -28,3 +28,10 @@ config DEBUG_PAGEALLOC
 
 config PAGE_POISONING
 	bool
+
+config DEBUG_GFP_FLAGS
+	bool "Check GFP flags passed to memory allocators"
+	depends on DEBUG_KERNEL
+	---help---
+	  Track and warn about incorrect use of __GFP_FS or __GFP_IO
+	  memory allocations.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40e2942..239b068 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2794,6 +2794,21 @@ got_pg:
 	return page;
 }
 
+#ifdef CONFIG_DEBUG_GFP_FLAGS
+gfp_t mask_gfp_flags(gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_WAIT) {
+		WARN((gfp_mask & __GFP_FS) && current->in_fs_code,
+		     "GFP_FS allocation (0x%x) is unsafe for this context\n",
+		     gfp_mask);
+		WARN((gfp_mask & __GFP_IO) && current->in_io_code,
+		     "GFP_IO allocation (0x%x) is unsafe for this context\n",
+		     gfp_mask);
+	}
+	return gfp_mask & gfp_allowed_mask;
+}
+#endif
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -2812,7 +2827,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		.migratetype = gfpflags_to_migratetype(gfp_mask),
 	};
 
-	gfp_mask &= gfp_allowed_mask;
+	gfp_mask = mask_gfp_flags(gfp_mask);
 
 	lockdep_trace_alloc(gfp_mask);
 
diff --git a/mm/slab.c b/mm/slab.c
index c4b89ea..8a73788 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3136,7 +3136,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	void *ptr;
 	int slab_node = numa_mem_id();
 
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 
 	lockdep_trace_alloc(flags);
 
@@ -3224,7 +3224,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 	unsigned long save_flags;
 	void *objp;
 
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 
 	lockdep_trace_alloc(flags);
 
diff --git a/mm/slob.c b/mm/slob.c
index 94a7fed..c731eb4 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -430,7 +430,7 @@ __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
 	int align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
 	void *ret;
 
-	gfp &= gfp_allowed_mask;
+	gfp = mask_gfp_flags(gfp);
 
 	lockdep_trace_alloc(gfp);
 
@@ -536,7 +536,7 @@ void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
 {
 	void *b;
 
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 
 	lockdep_trace_alloc(flags);
 
diff --git a/mm/slub.c b/mm/slub.c
index 82c4737..60fe0a9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1263,7 +1263,7 @@ static inline void kfree_hook(const void *x)
 static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 						     gfp_t flags)
 {
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 	lockdep_trace_alloc(flags);
 	might_sleep_if(flags & __GFP_WAIT);
 
@@ -1276,7 +1276,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 static inline void slab_post_alloc_hook(struct kmem_cache *s,
 					gfp_t flags, void *object)
 {
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 	kmemcheck_slab_alloc(s, flags, object, slab_ksize(s));
 	kmemleak_alloc_recursive(object, s->object_size, 1, s->flags, flags);
 	memcg_kmem_put_cache(s);
@@ -1339,7 +1339,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	struct kmem_cache_order_objects oo = s->oo;
 	gfp_t alloc_gfp;
 
-	flags &= gfp_allowed_mask;
+	flags = mask_gfp_flags(flags);
 
 	if (flags & __GFP_WAIT)
 		local_irq_enable();
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>