Re: PROBLEM: Memory leaking when running kubernetes cronjobs

Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx> · Mon, 8 Oct 2018 15:09:04 +0300

Hi Dan,

On Mon, Oct 08, 2018 at 11:35:12AM +0100, Daniel McGinnes wrote:
> Hi Mike,
> 
> good finds!
> 
> I let my run with IPv6 disabled run over the weekend and after ~4 days 
> there has been no increase in Percpu memory and MemAvailable has remained 
> flat - so seems fairly conclusive that IPv6 was causing the remaining 
> leak.
> 
> Will be interesting to see if that fix you found resolves the issue when 
> IPv6 is re-enabled.

It would be interesting to see what are the users of the percpu memory with
the patch I've sent earlier. It outputs info similar to /proc/vmallocinfo
into /sys/kernel/debug/percpu_users. 

I didn't have a chance to verify that it's completely stable under stress,
though.

> thanks,
> 
> Dan McGinnes
> 
> IBM Cloud - Containers performance
> 
> Int Tel: 247359        Ext Tel: 01962 817359
> 
> Notes: Daniel McGinnes/UK/IBM
> Email: MCGINNES@xxxxxxxxxx
> 
> IBM (UK) Ltd, Hursley Park,Winchester,Hampshire, SO21 2JN
> 
> 
> 
> From:   "Mike Rapoport" <rppt@xxxxxxxxxxxxxxxxxx>
> To:     "Roman Gushchin" <guro@xxxxxx>
> Cc:     Daniel McGinnes/UK/IBM@IBMGB, "cgroups@xxxxxxxxxxxxxxx" 
> <cgroups@xxxxxxxxxxxxxxx>, "Nathaniel Rockwell" <nrockwell@xxxxxxxxxx>
> Date:   08/10/2018 08:05
> Subject:        Re: PROBLEM: Memory leaking when running kubernetes 
> cronjobs
> 
> 
> 
> On Sat, Oct 06, 2018 at 12:42:37AM +0000, Roman Gushchin wrote:
> > Hi Daniel!
> > 
> > On Fri, Oct 05, 2018 at 10:16:25AM +0000, Daniel McGinnes wrote:
> > > Hi Roman,
> > > 
> > > memory pressure was started after 1 hour (Ran stress --vm 16 
> --vm-bytes 
> > > 1772864000 -t 300 for 5 minutes, then sleep for 5 mins in a continuous 
> 
> > > loop).
> > > 
> > > Machine has 16 cores & 32 GB RAM.
> > > 
> > > I think the issue I still have is that even though the per-cpu is able 
> to 
> > > be reused for other per-cpu allocations, my understanding is that it 
> will 
> > > not be available for general use by applications - so if percpu memory 
> 
> > > usage is growing continuously (which we still see happening pretty 
> slowly 
> > > - but over months it would be fairly significant) it means there will 
> be 
> > > less memory available for applications to use. Please let me know if 
> I've 
> > > mis-understood something here.
> > 
> > Well, yeah, not looking good.
> > 
> > > 
> > > After seeing several stacks in IPv6 in the memory leak output I ran a 
> test 
> > > with IPv6 disabled on the host. Interestingly after 24 hours the 
> Percpu 
> > > memory reported in meminfo seems to have flattened out, whereas with 
> IPv6 
> > > enabled it was still growing. MemAvailable is decreasing so slowly 
> that I 
> > > need to leave it longer to draw any conclusions from that.
> > 
> > Looks like there is a independent per-cpu memory leak somewhere in the 
> ipv6
> > stack. Not sure, of course, but if the number of dying cgroups is not 
> growing...
> 
> There is a leak in the percpu allocator itself, it never frees some of its
> metadata. I've sent the fix yesterday [1], I believe it will be merged in
> 4.19.
> 
> Also, there was a recent fix for a leak ipv6 [2].
> 
> I'm now trying to see the dynamics of the  percpu allocations, so I've
> added yet another debugfs interface for percpu (below) similar to
> /proc/vmallocinfo. I hope that by the end of the day I'll be able to see
> what is causing to increase in percpu memory.
>  
> > Id try to check that my memleak.py is actually capturing all allocs, 
> maybe
> > we're missing something...
> > 
> > Thanks!
> > 
> 
> [1] https://lkml.org/lkml/2018/10/7/84
> [2] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce7ea4af0838ffd4667ecad4eb5eec7a25342f1e
> 
> 
> 
> From 3979853b25b3cd1e0ba0d5242c2145f63afd1ad1 Mon Sep 17 00:00:00 2001
> From: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>
> Date: Sun, 7 Oct 2018 19:41:17 +0300
> Subject: [PATCH] percpu: stats: add info about memory users
> 
> Print summary of allocation count, size and callers, similar to
> /proc/vmallocinfo.
> 
> Signed-off-by: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>
> ---
>  mm/percpu-internal.h | 28 +++++++++++++++---
>  mm/percpu-stats.c    | 82 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/percpu.c          | 28 ++++++++++++++----
>  3 files changed, 128 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index b1739dc..b336ae4 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -21,10 +21,17 @@ struct pcpu_block_md {
>                  int                     first_free;     /* block position 
> of first free */
>  };
>  
> +struct pcpu_allocs_info {
> +                size_t size;
> +                int count;
> +                const void *caller;
> +};
> +
>  struct pcpu_chunk {
>  #ifdef CONFIG_PERCPU_STATS
>                  int                                             nr_alloc; 
>          /* # of allocations */
>                  size_t max_alloc_size; /* largest allocation size */
> +                struct pcpu_allocs_info *allocs;
>  #endif
>  
>                  struct list_head                list;  /* linked to 
> pcpu_slot lists */
> @@ -136,8 +143,11 @@ static inline void pcpu_stats_save_ai(const struct 
> pcpu_alloc_info *ai)
>   * CONTEXT:
>   * pcpu_lock.
>   */
> -static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t 
> size)
> +static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t 
> size,
> +   int off, const void *caller)
>  {
> +                int idx = off / PCPU_MIN_ALLOC_SIZE;
> +
>                  lockdep_assert_held(&pcpu_lock);
>  
>                  pcpu_stats.nr_alloc++;
> @@ -151,6 +161,10 @@ static inline void pcpu_stats_area_alloc(struct 
> pcpu_chunk *chunk, size_t size)
>  
>                  chunk->nr_alloc++;
>                  chunk->max_alloc_size = max(chunk->max_alloc_size, size);
> +
> +                chunk->allocs[idx].size = size;
> +                chunk->allocs[idx].caller = caller;
> +                chunk->allocs[idx].count = 1;
>  }
>  
>  /*
> @@ -160,14 +174,19 @@ static inline void pcpu_stats_area_alloc(struct 
> pcpu_chunk *chunk, size_t size)
>   * CONTEXT:
>   * pcpu_lock.
>   */
> -static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk)
> +static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk, int 
> off)
>  {
> +                int idx = off / PCPU_MIN_ALLOC_SIZE;
> +
>                  lockdep_assert_held(&pcpu_lock);
>  
>                  pcpu_stats.nr_dealloc++;
>                  pcpu_stats.nr_cur_alloc--;
>  
>                  chunk->nr_alloc--;
> +
> +                chunk->allocs[idx].size = 0;
> +                chunk->allocs[idx].caller = NULL;
>  }
>  
>  /*
> @@ -204,11 +223,12 @@ static inline void pcpu_stats_save_ai(const struct 
> pcpu_alloc_info *ai)
>  {
>  }
>  
> -static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t 
> size)
> +static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t 
> size,
> +   int off, const void *caller)
>  {
>  }
>  
> -static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk)
> +static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk, int 
> off)
>  {
>  }
>  
> diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
> index b5fdd43..c587393 100644
> --- a/mm/percpu-stats.c
> +++ b/mm/percpu-stats.c
> @@ -44,6 +44,34 @@ static int find_max_nr_alloc(void)
>                  return max_nr_alloc;
>  }
>  
> +static void get_chunk_alloc_info(struct pcpu_chunk *chunk,
> +                                                                 struct 
> pcpu_allocs_info *allocs)
> +{
> +                int region_bits = pcpu_chunk_map_bits(chunk);
> +                int i, j;
> +
> +                for (i = 0; i < region_bits; i++) {
> +                                struct pcpu_allocs_info *ai = 
> &chunk->allocs[i];
> +
> +                                if (!ai->size)
> +                                                continue;
> +
> +                                for (j = 0; allocs[j].size; j++) {
> +                                                if (allocs[j].size == 
> ai->size &&
> +                                                    allocs[j].caller == 
> ai->caller) {
> + allocs[j].count++;
> +                                                                break;
> +                                                }
> +                                }
> +
> +                                if (!allocs[j].size) {
> +                                                allocs[j].size = 
> ai->size;
> +                                                allocs[j].caller = 
> ai->caller;
> +                                                allocs[j].count = 1;
> +                                }
> +                }
> +}
> +
>  /*
>   * Prints out chunk state. Fragmentation is considered between
>   * the beginning of the chunk to the last allocation.
> @@ -225,11 +253,65 @@ static int percpu_stats_show(struct seq_file *m, 
> void *v)
>  }
>  DEFINE_SHOW_ATTRIBUTE(percpu_stats);
>  
> +static int percpu_users_show(struct seq_file *m, void *v)
> +{
> +                struct pcpu_allocs_info *allocs, *sum, *cur;
> +                struct pcpu_chunk *chunk;
> +                int unit_pages, slot_users, max_users;
> +                int slot, i;
> +
> +                unit_pages = pcpu_stats_ai.unit_size >> PAGE_SHIFT;
> +                slot_users = pcpu_nr_pages_to_map_bits(unit_pages);
> +                max_users = slot_users * (pcpu_nr_slots + 1);
> +
> +                allocs = vzalloc(array_size(sizeof(*allocs), max_users));
> +                if (!allocs)
> +                                return -ENOMEM;
> +
> +                spin_lock_irq(&pcpu_lock);
> +
> +                for (slot = 0; slot < pcpu_nr_slots; slot++)
> +                                list_for_each_entry(chunk, 
> &pcpu_slot[slot], list)
> + get_chunk_alloc_info(chunk, allocs + slot);
> +
> +                if (pcpu_reserved_chunk)
> +                                get_chunk_alloc_info(pcpu_reserved_chunk, 
> allocs + slot);
> +
> +                spin_unlock_irq(&pcpu_lock);
> +
> +                for (slot = 1; slot < pcpu_nr_slots + 1; slot++) {
> +                                for (i = 0; i < slot_users; i++) {
> +                                                sum = &allocs[i];
> +                                                cur = &allocs[i + slot * 
> slot_users];
> +
> +                                                if (sum->size == 
> cur->size &&
> +                                                    sum->caller == 
> cur->caller) {
> + sum->count += cur->count;
> +                                                                cur->size 
> = cur->count = 0;
> + cur->caller = NULL;
> +                                                }
> +                                }
> +                }
> +
> +                for (i = 0; i < max_users; i++)
> +                                if (allocs[i].size)
> +                                                seq_printf(m, 
> "%-10ld\t%-10d\t%-30pS\n", allocs[i].size,
> + allocs[i].count, allocs[i].caller);
> +
> +                vfree(allocs);
> +
> +                return 0;
> +}
> +DEFINE_SHOW_ATTRIBUTE(percpu_users);
> +
>  static int __init init_percpu_stats_debugfs(void)
>  {
>                  debugfs_create_file("percpu_stats", 0444, NULL, NULL,
>                                                  &percpu_stats_fops);
>  
> +                debugfs_create_file("percpu_users", 0444, NULL, NULL,
> +                                                &percpu_users_fops);
> +
>                  return 0;
>  }
>  
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 25104cd..549b29a 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1031,7 +1031,7 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, 
> int off)
>                  int bit_off, bits, end, oslot;
>  
>                  lockdep_assert_held(&pcpu_lock);
> -                pcpu_stats_area_dealloc(chunk);
> +                pcpu_stats_area_dealloc(chunk, off);
>  
>                  oslot = pcpu_chunk_slot(chunk);
>  
> @@ -1120,6 +1120,8 @@ static struct pcpu_chunk * __init 
> pcpu_alloc_first_chunk(unsigned long tmp_addr,
>          sizeof(chunk->bound_map[0]), 0);
>                  chunk->md_blocks = 
> memblock_alloc(pcpu_chunk_nr_blocks(chunk) *
>          sizeof(chunk->md_blocks[0]), 0);
> +                chunk->allocs = memblock_alloc(region_bits * 
> sizeof(*chunk->allocs), 0);
> +
>                  pcpu_init_md_blocks(chunk);
>  
>                  /* manage populated page bitmap */
> @@ -1190,6 +1192,15 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t 
> gfp)
>                  if (!chunk->md_blocks)
>                                  goto md_blocks_fail;
>  
> +#ifdef CONFIG_PERCPU_STATS
> +                chunk->allocs = pcpu_mem_zalloc(region_bits * 
> sizeof(chunk->allocs[0]),
> +  gfp);
> +                if (!chunk->allocs) {
> +                                pcpu_mem_free(chunk->md_blocks);
> +                                goto md_blocks_fail;
> +                }
> +#endif
> +
>                  pcpu_init_md_blocks(chunk);
>  
>                  /* init metadata */
> @@ -1212,6 +1223,9 @@ static void pcpu_free_chunk(struct pcpu_chunk 
> *chunk)
>  {
>                  if (!chunk)
>                                  return;
> +#ifdef CONFIG_PERCPU_STATS
> +                pcpu_mem_free(chunk->allocs);
> +#endif
>                  pcpu_mem_free(chunk->md_blocks);
>                  pcpu_mem_free(chunk->bound_map);
>                  pcpu_mem_free(chunk->alloc_map);
> @@ -1350,7 +1364,7 @@ static struct pcpu_chunk 
> *pcpu_chunk_addr_search(void *addr)
>   * Percpu pointer to the allocated area on success, NULL on failure.
>   */
>  static void __percpu *pcpu_alloc(size_t size, size_t align, bool 
> reserved,
> -                                                                 gfp_t 
> gfp)
> +                                                                 gfp_t 
> gfp, const void *caller)
>  {
>                  /* whitelisted flags that can be passed to the backing 
> allocators */
>                  gfp_t pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | 
> __GFP_NOWARN);
> @@ -1460,7 +1474,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t 
> align, bool reserved,
>                  goto restart;
>  
>  area_found:
> -                pcpu_stats_area_alloc(chunk, size);
> +                pcpu_stats_area_alloc(chunk, size, off, caller);
>                  spin_unlock_irqrestore(&pcpu_lock, flags);
>  
>                  /* populate if not all pages are already there */
> @@ -1543,7 +1557,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t 
> align, bool reserved,
>   */
>  void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp)
>  {
> -                return pcpu_alloc(size, align, false, gfp);
> +                return pcpu_alloc(size, align, false, gfp, 
> __builtin_return_address(0));
>  }
>  EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);
>  
> @@ -1556,7 +1570,8 @@ EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);
>   */
>  void __percpu *__alloc_percpu(size_t size, size_t align)
>  {
> -                return pcpu_alloc(size, align, false, GFP_KERNEL);
> +                return pcpu_alloc(size, align, false, GFP_KERNEL,
> + __builtin_return_address(0));
>  }
>  EXPORT_SYMBOL_GPL(__alloc_percpu);
>  
> @@ -1578,7 +1593,8 @@ EXPORT_SYMBOL_GPL(__alloc_percpu);
>   */
>  void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
>  {
> -                return pcpu_alloc(size, align, true, GFP_KERNEL);
> +                return pcpu_alloc(size, align, true, GFP_KERNEL,
> + __builtin_return_address(0));
>  }
>  
>  /**
> -- 
> 2.7.4
> 
> 
> -- 
> Sincerely yours,
> Mike.
> 
> 
> 
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number 
> 741598. 
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
> 

-- 
Sincerely yours,
Mike.