Hi Mike,

Sure, I'm running a test now with lower memory pressure, as I'd like to better
understand how much memory pressure is needed to get the dying-cgroup
reclamation to kick in. That should be finished tomorrow morning, when I plan
to add in the 2 patches you referenced, and I can also add in your patch that
adds percpu_users. I will also re-enable IPv6 at that point. I assume it makes
sense to run all 3 of these together?

thanks,

Dan McGinnes

IBM Cloud - Containers performance

Int Tel: 247359  Ext Tel: 01962 817359

Notes: Daniel McGinnes/UK/IBM
Email: MCGINNES@xxxxxxxxxx

IBM (UK) Ltd, Hursley Park, Winchester, Hampshire, SO21 2JN



From: "Mike Rapoport" <rppt@xxxxxxxxxxxxxxxxxx>
To: Daniel McGinnes/UK/IBM@IBMGB
Cc: "Roman Gushchin" <guro@xxxxxx>, "cgroups@xxxxxxxxxxxxxxx" <cgroups@xxxxxxxxxxxxxxx>, "Nathaniel Rockwell" <nrockwell@xxxxxxxxxx>
Date: 08/10/2018 13:09
Subject: Re: PROBLEM: Memory leaking when running kubernetes cronjobs

Hi Dan,

On Mon, Oct 08, 2018 at 11:35:12AM +0100, Daniel McGinnes wrote:
> Hi Mike,
>
> good finds!
>
> I let my run with IPv6 disabled continue over the weekend, and after ~4 days
> there has been no increase in Percpu memory and MemAvailable has remained
> flat - so it seems fairly conclusive that IPv6 was causing the remaining
> leak.
>
> It will be interesting to see if the fix you found resolves the issue when
> IPv6 is re-enabled.

It would be interesting to see what the users of the percpu memory are, with
the patch I sent earlier. It outputs info similar to /proc/vmallocinfo into
/sys/kernel/debug/percpu_users. I haven't had a chance to verify that it's
completely stable under stress, though.
> thanks,
>
> Dan McGinnes



> From: "Mike Rapoport" <rppt@xxxxxxxxxxxxxxxxxx>
> To: "Roman Gushchin" <guro@xxxxxx>
> Cc: Daniel McGinnes/UK/IBM@IBMGB, "cgroups@xxxxxxxxxxxxxxx" <cgroups@xxxxxxxxxxxxxxx>, "Nathaniel Rockwell" <nrockwell@xxxxxxxxxx>
> Date: 08/10/2018 08:05
> Subject: Re: PROBLEM: Memory leaking when running kubernetes cronjobs
>
> On Sat, Oct 06, 2018 at 12:42:37AM +0000, Roman Gushchin wrote:
> > Hi Daniel!
> >
> > On Fri, Oct 05, 2018 at 10:16:25AM +0000, Daniel McGinnes wrote:
> > > Hi Roman,
> > >
> > > Memory pressure was started after 1 hour (I ran stress --vm 16 --vm-bytes
> > > 1772864000 -t 300 for 5 minutes, then slept for 5 mins, in a continuous
> > > loop).
> > >
> > > The machine has 16 cores & 32 GB RAM.
> > >
> > > I think the issue I still have is that even though the per-cpu memory is
> > > able to be reused for other per-cpu allocations, my understanding is that
> > > it will not be available for general use by applications - so if percpu
> > > memory usage is growing continuously (which we still see happening, pretty
> > > slowly - but over months it would be fairly significant) there will be
> > > less memory available for applications to use. Please let me know if I've
> > > misunderstood something here.
> >
> > Well, yeah, not looking good.
> >
> > > After seeing several stacks in IPv6 in the memory-leak output, I ran a
> > > test with IPv6 disabled on the host. Interestingly, after 24 hours the
> > > Percpu memory reported in meminfo seems to have flattened out, whereas
> > > with IPv6 enabled it was still growing. MemAvailable is decreasing so
> > > slowly that I need to leave it longer to draw any conclusions from that.
> >
> > Looks like there is an independent per-cpu memory leak somewhere in the
> > ipv6 stack. Not sure, of course, but if the number of dying cgroups is
> > not growing...
>
> There is a leak in the percpu allocator itself; it never frees some of its
> metadata. I sent the fix yesterday [1], and I believe it will be merged in
> 4.19.
>
> Also, there was a recent fix for a leak in ipv6 [2].
>
> I'm now trying to see the dynamics of the percpu allocations, so I've
> added yet another debugfs interface for percpu (below), similar to
> /proc/vmallocinfo. I hope that by the end of the day I'll be able to see
> what is causing the increase in percpu memory.
>
> > I'd try to check that my memleak.py is actually capturing all allocs;
> > maybe we're missing something...
> >
> > Thanks!
>
> [1] https://lkml.org/lkml/2018/10/7/84
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce7ea4af0838ffd4667ecad4eb5eec7a25342f1e
>
>
> From 3979853b25b3cd1e0ba0d5242c2145f63afd1ad1 Mon Sep 17 00:00:00 2001
> From: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>
> Date: Sun, 7 Oct 2018 19:41:17 +0300
> Subject: [PATCH] percpu: stats: add info about memory users
>
> Print a summary of allocation count, size and callers, similar to
> /proc/vmallocinfo.
> Signed-off-by: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>
> ---
>  mm/percpu-internal.h | 28 +++++++++++++++---
>  mm/percpu-stats.c    | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/percpu.c          | 28 ++++++++++++++----
>  3 files changed, 128 insertions(+), 10 deletions(-)
>
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index b1739dc..b336ae4 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -21,10 +21,17 @@ struct pcpu_block_md {
>          int first_free;                 /* block position of first free */
>  };
>
> +struct pcpu_allocs_info {
> +        size_t size;
> +        int count;
> +        const void *caller;
> +};
> +
>  struct pcpu_chunk {
>  #ifdef CONFIG_PERCPU_STATS
>          int nr_alloc;                   /* # of allocations */
>          size_t max_alloc_size;          /* largest allocation size */
> +        struct pcpu_allocs_info *allocs;
>  #endif
>
>          struct list_head list;          /* linked to pcpu_slot lists */
> @@ -136,8 +143,11 @@ static inline void pcpu_stats_save_ai(const struct pcpu_alloc_info *ai)
>   * CONTEXT:
>   * pcpu_lock.
>   */
> -static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t size)
> +static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t size,
> +                                         int off, const void *caller)
>  {
> +        int idx = off / PCPU_MIN_ALLOC_SIZE;
> +
>          lockdep_assert_held(&pcpu_lock);
>
>          pcpu_stats.nr_alloc++;
> @@ -151,6 +161,10 @@ static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t size)
>
>          chunk->nr_alloc++;
>          chunk->max_alloc_size = max(chunk->max_alloc_size, size);
> +
>�+        chunk->allocs[idx].size = size;
> +        chunk->allocs[idx].caller = caller;
> +        chunk->allocs[idx].count = 1;
>  }
>
>  /*
> @@ -160,14 +174,19 @@ static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t size)
>   *
>   * CONTEXT:
>   * pcpu_lock.
>   */
> -static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk)
> +static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk, int off)
>  {
> +        int idx = off / PCPU_MIN_ALLOC_SIZE;
> +
>          lockdep_assert_held(&pcpu_lock);
>
>          pcpu_stats.nr_dealloc++;
>          pcpu_stats.nr_cur_alloc--;
>
>          chunk->nr_alloc--;
> +
> +        chunk->allocs[idx].size = 0;
> +        chunk->allocs[idx].caller = NULL;
>  }
>
>  /*
> @@ -204,11 +223,12 @@ static inline void pcpu_stats_save_ai(const struct pcpu_alloc_info *ai)
>  {
>  }
>
> -static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t size)
> +static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t size,
> +                                         int off, const void *caller)
>  {
>  }
>
> -static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk)
> +static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk, int off)
>  {
>  }
>
> diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
> index b5fdd43..c587393 100644
> --- a/mm/percpu-stats.c
> +++ b/mm/percpu-stats.c
> @@ -44,6 +44,34 @@ static int find_max_nr_alloc(void)
>          return max_nr_alloc;
>  }
>
> +static void get_chunk_alloc_info(struct pcpu_chunk *chunk,
> +                                 struct pcpu_allocs_info *allocs)
> +{
> +        int region_bits = pcpu_chunk_map_bits(chunk);
> +        int i, j;
> +
> +        for (i = 0; i < region_bits; i++) {
> +                struct pcpu_allocs_info *ai = &chunk->allocs[i];
> +
> +                if (!ai->size)
> +                        continue;
> +
> +                for (j = 0; allocs[j].size; j++) {
> +                        if (allocs[j].size == ai->size &&
> +                            allocs[j].caller == ai->caller) {
> +                                allocs[j].count++;
> +                                break;
> +                        }
> +                }
> +
> +                if (!allocs[j].size) {
> +                        allocs[j].size = ai->size;
> +                        allocs[j].caller = ai->caller;
> +                        allocs[j].count = 1;
> +                }
> +        }
> +}
> +
>  /*
>   * Prints out chunk state. Fragmentation is considered between
>   * the beginning of the chunk to the last allocation.
> @@ -225,11 +253,65 @@ static int percpu_stats_show(struct seq_file *m, void *v)
>  }
>  DEFINE_SHOW_ATTRIBUTE(percpu_stats);
>
> +static int percpu_users_show(struct seq_file *m, void *v)
> +{
> +        struct pcpu_allocs_info *allocs, *sum, *cur;
> +        struct pcpu_chunk *chunk;
> +        int unit_pages, slot_users, max_users;
> +        int slot, i;
> +
> +        unit_pages = pcpu_stats_ai.unit_size >> PAGE_SHIFT;
> +        slot_users = pcpu_nr_pages_to_map_bits(unit_pages);
> +        max_users = slot_users * (pcpu_nr_slots + 1);
> +
> +        allocs = vzalloc(array_size(sizeof(*allocs), max_users));
> +        if (!allocs)
> +                return -ENOMEM;
> +
> +        spin_lock_irq(&pcpu_lock);
> +
> +        for (slot = 0; slot < pcpu_nr_slots; slot++)
> +                list_for_each_entry(chunk, &pcpu_slot[slot], list)
> +                        get_chunk_alloc_info(chunk, allocs + slot);
> +
> +        if (pcpu_reserved_chunk)
> +                get_chunk_alloc_info(pcpu_reserved_chunk, allocs + slot);
> +
> +        spin_unlock_irq(&pcpu_lock);
> +
> +        for (slot = 1; slot < pcpu_nr_slots + 1; slot++) {
> +                for (i = 0; i < slot_users; i++) {
> +                        sum = &allocs[i];
> +                        cur = &allocs[i + slot * slot_users];
> +
> +                        if (sum->size == cur->size &&
> +                            sum->caller == cur->caller) {
> +                                sum->count += cur->count;
> +                                cur->size = cur->count = 0;
> +                                cur->caller = NULL;
> +                        }
> +                }
> +        }
> +
> +        for (i = 0; i < max_users; i++)
> +                if (allocs[i].size)
> +                        seq_printf(m, "%-10ld\t%-10d\t%-30pS\n", allocs[i].size,
> +                                   allocs[i].count, allocs[i].caller);
> +
> +        vfree(allocs);
> +
> +        return 0;
> +}
> +DEFINE_SHOW_ATTRIBUTE(percpu_users);
> +
>  static int __init init_percpu_stats_debugfs(void)
>  {
>          debugfs_create_file("percpu_stats", 0444, NULL, NULL,
>                          &percpu_stats_fops);
>
> +        debugfs_create_file("percpu_users", 0444, NULL, NULL,
> +                        &percpu_users_fops);
> +
>          return 0;
>  }
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 25104cd..549b29a 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1031,7 +1031,7 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
>          int bit_off, bits, end, oslot;
>
>          lockdep_assert_held(&pcpu_lock);
> -        pcpu_stats_area_dealloc(chunk);
> +        pcpu_stats_area_dealloc(chunk, off);
>
>          oslot = pcpu_chunk_slot(chunk);
>
> @@ -1120,6 +1120,8 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
>                                            sizeof(chunk->bound_map[0]), 0);
>          chunk->md_blocks = memblock_alloc(pcpu_chunk_nr_blocks(chunk) *
>                                            sizeof(chunk->md_blocks[0]), 0);
> +        chunk->allocs = memblock_alloc(region_bits * sizeof(*chunk->allocs), 0);
> +
>          pcpu_init_md_blocks(chunk);
>
>          /* manage populated page bitmap */
> @@ -1190,6 +1192,15 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
>          if (!chunk->md_blocks)
>                  goto md_blocks_fail;
>
> +#ifdef CONFIG_PERCPU_STATS
> +        chunk->allocs = pcpu_mem_zalloc(region_bits * sizeof(chunk->allocs[0]),
> +                                        gfp);
> +        if (!chunk->allocs) {
> +                pcpu_mem_free(chunk->md_blocks);
> +                goto md_blocks_fail;
> +        }
> +#endif
> +
>          pcpu_init_md_blocks(chunk);
>
>          /* init metadata */
> @@ -1212,6 +1223,9 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
>  {
>          if (!chunk)
>                  return;
> +#ifdef CONFIG_PERCPU_STATS
> +        pcpu_mem_free(chunk->allocs);
> +#endif
>          pcpu_mem_free(chunk->md_blocks);
>          pcpu_mem_free(chunk->bound_map);
>          pcpu_mem_free(chunk->alloc_map);
> @@ -1350,7 +1364,7 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
>   * Percpu pointer to the allocated area on success, NULL on failure.
>   */
>  static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
> -                                 gfp_t gfp)
> +                                 gfp_t gfp, const void *caller)
>  {
>          /* whitelisted flags that can be passed to the backing allocators */
>          gfp_t pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
> @@ -1460,7 +1474,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>                  goto restart;
>
>  area_found:
> -        pcpu_stats_area_alloc(chunk, size);
> +        pcpu_stats_area_alloc(chunk, size, off, caller);
>          spin_unlock_irqrestore(&pcpu_lock, flags);
>
>          /* populate if not all pages are already there */
> @@ -1543,7 +1557,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>   */
>  void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp)
>  {
> -        return pcpu_alloc(size, align, false, gfp);
> +        return pcpu_alloc(size, align, false, gfp, __builtin_return_address(0));
>  }
>  EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);
>
> @@ -1556,7 +1570,8 @@ EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);
>   */
>  void __percpu *__alloc_percpu(size_t size, size_t align)
>  {
> -        return pcpu_alloc(size, align, false, GFP_KERNEL);
> +        return pcpu_alloc(size, align, false, GFP_KERNEL,
> +                          __builtin_return_address(0));
>  }
>  EXPORT_SYMBOL_GPL(__alloc_percpu);
>
> @@ -1578,7 +1593,8 @@ EXPORT_SYMBOL_GPL(__alloc_percpu);
>   */
>  void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
>  {
> -        return pcpu_alloc(size, align, true, GFP_KERNEL);
> +        return pcpu_alloc(size, align, true, GFP_KERNEL,
> +                          __builtin_return_address(0));
>  }
>
>  /**
> --
> 2.7.4
>
>
> --
> Sincerely yours,
> Mike.
>
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

--
Sincerely yours,
Mike.