On Mon, Oct 08, 2018 at 01:45:23PM +0100, Daniel McGinnes wrote:
> Hi Mike,
> 
> sure, I'm running a test now with lower memory pressure, as I'd like to
> understand better how much memory pressure is needed to get the dying
> cgroup reclamation to kick in.
> 
> That should be finished tomorrow morning, when I plan to add in the 2
> patches you referenced, and can also add in your patch to add
> percpu_users. I will also re-enable IPv6 at this point.
> 
> I assume it makes sense to run all 3 of these together?

Yep. Plus another one I've found today:
https://marc.info/?l=linux-netdev&m=153900037804969

If you are planning to run with my percpu_users patch, please collect
/sys/kernel/debug/percpu_users once in a while to see the dynamics.

> thanks,
> 
> Dan McGinnes
> 
> IBM Cloud - Containers performance
> 
> Int Tel: 247359 Ext Tel: 01962 817359
> 
> Notes: Daniel McGinnes/UK/IBM
> Email: MCGINNES@xxxxxxxxxx
> 
> IBM (UK) Ltd, Hursley Park, Winchester, Hampshire, SO21 2JN
> 
> From: "Mike Rapoport" <rppt@xxxxxxxxxxxxxxxxxx>
> To: Daniel McGinnes/UK/IBM@IBMGB
> Cc: "Roman Gushchin" <guro@xxxxxx>, "cgroups@xxxxxxxxxxxxxxx"
> <cgroups@xxxxxxxxxxxxxxx>, "Nathaniel Rockwell" <nrockwell@xxxxxxxxxx>
> Date: 08/10/2018 13:09
> Subject: Re: PROBLEM: Memory leaking when running kubernetes cronjobs
> 
> Hi Dan,
> 
> On Mon, Oct 08, 2018 at 11:35:12AM +0100, Daniel McGinnes wrote:
> > Hi Mike,
> > 
> > good finds!
> > 
> > I let my run with IPv6 disabled continue over the weekend, and after
> > ~4 days there has been no increase in Percpu memory and MemAvailable
> > has remained flat - so it seems fairly conclusive that IPv6 was
> > causing the remaining leak.
> > 
> > It will be interesting to see whether the fix you found resolves the
> > issue when IPv6 is re-enabled.
> 
> It would be interesting to see what the users of the percpu memory are,
> using the patch I sent earlier. It outputs info similar to
> /proc/vmallocinfo into /sys/kernel/debug/percpu_users.
> 
> I didn't have a chance to verify that it's completely stable under
> stress, though.
> 
> > thanks,
> > 
> > Dan McGinnes
> > 
> > IBM Cloud - Containers performance
> > 
> > Int Tel: 247359 Ext Tel: 01962 817359
> > 
> > Notes: Daniel McGinnes/UK/IBM
> > Email: MCGINNES@xxxxxxxxxx
> > 
> > IBM (UK) Ltd, Hursley Park, Winchester, Hampshire, SO21 2JN
> > 
> > From: "Mike Rapoport" <rppt@xxxxxxxxxxxxxxxxxx>
> > To: "Roman Gushchin" <guro@xxxxxx>
> > Cc: Daniel McGinnes/UK/IBM@IBMGB, "cgroups@xxxxxxxxxxxxxxx"
> > <cgroups@xxxxxxxxxxxxxxx>, "Nathaniel Rockwell" <nrockwell@xxxxxxxxxx>
> > Date: 08/10/2018 08:05
> > Subject: Re: PROBLEM: Memory leaking when running kubernetes cronjobs
> > 
> > On Sat, Oct 06, 2018 at 12:42:37AM +0000, Roman Gushchin wrote:
> > > Hi Daniel!
> > > 
> > > On Fri, Oct 05, 2018 at 10:16:25AM +0000, Daniel McGinnes wrote:
> > > > Hi Roman,
> > > > 
> > > > memory pressure was started after 1 hour (ran "stress --vm 16
> > > > --vm-bytes 1772864000 -t 300" for 5 minutes, then slept for
> > > > 5 minutes, in a continuous loop).
> > > > 
> > > > Machine has 16 cores & 32 GB RAM.
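
A minimal sketch of that pressure cycle, assuming the stress(1) tool from
the quoted command is available:

/* Drives the 5-minutes-on / 5-minutes-off memory-pressure cycle
 * described above.  The stress flags are taken verbatim from the
 * quoted message: 16 VM workers of ~1.7 GB each, exiting after
 * 300 seconds.
 */
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        for (;;) {
                if (system("stress --vm 16 --vm-bytes 1772864000 -t 300"))
                        return 1;       /* stop if stress fails to run */
                sleep(300);             /* 5 minutes without pressure */
        }
}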
> > > > 
> > > > I think the issue I still have is that even though the percpu
> > > > memory can be reused for other percpu allocations, my
> > > > understanding is that it will not be available for general use by
> > > > applications - so if percpu memory usage is growing continuously
> > > > (which we still see happening pretty slowly - but over months it
> > > > would be fairly significant), there will be less memory available
> > > > for applications to use. Please let me know if I've misunderstood
> > > > something here.
> > > 
> > > Well, yeah, not looking good.
> > > 
> > > > After seeing several stacks in IPv6 in the memory leak output, I
> > > > ran a test with IPv6 disabled on the host. Interestingly, after
> > > > 24 hours the Percpu memory reported in meminfo seems to have
> > > > flattened out, whereas with IPv6 enabled it was still growing.
> > > > MemAvailable is decreasing so slowly that I need to leave it
> > > > longer to draw any conclusions from that.
> > > 
> > > Looks like there is an independent per-cpu memory leak somewhere in
> > > the ipv6 stack. Not sure, of course, but if the number of dying
> > > cgroups is not growing...
> > 
> > There is a leak in the percpu allocator itself: it never frees some
> > of its metadata. I sent the fix yesterday [1]; I believe it will be
> > merged in 4.19.
> > 
> > Also, there was a recent fix for a leak in ipv6 [2].
> > 
> > I'm now trying to see the dynamics of the percpu allocations, so I've
> > added yet another debugfs interface for percpu (below), similar to
> > /proc/vmallocinfo. I hope that by the end of the day I'll be able to
> > see what is causing the increase in percpu memory.
> > 
> > > I'd try to check that my memleak.py is actually capturing all
> > > allocs; maybe we're missing something...
> > > 
> > > Thanks!
> > 
> > [1] https://lkml.org/lkml/2018/10/7/84
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce7ea4af0838ffd4667ecad4eb5eec7a25342f1e
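
A minimal userspace sketch of the periodic collection suggested above,
assuming the patch below is applied and debugfs is mounted at
/sys/kernel/debug; diffing successive snapshots should show which callers
keep growing:

/* Snapshots /sys/kernel/debug/percpu_users with a timestamp every
 * 10 minutes so the dynamics of percpu usage can be compared over time.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];
        size_t n;

        for (;;) {
                FILE *f = fopen("/sys/kernel/debug/percpu_users", "r");

                if (!f)
                        return 1;       /* debugfs or patch missing */
                printf("--- %ld ---\n", (long)time(NULL));
                while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
                        fwrite(buf, 1, n, stdout);
                fclose(f);
                fflush(stdout);
                sleep(600);
        }
}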

> > From 3979853b25b3cd1e0ba0d5242c2145f63afd1ad1 Mon Sep 17 00:00:00 2001
> > From: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>
> > Date: Sun, 7 Oct 2018 19:41:17 +0300
> > Subject: [PATCH] percpu: stats: add info about memory users
> > 
> > Print summary of allocation count, size and callers, similar to
> > /proc/vmallocinfo.
> > 
> > Signed-off-by: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>
> > ---
> >  mm/percpu-internal.h | 28 +++++++++++++++---
> >  mm/percpu-stats.c    | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  mm/percpu.c          | 28 ++++++++++++++----
> >  3 files changed, 128 insertions(+), 10 deletions(-)
> > 
> > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> > index b1739dc..b336ae4 100644
> > --- a/mm/percpu-internal.h
> > +++ b/mm/percpu-internal.h
> > @@ -21,10 +21,17 @@ struct pcpu_block_md {
> >  	int	first_free;	/* block position of first free */
> >  };
> >  
> > +struct pcpu_allocs_info {
> > +	size_t		size;
> > +	int		count;
> > +	const void	*caller;
> > +};
> > +
> >  struct pcpu_chunk {
> >  #ifdef CONFIG_PERCPU_STATS
> >  	int			nr_alloc;	/* # of allocations */
> >  	size_t			max_alloc_size;	/* largest allocation size */
> > +	struct pcpu_allocs_info	*allocs;
> >  #endif
> >  
> >  	struct list_head	list;		/* linked to pcpu_slot lists */
> > @@ -136,8 +143,11 @@ static inline void pcpu_stats_save_ai(const struct pcpu_alloc_info *ai)
> >   * CONTEXT:
> >   * pcpu_lock.
> >   */
> > -static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t size)
> > +static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t size,
> > +					 int off, const void *caller)
> >  {
> > +	int idx = off / PCPU_MIN_ALLOC_SIZE;
> > +
> >  	lockdep_assert_held(&pcpu_lock);
> >  
> >  	pcpu_stats.nr_alloc++;
> > @@ -151,6 +161,10 @@ static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t size)
> >  
> >  	chunk->nr_alloc++;
> >  	chunk->max_alloc_size = max(chunk->max_alloc_size, size);
> > +
> > +	chunk->allocs[idx].size = size;
> > +	chunk->allocs[idx].caller = caller;
> > +	chunk->allocs[idx].count = 1;
> >  }
> >  
> >  /*
> > @@ -160,14 +174,19 @@ static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t size)
> >   * CONTEXT:
> >   * pcpu_lock.
> >   */
> > -static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk)
> > +static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk, int off)
> >  {
> > +	int idx = off / PCPU_MIN_ALLOC_SIZE;
> > +
> >  	lockdep_assert_held(&pcpu_lock);
> >  
> >  	pcpu_stats.nr_dealloc++;
> >  	pcpu_stats.nr_cur_alloc--;
> >  
> >  	chunk->nr_alloc--;
> > +
> > +	chunk->allocs[idx].size = 0;
> > +	chunk->allocs[idx].caller = NULL;
> >  }
> >  
> >  /*
> > @@ -204,11 +223,12 @@ static inline void pcpu_stats_save_ai(const struct pcpu_alloc_info *ai)
> >  {
> >  }
> >  
> > -static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t size)
> > +static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t size,
> > +					 int off, const void *caller)
> >  {
> >  }
> >  
> > -static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk)
> > +static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk, int off)
> >  {
> >  }
> >  
> > diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
> > index b5fdd43..c587393 100644
> > --- a/mm/percpu-stats.c
> > +++ b/mm/percpu-stats.c
> > @@ -44,6 +44,34 @@ static int find_max_nr_alloc(void)
> >  	return max_nr_alloc;
> >  }
> >  
> > +static void get_chunk_alloc_info(struct pcpu_chunk *chunk,
> > +				 struct pcpu_allocs_info *allocs)
> > +{
> > +	int region_bits = pcpu_chunk_map_bits(chunk);
> > +	int i, j;
> > +
> > +	for (i = 0; i < region_bits; i++) {
> > +		struct pcpu_allocs_info *ai = &chunk->allocs[i];
> > +
> > +		if (!ai->size)
> > +			continue;
> > +
> > +		for (j = 0; allocs[j].size; j++) {
> > +			if (allocs[j].size == ai->size &&
> > +			    allocs[j].caller == ai->caller) {
> > +				allocs[j].count++;
> > +				break;
> > +			}
> > +		}
> > +
> > +		if (!allocs[j].size) {
> > +			allocs[j].size = ai->size;
> > +			allocs[j].caller = ai->caller;
> > +			allocs[j].count = 1;
> > +		}
> > +	}
> > +}
> > +
> >  /*
> >   * Prints out chunk state. Fragmentation is considered between
> >   * the beginning of the chunk to the last allocation.
> > @@ -225,11 +253,65 @@ static int percpu_stats_show(struct seq_file *m, void *v)
> >  }
> >  DEFINE_SHOW_ATTRIBUTE(percpu_stats);
> >  
> > +static int percpu_users_show(struct seq_file *m, void *v)
> > +{
> > +	struct pcpu_allocs_info *allocs, *sum, *cur;
> > +	struct pcpu_chunk *chunk;
> > +	int unit_pages, slot_users, max_users;
> > +	int slot, i;
> > +
> > +	unit_pages = pcpu_stats_ai.unit_size >> PAGE_SHIFT;
> > +	slot_users = pcpu_nr_pages_to_map_bits(unit_pages);
> > +	max_users = slot_users * (pcpu_nr_slots + 1);
> > +
> > +	allocs = vzalloc(array_size(sizeof(*allocs), max_users));
> > +	if (!allocs)
> > +		return -ENOMEM;
> > +
> > +	spin_lock_irq(&pcpu_lock);
> > +
> > +	for (slot = 0; slot < pcpu_nr_slots; slot++)
> > +		list_for_each_entry(chunk, &pcpu_slot[slot], list)
> > +			get_chunk_alloc_info(chunk, allocs + slot);
> > +
> > +	if (pcpu_reserved_chunk)
> > +		get_chunk_alloc_info(pcpu_reserved_chunk, allocs + slot);
> > +
> > +	spin_unlock_irq(&pcpu_lock);
> > +
> > +	for (slot = 1; slot < pcpu_nr_slots + 1; slot++) {
> > +		for (i = 0; i < slot_users; i++) {
> > +			sum = &allocs[i];
> > +			cur = &allocs[i + slot * slot_users];
> > +
> > +			if (sum->size == cur->size &&
> > +			    sum->caller == cur->caller) {
> > +				sum->count += cur->count;
> > +				cur->size = cur->count = 0;
> > +				cur->caller = NULL;
> > +			}
> > +		}
> > +	}
> > +
> > +	for (i = 0; i < max_users; i++)
> > +		if (allocs[i].size)
> > +			seq_printf(m, "%-10ld\t%-10d\t%-30pS\n", allocs[i].size,
> > +				   allocs[i].count, allocs[i].caller);
> > +
> > +	vfree(allocs);
> > +
> > +	return 0;
> > +}
> > +DEFINE_SHOW_ATTRIBUTE(percpu_users);
> > +
> >  static int __init init_percpu_stats_debugfs(void)
> >  {
> >  	debugfs_create_file("percpu_stats", 0444, NULL, NULL,
> >  			    &percpu_stats_fops);
> >  
> > +	debugfs_create_file("percpu_users", 0444, NULL, NULL,
> > +			    &percpu_users_fops);
> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index 25104cd..549b29a 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -1031,7 +1031,7 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
> >  	int bit_off, bits, end, oslot;
> >  
> >  	lockdep_assert_held(&pcpu_lock);
> > -	pcpu_stats_area_dealloc(chunk);
> > +	pcpu_stats_area_dealloc(chunk, off);
> >  
> >  	oslot = pcpu_chunk_slot(chunk);
> >  
> > @@ -1120,6 +1120,8 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
> >  					     sizeof(chunk->bound_map[0]), 0);
> >  	chunk->md_blocks = memblock_alloc(pcpu_chunk_nr_blocks(chunk) *
> >  					  sizeof(chunk->md_blocks[0]), 0);
> > +	chunk->allocs = memblock_alloc(region_bits * sizeof(*chunk->allocs), 0);
> > +
> >  	pcpu_init_md_blocks(chunk);
> >  
> >  	/* manage populated page bitmap */
> > @@ -1190,6 +1192,15 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
> >  	if (!chunk->md_blocks)
> >  		goto md_blocks_fail;
> >  
> > +#ifdef CONFIG_PERCPU_STATS
> > +	chunk->allocs = pcpu_mem_zalloc(region_bits * sizeof(chunk->allocs[0]),
> > +					gfp);
> > +	if (!chunk->allocs) {
> > +		pcpu_mem_free(chunk->md_blocks);
> > +		goto md_blocks_fail;
> > +	}
> > +#endif
> > +
> >  	pcpu_init_md_blocks(chunk);
> >  
> >  	/* init metadata */
> > @@ -1212,6 +1223,9 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
> >  {
> >  	if (!chunk)
> >  		return;
> > +#ifdef CONFIG_PERCPU_STATS
> > +	pcpu_mem_free(chunk->allocs);
> > +#endif
> >  	pcpu_mem_free(chunk->md_blocks);
> >  	pcpu_mem_free(chunk->bound_map);
> >  	pcpu_mem_free(chunk->alloc_map);
> > @@ -1350,7 +1364,7 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
> >   * Percpu pointer to the allocated area on success, NULL on failure.
> >   */
> >  static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
> > -				 gfp_t gfp)
> > +				 gfp_t gfp, const void *caller)
> >  {
> >  	/* whitelisted flags that can be passed to the backing allocators */
> >  	gfp_t pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
> > @@ -1460,7 +1474,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
> >  		goto restart;
> >  
> >  area_found:
> > -	pcpu_stats_area_alloc(chunk, size);
> > +	pcpu_stats_area_alloc(chunk, size, off, caller);
> >  	spin_unlock_irqrestore(&pcpu_lock, flags);
> >  
> >  	/* populate if not all pages are already there */
> > @@ -1543,7 +1557,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
> >   */
> >  void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp)
> >  {
> > -	return pcpu_alloc(size, align, false, gfp);
> > +	return pcpu_alloc(size, align, false, gfp, __builtin_return_address(0));
> >  }
> >  EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);
> >  
> > @@ -1556,7 +1570,8 @@ EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);
> >   */
> >  void __percpu *__alloc_percpu(size_t size, size_t align)
> >  {
> > -	return pcpu_alloc(size, align, false, GFP_KERNEL);
> > +	return pcpu_alloc(size, align, false, GFP_KERNEL,
> > +			  __builtin_return_address(0));
> >  }
> >  EXPORT_SYMBOL_GPL(__alloc_percpu);
> >  
> > @@ -1578,7 +1593,8 @@ EXPORT_SYMBOL_GPL(__alloc_percpu);
> >   */
> >  void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
> >  {
> > -	return pcpu_alloc(size, align, true, GFP_KERNEL);
> > +	return pcpu_alloc(size, align, true, GFP_KERNEL,
> > +			  __builtin_return_address(0));
> >  }
> >  
> >  /**
> > -- 
> > 2.7.4
> > 
> > -- 
> > Sincerely yours,
> > Mike.
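
To illustrate what the recorded caller means: pcpu_alloc() stores
__builtin_return_address(0) from the exported wrappers, so each line in
percpu_users is attributed to the code that called alloc_percpu() and
friends. A hypothetical in-kernel user like the sketch below (all names
invented) would therefore appear with my_init as the caller, resolved by
the %pS format:

/* Hypothetical module-style sketch: the allocation below would be
 * recorded by the patch above with "my_init" as the caller symbol.
 */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/percpu.h>

static int __percpu *my_counter;        /* invented per-cpu counter */

static int __init my_init(void)
{
        /* alloc_percpu() expands to __alloc_percpu(), so the recorded
         * __builtin_return_address(0) points back into my_init */
        my_counter = alloc_percpu(int);
        return my_counter ? 0 : -ENOMEM;
}

static void __exit my_exit(void)
{
        /* frees the area and clears its slot via pcpu_stats_area_dealloc() */
        free_percpu(my_counter);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");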
> 
> -- 
> Sincerely yours,
> Mike.

-- 
Sincerely yours,
Mike.