Re: [patch 13/13] mm: memcontrol: rewrite uncharge API

Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx> · Tue, 15 Jul 2014 16:49:53 -0400

On Tue, Jul 15, 2014 at 03:04:54PM -0400, Johannes Weiner wrote:
> On Tue, Jul 15, 2014 at 02:43:58PM -0400, Naoya Horiguchi wrote:
> > On Tue, Jul 15, 2014 at 01:34:39PM -0400, Johannes Weiner wrote:
> > > On Tue, Jul 15, 2014 at 06:07:35PM +0200, Michal Hocko wrote:
> > > > On Tue 15-07-14 11:55:37, Naoya Horiguchi wrote:
> > > > > On Wed, Jun 18, 2014 at 04:40:45PM -0400, Johannes Weiner wrote:
> > > > > ...
> > > > > > diff --git a/mm/swap.c b/mm/swap.c
> > > > > > index a98f48626359..3074210f245d 100644
> > > > > > --- a/mm/swap.c
> > > > > > +++ b/mm/swap.c
> > > > > > @@ -62,6 +62,7 @@ static void __page_cache_release(struct page *page)
> > > > > >  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
> > > > > >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > > > >  	}
> > > > > > +	mem_cgroup_uncharge(page);
> > > > > >  }
> > > > > >  
> > > > > >  static void __put_single_page(struct page *page)
> > > > > 
> > > > > This seems to cause a list breakage in hstate->hugepage_activelist
> > > > > when freeing a hugetlbfs page.
> > > > 
> > > > This looks like a fall out from
> > > > http://marc.info/?l=linux-mm&m=140475936311294&w=2
> > > > 
> > > > I didn't get to review this one but the easiest fix seems to be check
> > > > HugePage and do not call uncharge.
> > > 
> > > Yes, that makes sense.  I'm also moving the uncharge call into
> > > __put_single_page() and __put_compound_page() so that PageHuge(), a
> > > function call, only needs to be checked for compound pages.
> > > 
> > > > > For hugetlbfs, we uncharge in free_huge_page() which is called after
> > > > > __page_cache_release(), so I think that we don't have to uncharge here.
> > > > > 
> > > > > In my testing, moving mem_cgroup_uncharge() inside if (PageLRU) block
> > > > > fixed the problem, so if that works for you, could you fold the change
> > > > > into your patch?
> > > 
> > > Memcg pages that *do* need uncharging might not necessarily be on the
> > > LRU list.
> > 
> > OK.
> > 
> > > Does the following work for you?
> > 
> > Unfortunately, with this change I saw the following bug message when
> > stressing with hugepage migration.
> > move_to_new_page() is called by unmap_and_move_huge_page() too, so
> > we need some hugetlb related code around mem_cgroup_migrate().
> 
> Can we just move hugetlb_cgroup_migrate() into move_to_new_page()?  It
> doesn't seem to be dependent of any page-specific state.
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 7f5a42403fae..219da52d2f43 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -781,7 +781,10 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>  		if (!PageAnon(newpage))
>  			newpage->mapping = NULL;
>  	} else {
> -		mem_cgroup_migrate(page, newpage, false);
> +		if (PageHuge(page))
> +			hugetlb_cgroup_migrate(hpage, new_hpage);

This line needs to be

			hugetlb_cgroup_migrate(page, newpage);

to build successfully, since hpage and new_hpage are not defined in
move_to_new_page().

And yes, with this change the bug in move_to_new_page() is gone,
so we are one step further.

But I hit other bugs, like the ones below.

[   56.692744] BUG: Bad page state in process sysctl  pfn:71c00
[   56.693722] page:ffffea0001c70000 count:0 mapcount:0 mapping:          (null) index:0x8
[   56.695121] page flags: 0x5fffff80004008(uptodate|head)
[   56.695990] page dumped because: cgroup check failed
[   56.696816] pc:ffff88007eb9c000 pc->flags:7 pc->mem_cgroup:ffff8800be59a800
[   56.698059] Modules linked in: stap_6484a34ef9f0ebb4400874c66d0905ac__1496(O) bnep bluetooth ip6t_rpfilter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 cfg80211 xt_conntrack rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ppdev microcode parport_pc serio_raw parport virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc virtio_blk virtio_net ata_generic pata_acpi floppy
[   56.707416] CPU: 2 PID: 1872 Comm: sysctl Tainted: G    B      O  3.15.0-140715-1512-00017-gf1ab1502aa49 #264
[   56.709024] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   56.709810]  ffffffff81a8e0d5 ffff88003d787cb0 ffffffff8172d057 ffff88003d787cc8
[   56.711158]  ffffffff8172d08e ffffea0001c70000 ffff88003d787cf0 ffffffff8119e7a5
[   56.712344]  0000000000000000 000fffff80000000 ffffffff81a8e0d5 ffff88003d787d28
[   56.713551] Call Trace:
[   56.714088]  [<ffffffff8172d057>] __dump_stack+0x19/0x1b
[   56.714793]  [<ffffffff8172d08e>] dump_stack+0x35/0x46
[   56.715546]  [<ffffffff8119e7a5>] bad_page+0xd5/0x130
[   56.716369]  [<ffffffff8119e958>] free_pages_prepare+0x158/0x190
[   56.717222]  [<ffffffff8119edab>] __free_pages_ok+0x1b/0xb0
[   56.717960]  [<ffffffff8119f859>] __free_pages+0x29/0x50
[   56.718710]  [<ffffffff811dbce0>] update_and_free_page+0xd0/0x110
[   56.719575]  [<ffffffff811dd663>] free_pool_huge_page+0xd3/0xf0
[   56.720407]  [<ffffffff811dd7ec>] set_max_huge_pages+0x16c/0x1c0
[   56.721255]  [<ffffffff811dd968>] __nr_hugepages_store_common+0x128/0x1a0
[   56.722203]  [<ffffffff811ddb28>] hugetlb_sysctl_handler_common+0x98/0xb0
[   56.723147]  [<ffffffff811de56e>] hugetlb_sysctl_handler+0x1e/0x20
[   56.723962]  [<ffffffff8127a103>] proc_sys_call_handler+0xa3/0xb0
[   56.724805]  [<ffffffff8127a124>] proc_sys_write+0x14/0x20
[   56.725844]  [<ffffffff8120921a>] vfs_write+0xba/0x1e0
[   56.726792]  [<ffffffff81209d8d>] SyS_write+0x4d/0xc0
[   56.727596]  [<ffffffff81742a12>] system_call_fastpath+0x16/0x1b
[   58.894865] page:ffffea0001cf8000 count:2 mapcount:0 mapping:ffff88003d481278 index:0x1
[   58.896112] page flags: 0x5fffff80004809(locked|uptodate|private|head)
[   58.897148] page dumped because: VM_BUG_ON_PAGE(PageCgroupUsed(pc))
[   58.899325] pc:ffff88007ebbe000 pc->flags:7 pc->mem_cgroup:ffff8800be59a800
[   58.900359] ------------[ cut here ]------------
[   58.901016] kernel BUG at /src/linux-dev/mm/memcontrol.c:2707!
[   58.901331] invalid opcode: 0000 [#1] SMP
[   58.901331] Modules linked in: stap_6484a34ef9f0ebb4400874c66d0905ac__1496(O) bnep bluetooth ip6t_rpfilter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 cfg80211 xt_conntrack rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ppdev microcode parport_pc serio_raw parport virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc virtio_blk virtio_net ata_generic pata_acpi floppy
[   58.901331] CPU: 1 PID: 1918 Comm: mbind_fuzz Tainted: G    B      O  3.15.0-140715-1512-00017-gf1ab1502aa49 #264
[   58.901331] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   58.901331] task: ffff8800bd763b20 ti: ffff8800bd750000 task.ti: ffff8800bd750000
[   58.901331] RIP: 0010:[<ffffffff811fee3b>]  [<ffffffff811fee3b>] commit_charge+0x28b/0x2b0
[   58.901331] RSP: 0000:ffff8800bd753c38  EFLAGS: 00010296
[   58.901331] RAX: 000000000000003f RBX: ffffea0001cf8000 RCX: 0000000000000000
[   58.901331] RDX: 0000000000000001 RSI: ffff88007ec0d318 RDI: ffff88007ec0d318
[   58.901331] RBP: ffff8800bd753c78 R08: 000000000000000a R09: 0000000000000000
[   58.901331] R10: 0000000000000000 R11: ffff8800bd75390e R12: ffff8800be59a800
[   58.901331] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88007ebbe000
[   58.901331] FS:  00007f9ce6fa0740(0000) GS:ffff88007ec00000(0000) knlGS:0000000000000000
[   58.901331] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.901331] CR2: 0000700004600000 CR3: 000000007c194000 CR4: 00000000000006e0
[   58.901331] Stack:
[   58.901331]  ffff8800be59a800 ffffea0001cf8000 000002003d481290 ffffea0001cf8000
[   58.901331]  ffff88003d481278 0000000000000000 ffff88003d481290 00000000000000d0
[   58.901331]  ffff8800bd753c90 ffffffff812020fc ffffea0001cf8000 ffff8800bd753cd8
[   58.901331] Call Trace:
[   58.901331]  [<ffffffff812020fc>] mem_cgroup_commit_charge+0x6c/0xf0
[   58.901331]  [<ffffffff81196c8c>] __add_to_page_cache_locked+0xec/0x1e0
[   58.901331]  [<ffffffff81196d91>] add_to_page_cache_locked+0x11/0x20
[   58.901331]  [<ffffffff811df425>] hugetlb_no_page+0x105/0x3b0
[   58.901331]  [<ffffffff8138f799>] ? __rb_insert_augmented+0xf9/0x1e0
[   58.901331]  [<ffffffff811e02f4>] hugetlb_fault+0x2c4/0x3c0
[   58.901331]  [<ffffffff811bd184>] ? vma_interval_tree_insert+0x84/0x90
[   58.901331]  [<ffffffff811c5d93>] __handle_mm_fault+0x303/0x340
[   58.901331]  [<ffffffff811c5e5f>] handle_mm_fault+0x8f/0x130
[   58.901331]  [<ffffffff8173d3f6>] __do_page_fault+0x176/0x520
[   58.901331]  [<ffffffff8132d993>] ? file_map_prot_check+0x63/0xd0
[   58.901331]  [<ffffffff811b46a9>] ? vm_mmap_pgoff+0x99/0xc0
[   58.901331]  [<ffffffff8173d7ac>] do_page_fault+0xc/0x10
[   58.901331]  [<ffffffff8173a122>] page_fault+0x22/0x30
[   58.901331] Code: 13 45 19 c0 41 83 e0 02 48 c1 ea 06 83 e2 01 48 83 fa 01 41 83 d8 ff e9 30 ff ff ff 48 c7 c6 20 d0 a8 81 48 89 df e8 55 fb f9 ff <0f> 0b 48 c7 c6 f3 e2 a8 81 48 89 df e8 44 fb f9 ff 0f 0b 48 c7
[   58.901331] RIP  [<ffffffff811fee3b>] commit_charge+0x28b/0x2b0
[   58.901331]  RSP <ffff8800bd753c38>
[   58.944251] ---[ end trace 2f1aecd49dae161f ]---

I think these two messages have the same cause and just show up differently.
__add_to_page_cache_locked() (and with it mem_cgroup_try_charge()) can be
called for hugetlb pages, while we now avoid calling mem_cgroup_migrate() and
mem_cgroup_uncharge() for them. This seems to leave the page_cgroup of the
hugepage in an inconsistent state, which results in the bad page bug
("page dumped because: cgroup check failed").
So maybe some more PageHuge() checks are necessary around the charging code.
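
To illustrate the imbalance I have in mind, here is a small userspace model.
It is only a sketch and not kernel code: struct toy_page and all toy_*
functions are made up for illustration, and the "charged" flag merely stands
in for PageCgroupUsed(pc). It assumes the charge side has no hugetlb check
unless check_huge is set, while the uncharge side always skips hugetlb pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page plus its page_cgroup (hypothetical layout). */
struct toy_page {
	bool huge;	/* models PageHuge(page) */
	bool charged;	/* models PageCgroupUsed(pc) */
};

/*
 * Models the charge done from __add_to_page_cache_locked().
 * With check_huge set, hugetlb pages are skipped (the suggested extra check).
 * Returns false if the page is already charged, which corresponds to the
 * VM_BUG_ON_PAGE(PageCgroupUsed(pc)) case in the second trace.
 */
static bool toy_charge(struct toy_page *page, bool check_huge)
{
	if (check_huge && page->huge)
		return true;
	if (page->charged)
		return false;
	page->charged = true;
	return true;
}

/* Models the uncharge on the release path, which skips hugetlb pages. */
static void toy_uncharge(struct toy_page *page)
{
	if (page->huge)
		return;
	page->charged = false;
}

/* Models the "cgroup check" at free time: a freed page must not be charged. */
static bool toy_free_ok(const struct toy_page *page)
{
	return !page->charged;
}

/* One charge/uncharge/free cycle; true means the page is clean when freed. */
static bool toy_lifecycle(bool huge, bool check_huge)
{
	struct toy_page page = { .huge = huge, .charged = false };
	bool ok = toy_charge(&page, check_huge);

	assert(ok);	/* a fresh page is never already charged */
	(void)ok;
	toy_uncharge(&page);
	return toy_free_ok(&page);
}
```

In this model toy_lifecycle(true, false) leaves the hugepage charged at free
time, which matches the first message, and charging that same page again
would fail, which matches the second one. With check_huge set, both paths
skip the hugepage and the state stays consistent.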

Thanks,
Naoya Horiguchi
