On Fri, Jan 29, 2016 at 06:19:31PM -0500, Johannes Weiner wrote: > Changing a page's memcg association complicates dealing with the page, > so we want to limit this as much as possible. Page migration e.g. does > not have to do that. Just like page cache replacement, it can forcibly > charge a replacement page, and then uncharge the old page when it gets > freed. Temporarily overcharging the cgroup by a single page is not an > issue in practice, and charging is so cheap nowadays that this is much > preferrable to the headache of messing with live pages. > > The only place that still changes the page->mem_cgroup binding of live > pages is when pages move along with a task to another cgroup. But that > path isolates the page from the LRU, takes the page lock, and the move > lock (lock_page_memcg()). That means page->mem_cgroup is always stable > in callers that have the page isolated from the LRU or locked. Lighter > unlocked paths, like writeback accounting, can use lock_page_memcg(). > > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx> [..] > @@ -372,12 +373,13 @@ int migrate_page_move_mapping(struct address_space *mapping, > * Now we know that no one else is looking at the page: > * no turning back from here. > */ > - set_page_memcg(newpage, page_memcg(page)); > newpage->index = page->index; > newpage->mapping = page->mapping; > if (PageSwapBacked(page)) > SetPageSwapBacked(newpage); > > + mem_cgroup_migrate(page, newpage); > + > get_page(newpage); /* add cache reference */ > if (PageSwapCache(page)) { > SetPageSwapCache(newpage); > @@ -457,9 +459,11 @@ int migrate_huge_page_move_mapping(struct address_space *mapping, > return -EAGAIN; > } > > - set_page_memcg(newpage, page_memcg(page)); > newpage->index = page->index; > newpage->mapping = page->mapping; > + > + mem_cgroup_migrate(page, newpage); > + > get_page(newpage); > > radix_tree_replace_slot(pslot, newpage); I ran trinity on recent linux-next and got the lockdep splat below and if I read it right, this is the culprit. In particular, mem_cgroup_migrate was put in an area covered by spin_lock_irq(&mapping->tree_lock), but stuff it calls enables and disables interrupts on its own. [ 105.084225] ================================= [ 105.096026] [ INFO: inconsistent lock state ] [ 105.107896] 4.5.0-rc2-next-20160203dupa+ #121 Not tainted [ 105.119904] --------------------------------- [ 105.131667] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. [ 105.144052] trinity-c3/1600 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 105.155944] (&(&mapping->tree_lock)->rlock){?.-.-.}, at: [<ffffffff8121d9b1>] migrate_page_move_mapping+0x71/0x4f0 [ 105.179744] {IN-HARDIRQ-W} state was registered at: [ 105.191556] [<ffffffff810d56ef>] __lock_acquire+0x79f/0xbf0 [ 105.203521] [<ffffffff810d632a>] lock_acquire+0xca/0x1c0 [ 105.215447] [<ffffffff819f72b5>] _raw_spin_lock_irqsave+0x55/0x90 [ 105.227537] [<ffffffff811b8307>] test_clear_page_writeback+0x67/0x2a0 [ 105.239612] [<ffffffff811a743f>] end_page_writeback+0x1f/0xa0 [ 105.251622] [<ffffffff81284816>] end_buffer_async_write+0xd6/0x1b0 [ 105.263649] [<ffffffff81283828>] end_bio_bh_io_sync+0x28/0x40 [ 105.276160] [<ffffffff81473900>] bio_endio+0x40/0x60 [ 105.288599] [<ffffffff817373ed>] dec_pending+0x15d/0x320 [ 105.301083] [<ffffffff81737e2b>] clone_endio+0x5b/0xe0 [ 105.313418] [<ffffffff81473900>] bio_endio+0x40/0x60 [ 105.325763] [<ffffffff8147bbf2>] blk_update_request+0xb2/0x3b0 [ 105.338236] [<ffffffff814867aa>] blk_mq_end_request+0x1a/0x70 [ 105.350626] [<ffffffff8164bcff>] virtblk_request_done+0x3f/0x70 [ 105.363303] [<ffffffff814852d3>] __blk_mq_complete_request_remote+0x13/0x20 [ 105.387289] [<ffffffff81116f1b>] flush_smp_call_function_queue+0x7b/0x150 [ 105.399903] [<ffffffff81117c23>] generic_smp_call_function_single_interrupt+0x13/0x60 [ 105.423568] [<ffffffff81041697>] smp_call_function_single_interrupt+0x27/0x40 [ 105.446971] [<ffffffff819f88e6>] call_function_single_interrupt+0x96/0xa0 [ 105.459109] [<ffffffff8121510e>] kmem_cache_alloc+0x27e/0x2e0 [ 105.471141] [<ffffffff811ab8dc>] mempool_alloc_slab+0x1c/0x20 [ 105.483200] [<ffffffff811abe29>] mempool_alloc+0x79/0x1b0 [ 105.495195] [<ffffffff814721b6>] bio_alloc_bioset+0x146/0x220 [ 105.507268] [<ffffffff81739103>] __split_and_process_bio+0x253/0x4f0 [ 105.519505] [<ffffffff8173966a>] dm_make_request+0x7a/0x110 [ 105.531563] [<ffffffff8147b3b6>] generic_make_request+0x166/0x2c0 [ 105.544062] [<ffffffff8147b587>] submit_bio+0x77/0x150 [ 105.556195] [<ffffffff81285dff>] submit_bh_wbc+0x12f/0x160 [ 105.568461] [<ffffffff81288228>] __block_write_full_page.constprop.41+0x138/0x3b0 [ 105.591728] [<ffffffff8128858c>] block_write_full_page+0xec/0x110 [ 105.603869] [<ffffffff81289088>] blkdev_writepage+0x18/0x20 [ 105.616272] [<ffffffff811b4796>] __writepage+0x16/0x50 [ 105.628206] [<ffffffff811b6e50>] write_cache_pages+0x2c0/0x620 [ 105.640298] [<ffffffff811b7204>] generic_writepages+0x54/0x80 [ 105.652244] [<ffffffff811b7f11>] do_writepages+0x21/0x40 [ 105.664204] [<ffffffff811a97f6>] __filemap_fdatawrite_range+0xc6/0x100 [ 105.676241] [<ffffffff811a988f>] filemap_write_and_wait+0x2f/0x60 [ 105.688339] [<ffffffff81289b8f>] __sync_blockdev+0x1f/0x40 [ 105.700350] [<ffffffff81289bc3>] sync_blockdev+0x13/0x20 [ 105.712352] [<ffffffff813430e9>] jbd2_journal_recover+0x119/0x130 [ 105.724532] [<ffffffff81348e90>] jbd2_journal_load+0xe0/0x390 [ 105.736533] [<ffffffff81406925>] ext4_load_journal+0x5ef/0x6b8 [ 105.748581] [<ffffffff813132d3>] ext4_fill_super+0x1ad3/0x2a10 [ 105.760597] [<ffffffff81249d9c>] mount_bdev+0x18c/0x1c0 [ 105.772574] [<ffffffff81302675>] ext4_mount+0x15/0x20 [ 105.784489] [<ffffffff8124a669>] mount_fs+0x39/0x170 [ 105.796406] [<ffffffff8126a59b>] vfs_kern_mount+0x6b/0x150 [ 105.808386] [<ffffffff8126d1fd>] do_mount+0x24d/0xed0 [ 105.820355] [<ffffffff8126e193>] SyS_mount+0x83/0xd0 [ 105.832317] [<ffffffff819f743c>] entry_SYSCALL_64_fastpath+0x1f/0xbd [ 105.844390] irq event stamp: 341784 [ 105.856114] hardirqs last enabled at (341783): [<ffffffff8104c6da>] flat_send_IPI_mask+0x8a/0xc0 [ 105.879644] hardirqs last disabled at (341784): [<ffffffff819f698f>] _raw_spin_lock_irq+0x1f/0x80 [ 105.903369] softirqs last enabled at (341744): [<ffffffff8107b933>] __do_softirq+0x343/0x480 [ 105.926856] softirqs last disabled at (341739): [<ffffffff8107bdc6>] irq_exit+0x106/0x120 [ 105.950499] [ 105.950499] other info that might help us debug this: [ 105.973803] Possible unsafe locking scenario: [ 105.973803] [ 105.997202] CPU0 [ 106.008754] ---- [ 106.020265] lock(&(&mapping->tree_lock)->rlock); [ 106.032196] <Interrupt> [ 106.043832] lock(&(&mapping->tree_lock)->rlock); [ 106.055891] [ 106.055891] *** DEADLOCK *** [ 106.055891] [ 106.090289] 2 locks held by trinity-c3/1600: [ 106.102049] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff8105aeef>] __do_page_fault+0x15f/0x470 [ 106.125631] #1: (&(&mapping->tree_lock)->rlock){?.-.-.}, at: [<ffffffff8121d9b1>] migrate_page_move_mapping+0x71/0x4f0 [ 106.149777] [ 106.149777] stack backtrace: [ 106.172646] CPU: 3 PID: 1600 Comm: trinity-c3 Not tainted 4.5.0-rc2-next-20160203dupa+ #121 [ 106.196137] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 106.208200] 0000000000000086 00000000cf42937f ffff88004bdef630 ffffffff814acb6d [ 106.231859] ffff88004e0e0000 ffffffff82f17fb0 ffff88004bdef680 ffffffff811a47e9 [ 106.255250] 0000000000000000 ffff880000000001 ffff880000000001 0000000000000000 [ 106.278647] Call Trace: [ 106.290201] [<ffffffff814acb6d>] dump_stack+0x85/0xc8 [ 106.302013] [<ffffffff811a47e9>] print_usage_bug+0x1eb/0x1fc [ 106.314144] [<ffffffff810d2f70>] ? print_shortest_lock_dependencies+0x1d0/0x1d0 [ 106.337708] [<ffffffff810d490d>] mark_lock+0x20d/0x290 [ 106.349493] [<ffffffff810d4a01>] mark_held_locks+0x71/0x90 [ 106.361462] [<ffffffff819f6afc>] ? _raw_spin_unlock_irq+0x2c/0x40 [ 106.373469] [<ffffffff810d4ac9>] trace_hardirqs_on_caller+0xa9/0x1c0 [ 106.385466] [<ffffffff810d4bed>] trace_hardirqs_on+0xd/0x10 [ 106.397421] [<ffffffff819f6afc>] _raw_spin_unlock_irq+0x2c/0x40 [ 106.409417] [<ffffffff8123171e>] commit_charge+0xbe/0x390 [ 106.421310] [<ffffffff81233f25>] mem_cgroup_migrate+0x135/0x360 [ 106.433352] [<ffffffff8121da72>] migrate_page_move_mapping+0x132/0x4f0 [ 106.445369] [<ffffffff8121e68b>] migrate_page+0x2b/0x50 [ 106.457301] [<ffffffff8121ea2a>] buffer_migrate_page+0x10a/0x150 [ 106.469260] [<ffffffff8121e743>] move_to_new_page+0x93/0x270 [ 106.481209] [<ffffffff811f0a07>] ? try_to_unmap+0xa7/0x170 [ 106.493094] [<ffffffff811ef170>] ? page_remove_rmap+0x2a0/0x2a0 [ 106.505052] [<ffffffff811edb40>] ? __hugepage_set_anon_rmap+0x80/0x80 [ 106.517130] [<ffffffff8121f2b6>] migrate_pages+0x846/0xac0 [ 106.528993] [<ffffffff811d6570>] ? __reset_isolation_suitable+0x120/0x120 [ 106.541061] [<ffffffff811d7c90>] ? isolate_freepages_block+0x4e0/0x4e0 [ 106.553068] [<ffffffff811d8f1d>] compact_zone+0x33d/0xa80 [ 106.565026] [<ffffffff811d96db>] compact_zone_order+0x7b/0xc0 [ 106.576944] [<ffffffff811d9a0a>] try_to_compact_pages+0x13a/0x2e0 [ 106.588948] [<ffffffff8123f829>] __alloc_pages_direct_compact+0x3b/0xf9 [ 106.600995] [<ffffffff8123fbcc>] __alloc_pages_slowpath.constprop.87+0x2e5/0x886 [ 106.624383] [<ffffffff811b3a66>] __alloc_pages_nodemask+0x456/0x460 [ 106.636380] [<ffffffff8120aa0b>] alloc_pages_vma+0x28b/0x2d0 [ 106.648440] [<ffffffff81226d9e>] do_huge_pmd_anonymous_page+0x13e/0x540 [ 106.660497] [<ffffffff811e46e4>] handle_mm_fault+0x7e4/0x980 [ 106.672585] [<ffffffff811e3f59>] ? handle_mm_fault+0x59/0x980 [ 106.684595] [<ffffffff8105af5d>] __do_page_fault+0x1cd/0x470 [ 106.696524] [<ffffffff8105b2ee>] trace_do_page_fault+0x6e/0x250 [ 106.708477] [<ffffffff81054c3a>] do_async_page_fault+0x1a/0xb0 [ 106.720407] [<ffffffff819f9488>] async_page_fault+0x28/0x30 -- Mateusz Guzik -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html