On Tue, Mar 10, 2020 at 10:48:36AM +0100, Michal Hocko wrote: > [Cc Kirill, I didn't realize he has implemented this code] My first non-trivial mm contribution :P > On Fri 06-03-20 09:02:02, brookxu wrote: > > From: Chunguang Xu <brookxu@xxxxxxxxxxx> > > > > An eventfd monitors multiple memory thresholds of the cgroup, closes them, > > the kernel deletes all events related to this eventfd. Before all events > > are deleted, another eventfd monitors the memory threshold of this cgroup, > > leading to a crash: > > > > [ 135.675108] BUG: kernel NULL pointer dereference, address: 0000000000000004 > > [ 135.675350] #PF: supervisor write access in kernel mode > > [ 135.675579] #PF: error_code(0x0002) - not-present page > > [ 135.675816] PGD 800000033058e067 P4D 800000033058e067 PUD 3355ce067 PMD 0 > > [ 135.676080] Oops: 0002 [#1] SMP PTI > > [ 135.676332] CPU: 2 PID: 14012 Comm: kworker/2:6 Kdump: loaded Not tainted 5.6.0-rc4 #3 > > [ 135.676610] Hardware name: LENOVO 20AWS01K00/20AWS01K00, BIOS GLET70WW (2.24 ) 05/21/2014 > > [ 135.676909] Workqueue: events memcg_event_remove > > [ 135.677192] RIP: 0010:__mem_cgroup_usage_unregister_event+0xb3/0x190 > > [ 135.677825] RSP: 0018:ffffb47e01c4fe18 EFLAGS: 00010202 > > [ 135.678186] RAX: 0000000000000001 RBX: ffff8bb223a8a000 RCX: 0000000000000001 > > [ 135.678548] RDX: 0000000000000001 RSI: ffff8bb22fb83540 RDI: 0000000000000001 > > [ 135.678912] RBP: ffffb47e01c4fe48 R08: 0000000000000000 R09: 0000000000000010 > > [ 135.679287] R10: 000000000000000c R11: 071c71c71c71c71c R12: ffff8bb226aba880 > > [ 135.679670] R13: ffff8bb223a8a480 R14: 0000000000000000 R15: 0000000000000000 > > [ 135.680066] FS: 0000000000000000(0000) GS:ffff8bb242680000(0000) knlGS:0000000000000000 > > [ 135.680475] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 135.680894] CR2: 0000000000000004 CR3: 000000032c29c003 CR4: 00000000001606e0 > > [ 135.681325] Call Trace: > > [ 135.681763] memcg_event_remove+0x32/0x90 > > [ 135.682209] process_one_work+0x172/0x380 > > [ 135.682657] worker_thread+0x49/0x3f0 > > [ 135.683111] kthread+0xf8/0x130 > > [ 135.683570] ? max_active_store+0x80/0x80 > > [ 135.684034] ? kthread_bind+0x10/0x10 > > [ 135.684506] ret_from_fork+0x35/0x40 > > [ 135.689733] CR2: 0000000000000004 > > > > We can reproduce this problem in the following ways: > > > > 1. We create a new cgroup subdirectory and a new eventfd, and then we > > monitor multiple memory thresholds of the cgroup through this eventfd. > > 2. closing this eventfd, and __mem_cgroup_usage_unregister_event () will be > > called multiple times to delete all events related to this eventfd. > > > > The first time __mem_cgroup_usage_unregister_event() is called, the kernel > > will clear all items related to this eventfd in thresholds-> primary.Since > > there is currently only one eventfd, thresholds-> primary becomes empty, > > so the kernel will set thresholds-> primary and hresholds-> spare to NULL. ^ typo > > If at this time, the user creates a new eventfd and monitor the memory > > threshold of this cgroup, kernel will re-initialize thresholds-> primary. > > Then when __mem_cgroup_usage_unregister_event () is called for the second > > time, because thresholds-> primary is not empty, the system will access > > thresholds-> spare, but thresholds-> spare is NULL, which will trigger a > > crash. > > > > In general, the longer it takes to delete all events related to this > > eventfd, the easier it is to trigger this problem. > > > > The solution is to check whether the thresholds associated with the eventfd > > has been cleared when deleting the event. If so, we do nothing. > > > > Signed-off-by: Chunguang Xu <brookxu@xxxxxxxxxxx> > > The fix looks reasonable to me > Acked-by: Michal Hocko <mhocko@xxxxxxxx> Agreed. Two typos have to be addressed. Acked-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx> > It seems that the code has been broken since 2c488db27b61 ("memcg: clean > up memory thresholds"). We've had 371528caec55 ("mm: memcg: Correct > unregistring of events attached to the same eventfd") but it didn't > catch this case for some reason. Unless I am missing something the code > was broken back then already. Kirill please double check after me. I think the issue exitsted before 2c488db27b61. The fields had different names back then. The logic to make unregister never-fail is added in 907860ed381a ("cgroups: make cftype.unregister_event() void-returning"). I believe the Fixes should point there. > > So if I am not wrong then we want > Fixes: 2c488db27b61 ("memcg: clean up memory thresholds") > Cc: stable > > sounds appropriate because this seems to be user trigerable. > > Thanks for preparing the patch! > > Btw. you should double check your email sender because it seemed to > whitespace damaged the patch (\t -> spaces). Please use git send-email > instead. > > > --- > > mm/memcontrol.c | 10 ++++++++-- > > 1 file changed, 8 insertions(+), 2 deletions(-) > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index d09776c..4575a58 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -4027,7 +4027,7 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg, > > struct mem_cgroup_thresholds *thresholds; > > struct mem_cgroup_threshold_ary *new; > > unsigned long usage; > > - int i, j, size; > > + int i, j, size, entries; > > > > mutex_lock(&memcg->thresholds_lock); > > > > @@ -4047,12 +4047,18 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg, > > __mem_cgroup_threshold(memcg, type == _MEMSWAP); > > > > /* Calculate new number of threshold */ > > - size = 0; > > + size = entries = 0; > > for (i = 0; i < thresholds->primary->size; i++) { > > if (thresholds->primary->entries[i].eventfd != eventfd) > > size++; > > + else > > + entries++; > > } > > > > + /* If items related to eventfd have been cleared, nothing to do */ ^ "no items" ? > > + if (!entries) > > + goto unlock; > > + > > new = thresholds->spare; > > > > /* Set thresholds array to NULL if we don't have thresholds */ > > -- > > 1.8.3.1 > > -- > Michal Hocko > SUSE Labs -- Kirill A. Shutemov