> On Jan 21, 2020, at 5:35 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > On Mon, Jan 20, 2020 at 03:35:09PM -0500, Qian Cai wrote: >>> On Jan 20, 2020, at 5:17 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: >>> >>> Bah.. that's horrible. Surely we can find a better place to do this in >>> the whole hotplug machinery. >>> >>> Perhaps you can have takedown_cpu() do the mmdrop()? >> >> The problem is that no all arch_cpu_idle_dead() will call >> idle_task_exit(). For example, alpha and parisc are not, so it needs > > How is that not broken? If the idle thread runs on an active_mm, we need > to drop that reference. This needs fixing regardless. It turns out alpha does not support cpu hotplug (no CONFIG_HOTPLUG_CPU), and parisc will use &init_mm for idle tasks of secondary CPUs as in, smp_callin() smp_cpu_init() /* Initialise the idle task for this CPU */ mmgrab(&init_mm); current->active_mm = &init_mm; BUG_ON(current->mm); enter_lazy_tlb(&init_mm, current); > >> to deal with some kind of ifdef dance in takedown_cpu() to >> conditionally call mmdrop() which sounds even more horrible? >> >> If you really prefer it anyway, maybe something like touching every arch’s __cpu_die() to also call mmdrop() depends on arches? >> >> BTW, how to obtain the other CPU’s current task mm? Is that cpu_rq(cpu)->curr->active_mm? > > Something like this; except you'll need to go audit archs to make sure > they all call idle_task_exit() and/or put in comments on why they don't > have to (perhaps their bringup switches them to &init_mm unconditionally > and the switch_mm() is not required). I have to modify your code slightly (below) to pass compilation, but then it will panic during cpu offline for some reasons, [ 258.972625][ T9006] Faulting instruction address: 0xc00000000010afc4 [ 258.972640][ T9006] Oops: Kernel access of bad area, sig: 11 [#1] [ 258.972651][ T9006] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 DEBUG_PAGEALLOC NUMA PowerNV [ 258.972675][ T9006] Modules linked in: kvm_hv kvm ip_tables x_tables xfs sd_mod bnx2x ahci mdio libahci tg3 libata libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod [ 258.972716][ T9006] CPU: 84 PID: 9006 Comm: cpuhotplug04.sh Not tainted 5.6.0-rc7-next-20200327+ #7 [ 258.972741][ T9006] NIP: c00000000010afc4 LR: c00000000010afa0 CTR: fffffffffffffffd [ 258.972775][ T9006] REGS: c000201a54def7f0 TRAP: 0300 Not tainted (5.6.0-rc7-next-20200327+) [ 258.972809][ T9006] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28224824 XER: 20040000 [ 258.972839][ T9006] CFAR: c00000000015951c DAR: 0000000000000050 DSISR: 40000000 IRQMASK: 0 [ 258.972839][ T9006] GPR00: c00000000010afa0 c000201a54defa80 c00000000165bd00 c000000001604180 [ 258.972839][ T9006] GPR04: c00000008160496f 0000000000000000 ffff0a01ffffff10 0000000000000020 [ 258.972839][ T9006] GPR08: 0000000000000050 0000000000000000 c00000000162c0e0 0000000000000002 [ 258.972839][ T9006] GPR12: 0000000088224428 c000201fff69a900 0000000040000000 000000011a029798 [ 258.972839][ T9006] GPR16: 000000011a029724 0000000119fc6968 0000000119f5f230 000000011a02d568 [ 258.972839][ T9006] GPR20: 00000001554c66f0 0000000000000000 c000001ffc957820 c00000000010af80 [ 258.972839][ T9006] GPR24: 0000001ffb8a0000 c0000000014f34a8 0000000000000056 c0000000010b7820 [ 258.972839][ T9006] GPR28: c00000000168c654 0000000000000000 c00000000168c3a8 0000000000000000 [ 258.973070][ T9006] NIP [c00000000010afc4] finish_cpu+0x44/0x90 atomic_dec_return_relaxed at arch/powerpc/include/asm/atomic.h:179 (inlined by) atomic_dec_return at include/linux/atomic-fallback.h:518 (inlined by) atomic_dec_and_test at include/linux/atomic-fallback.h:1035 (inlined by) mmdrop at include/linux/sched/mm.h:48 (inlined by) finish_cpu at kernel/cpu.c:579 [ 258.973092][ T9006] LR [c00000000010afa0] finish_cpu+0x20/0x90 [ 258.973113][ T9006] Call Trace: [ 258.973132][ T9006] [c000201a54defa80] [c00000000010afa0] finish_cpu+0x20/0x90 (unreliable) [ 258.973166][ T9006] [c000201a54defaa0] [c00000000010cd34] cpuhp_invoke_callback+0x194/0x15f0 cpuhp_invoke_callback at kernel/cpu.c:173 [ 258.973202][ T9006] [c000201a54defb50] [c000000000111acc] _cpu_down.constprop.15+0x17c/0x410 [ 258.973237][ T9006] [c000201a54defbc0] [c00000000010ff84] cpu_down+0x64/0xb0 [ 258.973263][ T9006] [c000201a54defc00] [c000000000754760] cpu_subsys_offline+0x20/0x40 [ 258.973299][ T9006] [c000201a54defc20] [c00000000074ab10] device_offline+0x100/0x140 [ 258.973344][ T9006] [c000201a54defc60] [c00000000074ace0] online_store+0xa0/0x110 [ 258.973378][ T9006] [c000201a54defcb0] [c000000000744508] dev_attr_store+0x38/0x60 [ 258.973414][ T9006] [c000201a54defcd0] [c0000000005df770] sysfs_kf_write+0x70/0xb0 [ 258.973438][ T9006] [c000201a54defd10] [c0000000005de92c] kernfs_fop_write+0x11c/0x270 [ 258.973463][ T9006] [c000201a54defd60] [c0000000004c09bc] __vfs_write+0x3c/0x70 [ 258.973487][ T9006] [c000201a54defd80] [c0000000004c3dbc] vfs_write+0xcc/0x200 [ 258.973522][ T9006] [c000201a54defdd0] [c0000000004c415c] ksys_write+0x7c/0x140 [ 258.973557][ T9006] [c000201a54defe20] [c00000000000b3f8] system_call+0x5c/0x68 [ 258.973579][ T9006] Instruction dump: [ 258.973598][ T9006] f8010010 f821ffe1 4804e4ed 60000000 3d42fffd 394a03e0 e9230560 7fa95000 [ 258.973636][ T9006] 419e0030 f9430560 39090050 7c0004ac <7d404028> 314affff 7d40412d 40c2fff4 [ 258.973675][ T9006] ---[ end trace 4f72c711066ac397 ]--- [ 259.076271][ T9006] [ 260.076362][ T9006] Kernel panic - not syncing: Fatal exception diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index c49257a3b510..bc782562ae93 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -49,6 +49,8 @@ static inline void mmdrop(struct mm_struct *mm) __mmdrop(mm); } +extern void mmdrop(struct mm_struct *mm); + /* * This has to be called after a get_task_mm()/mmget_not_zero() * followed by taking the mmap_sem for writing before modifying the diff --git a/kernel/cpu.c b/kernel/cpu.c index 0cca1a2b4930..302007a602a5 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -3,6 +3,7 @@ * * This code is licenced under the GPL. */ +#include <linux/sched/mm.h> #include <linux/proc_fs.h> #include <linux/smp.h> #include <linux/init.h> @@ -564,6 +565,22 @@ static int bringup_cpu(unsigned int cpu) return bringup_wait_for_ap(cpu); } +static int finish_cpu(unsigned int cpu) +{ + struct task_struct *idle = idle_thread_get(cpu); + struct mm_struct *mm = idle->active_mm; + + /* + * idle_task_exit() will have switched to &init_mm, now + * clean up any remaining active_mm state. + */ + if (mm != &init_mm) { + idle->active_mm = &init_mm; + mmdrop(mm); + } + return 0; +} + /* * Hotplug state machine related functions */ @@ -1549,7 +1566,7 @@ static struct cpuhp_step cpuhp_hp_states[] = { [CPUHP_BRINGUP_CPU] = { .name = "cpu:bringup", .startup.single = bringup_cpu, - .teardown.single = NULL, + .teardown.single = finish_cpu, .cant_stop = true, }, /* Final state before CPU kills itself */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ac8b65298774..16aad39216a2 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6248,13 +6248,14 @@ void idle_task_exit(void) struct mm_struct *mm = current->active_mm; BUG_ON(cpu_online(smp_processor_id())); + BUG_ON(current != this_rq()->idle); if (mm != &init_mm) { switch_mm(mm, &init_mm, current); - current->active_mm = &init_mm; finish_arch_post_lock_switch(); } - mmdrop(mm); + + /* finish_cpu(), as ran on the BP, will clean up the active_mm state */ } /* > > --- > diff --git a/kernel/cpu.c b/kernel/cpu.c > index 9c706af713fb..2b4d8a69e8d9 100644 > --- a/kernel/cpu.c > +++ b/kernel/cpu.c > @@ -564,6 +564,23 @@ static int bringup_cpu(unsigned int cpu) > return bringup_wait_for_ap(cpu); > } > > +static int finish_cpu(unsigned int cpu) > +{ > + struct task_struct *idle = idle_thread_get(cpu); > + struct mm_struct *mm = idle->active_mm; > + > + /* > + * idle_task_exit() will have switched to &init_mm, now > + * clean up any remaining active_mm state. > + */ > + > + if (mm == &init_mm) > + return; > + > + idle->active_mm = &init_mm; > + mmdrop(mm); > +} > + > /* > * Hotplug state machine related functions > */ > @@ -1434,7 +1451,7 @@ static struct cpuhp_step cpuhp_hp_states[] = { > [CPUHP_BRINGUP_CPU] = { > .name = "cpu:bringup", > .startup.single = bringup_cpu, > - .teardown.single = NULL, > + .teardown.single = finish_cpu, > .cant_stop = true, > }, > /* Final state before CPU kills itself */ > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index fc1dfc007604..8f049fb77a3d 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -6188,13 +6188,14 @@ void idle_task_exit(void) > struct mm_struct *mm = current->active_mm; > > BUG_ON(cpu_online(smp_processor_id())); > + BUG_ON(current != this_rq()->idle); > > if (mm != &init_mm) { > switch_mm(mm, &init_mm, current); > - current->active_mm = &init_mm; > finish_arch_post_lock_switch(); > } > - mmdrop(mm); > + > + /* finish_cpu(), as ran on the BP, will clean up the active_mm state */ > } > > /*