On 7/30/19 4:43 AM, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 05:07:28PM -0400, Waiman Long wrote:
>> It was found that a dying mm_struct whose owning task has exited
>> can stay on as the active_mm of kernel threads as long as no other
>> user tasks run on the CPUs that use it as active_mm. This prolongs
>> the lifetime of the dying mm, holding up resources that cannot be
>> freed on a mostly idle system.
>>
>> Fix that by forcing kernel threads to use init_mm as the active_mm
>> during a kernel thread to kernel thread transition if the previous
>> active_mm is dying (!mm_users). This allows the resources associated
>> with the dying mm to be freed ASAP.
>>
>> The presence of a kernel-to-kernel thread transition indicates that
>> the CPU is probably idling with no higher priority user task to run.
>> So the overhead of loading the mm_users cacheline should not really
>> matter in this case.
>>
>> My testing on an x86 system showed that the mm_struct was freed
>> within seconds after the task exited, instead of staying alive for
>> minutes or even longer on a mostly idle system before this patch.
>>
>> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
>> ---
>>  kernel/sched/core.c | 21 +++++++++++++++++++--
>>  1 file changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 795077af4f1a..41997e676251 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3214,6 +3214,8 @@ static __always_inline struct rq *
>>  context_switch(struct rq *rq, struct task_struct *prev,
>>  	       struct task_struct *next, struct rq_flags *rf)
>>  {
>> +	struct mm_struct *next_mm = next->mm;
>> +
>>  	prepare_task_switch(rq, prev, next);
>>
>>  	/*
>> @@ -3229,8 +3231,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>  	 *
>>  	 * kernel -> user	switch + mmdrop() active
>>  	 *   user -> user	switch
>> +	 *
>> +	 * kernel -> kernel and !prev->active_mm->mm_users:
>> +	 *   switch to init_mm + mmgrab() + mmdrop()
>>  	 */
>> -	if (!next->mm) {				// to kernel
>> +	if (!next_mm) {					// to kernel
>> +		/*
>> +		 * Checking is only done on kernel -> kernel transition
>> +		 * to avoid any performance overhead while user tasks
>> +		 * are running.
>> +		 */
>> +		if (unlikely(!prev->mm &&
>> +		    !atomic_read(&prev->active_mm->mm_users))) {
>> +			next_mm = next->active_mm = &init_mm;
>> +			mmgrab(next_mm);
>> +			goto mm_switch;
>> +		}
>>  		enter_lazy_tlb(prev->active_mm, next);
>>
>>  		next->active_mm = prev->active_mm;
>
> So I _really_ hate this complication. I'm thinking that if you really
> care about this, the time is much better spent getting rid of the
> active_mm tracking for x86 entirely.

That is fine. I won't pursue this further. I will take a look at your
suggestion when I have time, but it will probably be a while :-)

Cheers,
Longman
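
For context, the check in the patch hinges on the two reference counts
kept in struct mm_struct: mm_users (taken by mmget()/mmput()) counts
user-space users of the address space, while mm_count (taken by
mmgrab()/mmdrop()) counts lazy references such as a kernel thread's
active_mm. A dying mm has mm_users == 0 but stays allocated as long as
mm_count > 0. Below is a minimal sketch of that pattern, not code from
the patch; the helper names are hypothetical and only illustrate the
refcounting involved (the actual patch defers releasing the old
active_mm to the normal context-switch teardown path):

#include <linux/mm_types.h>
#include <linux/sched.h>
#include <linux/sched/mm.h>

/*
 * Hypothetical helper: an mm is "dying" once no user-space users
 * remain; only lazy (active_mm) references, counted in mm_count,
 * keep the mm_struct allocated.
 */
static bool mm_is_dying(struct mm_struct *mm)
{
	return atomic_read(&mm->mm_users) == 0;
}

/*
 * Hypothetical helper: repoint a kernel thread's active_mm at init_mm.
 * init_mm is pinned with mmgrab() before the switch; dropping the
 * returned old mm with mmdrop(), once it is safe to do so, is what
 * finally lets a dying mm_struct be freed.
 */
static struct mm_struct *lazy_repoint_to_init_mm(struct task_struct *tsk)
{
	struct mm_struct *old = tsk->active_mm;

	mmgrab(&init_mm);
	tsk->active_mm = &init_mm;
	return old;
}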