On 7/29/19 5:21 PM, Rik van Riel wrote:
> On Mon, 2019-07-29 at 17:07 -0400, Waiman Long wrote:
>> It was found that a dying mm_struct where the owning task has exited
>> can stay on as active_mm of kernel threads as long as no other user
>> tasks run on those CPUs that use it as active_mm. This prolongs the
>> life time of dying mm holding up some resources that cannot be freed
>> on a mostly idle system.
> On what kernels does this happen?
>
> Don't we explicitly flush all lazy TLB CPUs at exit
> time, when we are about to free page tables?

There are still a few calls that will not be done until mm_count
reaches 0:

 - mm_free_pgd(mm);
 - destroy_context(mm);
 - mmu_notifier_mm_destroy(mm);
 - check_mm(mm);
 - put_user_ns(mm->user_ns);

These are not big items, but holding them off for a long time is still
not a good thing.

> Does this happen only on the CPU where the task in
> question is exiting, or also on other CPUs?

What I have found is that a long-running process on a mostly idle
system with many CPUs is likely to cycle through a lot of the CPUs
during its lifetime and leave behind its mm in the active_mm of those
CPUs. My 2-socket test system has 96 logical CPUs. After running the
test program for a minute or so, it leaves behind its mm in about half
of the CPUs, with a mm_count of 45 after exit. So the dying mm will stay
until all those 45 CPUs get new user tasks to run.

>
> If it is only on the CPU where the task is exiting,
> would the TASK_DEAD handling in finish_task_switch()
> be a better place to handle this?

I need to switch the mm off the dying one. mm switching is only done in
context_switch(). I don't think finish_task_switch() is the right place.

-Longman
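
For illustration only, the reference counting described above can be sketched as a toy user-space model. This is not kernel code: every toy_* name is made up for the example, and the model assumes a simplified rule in which each CPU running a kernel thread holds exactly one mm_count reference on the mm it borrowed, released only when a user task next runs on that CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 96	/* logical CPUs on the test system mentioned above */

/* Toy stand-in for mm_struct; only the two counters matter here. */
struct toy_mm {
	int mm_users;	/* user-space users of the address space */
	int mm_count;	/* structure references; one is held for all mm_users */
	bool freed;	/* set where the final teardown would run */
};

/* Borrowed (lazy) mm per CPU; NULL means nothing is borrowed. */
static struct toy_mm *cpu_active_mm[NR_CPUS];

static void toy_mmgrab(struct toy_mm *mm)
{
	mm->mm_count++;
}

static void toy_mmdrop(struct toy_mm *mm)
{
	if (--mm->mm_count == 0)
		mm->freed = true;	/* the deferred calls would happen here */
}

/* A kernel thread follows the user task on this CPU and borrows its mm. */
static void toy_switch_to_kthread(int cpu, struct toy_mm *prev_mm)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (cpu_active_mm[cpu] == prev_mm)
		return;			/* already borrowed on this CPU */
	if (cpu_active_mm[cpu])
		toy_mmdrop(cpu_active_mm[cpu]);
	toy_mmgrab(prev_mm);
	cpu_active_mm[cpu] = prev_mm;
}

/* Some other user task runs on this CPU; the borrowed mm is released. */
static void toy_switch_to_user(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (cpu_active_mm[cpu]) {
		toy_mmdrop(cpu_active_mm[cpu]);
		cpu_active_mm[cpu] = NULL;
	}
}

/* The owning task exits: the last mm_users reference goes away. */
static void toy_exit_mm(struct toy_mm *mm)
{
	if (--mm->mm_users == 0)
		toy_mmdrop(mm);		/* drop the reference held for mm_users */
}

/*
 * The task leaves its mm behind on `visited` CPUs and exits, then
 * `rescheduled` of those CPUs run a new user task.  Returns the
 * remaining mm_count and reports whether the mm was freed.
 */
int toy_remaining_refs(int visited, int rescheduled, bool *freed)
{
	struct toy_mm mm = { .mm_users = 1, .mm_count = 1, .freed = false };
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		cpu_active_mm[cpu] = NULL;
	for (cpu = 0; cpu < visited; cpu++)
		toy_switch_to_kthread(cpu, &mm);
	toy_exit_mm(&mm);
	for (cpu = 0; cpu < rescheduled; cpu++)
		toy_switch_to_user(cpu);
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		cpu_active_mm[cpu] = NULL;	/* mm lives on this stack frame */
	if (freed)
		*freed = mm.freed;
	return mm.mm_count;
}
```

Under these assumptions, toy_remaining_refs(45, 0, ...) leaves the count at 45 with nothing freed, and the mm is released only once all 45 CPUs have run a new user task.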