On 7/29/19 4:52 AM, Peter Zijlstra wrote:
> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
>> It was found that a dying mm_struct where the owning task has exited
>> can stay on as active_mm of kernel threads as long as no other user
>> tasks run on those CPUs that use it as active_mm. This prolongs the
>> life time of dying mm holding up memory and other resources like swap
>> space that cannot be freed.
> Sure, but this has been so 'forever', why is it a problem now?

I ran into this problem when running a test program that keeps on
allocating and touching memory; it eventually fails once the swap space
is full. After the failure, I could not rerun the test program because
the swap space remained full. I finally tracked it down to the fact
that the mm stayed on as the active_mm of kernel threads. I had to make
sure that all the idle CPUs got a user task to run to bump the dying mm
off the active_mm of those CPUs, but that is just a workaround, not a
solution to this problem. (Rough sketches of the reproducer and the
workaround are attached below as postscripts.)

>
>> Fix that by forcing the kernel threads to use init_mm as the active_mm
>> if the previous active_mm is dying.
>>
>> The determination of a dying mm is based on the absence of an owning
>> task. The selection of the owning task only happens with the CONFIG_MEMCG
>> option. Without that, there is no simple way to determine the life span
>> of a given mm. So it falls back to the old behavior.
>>
>> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
>> ---
>>  include/linux/mm_types.h | 15 +++++++++++++++
>>  kernel/sched/core.c      | 13 +++++++++++--
>>  mm/init-mm.c             |  4 ++++
>>  3 files changed, 30 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 3a37a89eb7a7..32712e78763c 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
>>  	return atomic_read(&mm->tlb_flush_pending) > 1;
>>  }
>>  
>> +#ifdef CONFIG_MEMCG
>> +/*
>> + * A mm is considered dying if there is no owning task.
>> + */
>> +static inline bool mm_dying(struct mm_struct *mm)
>> +{
>> +	return !mm->owner;
>> +}
>> +#else
>> +static inline bool mm_dying(struct mm_struct *mm)
>> +{
>> +	return false;
>> +}
>> +#endif
>> +
>>  struct vm_fault;
> Yuck. So people without memcg will still suffer the terrible 'whatever
> it is this patch fixes'.
>

That is true.

>>  /**
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2b037f195473..923a63262dfd 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3233,13 +3233,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>  	 * Both of these contain the full memory barrier required by
>>  	 * membarrier after storing to rq->curr, before returning to
>>  	 * user-space.
>> +	 *
>> +	 * If mm is NULL and oldmm is dying (!owner), we switch to
>> +	 * init_mm instead to make sure that oldmm can be freed ASAP.
>>  	 */
>> -	if (!mm) {
>> +	if (!mm && !mm_dying(oldmm)) {
>>  		next->active_mm = oldmm;
>>  		mmgrab(oldmm);
>>  		enter_lazy_tlb(oldmm, next);
>> -	} else
>> +	} else {
>> +		if (!mm) {
>> +			mm = &init_mm;
>> +			next->active_mm = mm;
>> +			mmgrab(mm);
>> +		}
>>  		switch_mm_irqs_off(oldmm, mm, next);
>> +	}
>>  
>>  	if (!prev->mm) {
>>  		prev->active_mm = NULL;
> Bah, I see we _still_ haven't 'fixed' that code. And you're making an
> even bigger mess of it.
>
> Let me go find where that cleanup went.

It would be nice if there were a better solution.

Cheers,
Longman
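
P.S. For reference, the test program was along these lines. This is a
hypothetical sketch rather than the actual reproducer, and note that
with the default vm.overcommit settings the process may be OOM-killed
while touching the memory instead of seeing malloc() fail:

/* Keep allocating and touching anonymous memory until it fails. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE	(64UL << 20)	/* 64 MB per allocation */

int main(void)
{
	unsigned long total_mb = 0;

	for (;;) {
		char *p = malloc(CHUNK_SIZE);

		if (!p) {
			fprintf(stderr, "malloc failed after %lu MB\n",
				total_mb);
			return 1;
		}
		/* Touch every byte so RAM and then swap actually fill up. */
		memset(p, 0x5a, CHUNK_SIZE);
		total_mb += CHUNK_SIZE >> 20;
	}
}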
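
P.P.S. The workaround was roughly of this shape: migrate one user task
across every online CPU so that each CPU context-switches to a real
user mm, bumping the dying mm off its active_mm. Again a hypothetical
sketch, not the exact program used:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	cpu_set_t set;

	for (long cpu = 0; cpu < ncpus; cpu++) {
		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		/* Migrate this task to the target CPU... */
		if (sched_setaffinity(0, sizeof(set), &set)) {
			perror("sched_setaffinity");
			continue;
		}
		/* ...and give the scheduler a chance to run it there. */
		sched_yield();
	}
	return 0;
}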