Re: [PATCH] mm/mempolicy: fix lock contention on mems_allowed

Abel Wu <wuyun.abel@xxxxxxxxxxxxx> · Thu, 11 Aug 2022 16:43:28 +0800

On 8/9/22 8:11 PM, Michal Hocko wrote:
> On Tue 09-08-22 18:49:27, Abel Wu wrote:
>> The mems_allowed field can be modified by other tasks, so it
>> isn't safe to access it with alloc_lock unlocked even in the
>> current process context.
>
> It would be useful to describe the racing scenario and the effect it
> would have. 78b132e9bae9 hasn't really explained thinking behind and why
> it was considered safe to drop the lock. I assume it was based on the
> fact that the operation happens on the current task but this is hard to
> tell.

Sorry for the poor description. Say there are two tasks: A, which belongs
to cpusetA, is calling set_mempolicy(2), while B is changing cpusetA's
cpuset.mems.

    A (set_mempolicy)		B (echo xx > cpuset.mems)

    pol = mpol_new();
				update_tasks_nodemask(cpusetA) {
				  foreach t in cpusetA {
				    cpuset_change_task_nodemask(t) {
				      task_lock(t); // t could be A
    mpol_set_nodemask(pol) {
      new = f(A->mems_allowed);
				      update t->mems_allowed;
      pol.create(pol, new);
    }
				      task_unlock(t);
    task_lock(A);
    A->mempolicy = pol;
    task_unlock(A);
				    }
				  }
				}

In this case A's pol->nodes is computed from the old mems_allowed, and
could be inconsistent with A's new mems_allowed.
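
To make the interleaving above concrete, here is a minimal user-space
sketch that replays it step by step in a single thread. Everything in it
is a toy stand-in and not kernel code: the toy_* names, the 64-bit
bitmap used as a nodemask, the boolean "lock", and the assumption that
f() is a plain intersection are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in (assumption): a nodemask is just a 64-bit bitmap. */
typedef uint64_t toy_nodemask_t;

struct toy_task {
	bool alloc_locked;           /* models task_lock()/task_unlock() */
	toy_nodemask_t mems_allowed; /* models t->mems_allowed */
	toy_nodemask_t policy_nodes; /* models t->mempolicy->nodes */
};

static void toy_task_lock(struct toy_task *t)
{
	assert(!t->alloc_locked); /* single-threaded replay: never contended */
	t->alloc_locked = true;
}

static void toy_task_unlock(struct toy_task *t)
{
	assert(t->alloc_locked);
	t->alloc_locked = false;
}

/* Assumed shape of f(): restrict the requested nodes to the allowed ones. */
static toy_nodemask_t toy_f(toy_nodemask_t requested, toy_nodemask_t allowed)
{
	return requested & allowed;
}

/* Replays the racy ordering: A reads mems_allowed while B holds the lock. */
static void toy_set_mempolicy_racy(struct toy_task *a, toy_nodemask_t requested,
				   toy_nodemask_t new_mems)
{
	toy_task_lock(a);                                      /* B */
	toy_nodemask_t nodes = toy_f(requested, a->mems_allowed); /* A, unlocked */
	a->mems_allowed = new_mems;                            /* B */
	toy_task_unlock(a);                                    /* B */

	toy_task_lock(a);                                      /* A */
	a->policy_nodes = nodes;                               /* A: stale mask */
	toy_task_unlock(a);                                    /* A */
}

/* Same replay (B takes the lock first), but A computes under the lock. */
static void toy_set_mempolicy_locked(struct toy_task *a, toy_nodemask_t requested,
				     toy_nodemask_t new_mems)
{
	toy_task_lock(a);                                      /* B */
	a->mems_allowed = new_mems;                            /* B */
	toy_task_unlock(a);                                    /* B */

	toy_task_lock(a);                                      /* A */
	a->policy_nodes = toy_f(requested, a->mems_allowed);   /* A: fresh mask */
	toy_task_unlock(a);                                    /* A */
}

/* True when the policy only names nodes that are currently allowed. */
static bool toy_policy_consistent(const struct toy_task *a)
{
	return (a->policy_nodes & ~a->mems_allowed) == 0;
}
```

With mems_allowed changing from nodes {0,1} to nodes {2,3}, the racy
replay leaves the policy on {0,1}, outside the new mems_allowed, while
the locked replay ends up with a mask inside it.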

The situation is different when replacing the VMAs' policies: pol->nodes
can only go stale while current_cpuset_is_being_rebound() is true:

    A (mbind)			B (echo xx > cpuset.mems)

				cpuset_being_rebound = cpusetA;
				update_tasks_nodemask(cpusetA) {
				  foreach t in cpusetA {
				    cpuset_change_task_nodemask(t) {
				      task_lock(t); // t could be A
    pol = mpol_new();
    mmap_write_lock(A->mm);
    mpol_set_nodemask(pol) {
      mask = f(A->mems_allowed);
				      update t->mems_allowed;
      pol.create(pol, mask);
    }
				      task_unlock(t);
				    }
    foreach v in A->mm {
      if (current_cpuset_is_being_rebound())
        pol.rebind(pol, cpuset.mems);
      v->vma_policy = pol;
    }
    mmap_write_unlock(A->mm);
				    mmap_write_lock(t->mm);
				    mpol_rebind_mm(t->mm);
				    mmap_write_unlock(t->mm);
				  }
				}
				cpuset_being_rebound = NULL;

In this case cpuset.mems, which has already been updated, is what is
finally used to calculate pol->nodes, rather than A->mems_allowed.
So it is OK to call mpol_set_nodemask() without holding alloc_lock when
doing mbind(2).
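
The mbind(2) ordering can be replayed the same way. This is again only a
single-threaded toy sketch under loud assumptions: the mb_* names and
types are hypothetical, a nodemask is a 64-bit bitmap, and the rebind
step is modelled as a plain intersection with cpuset.mems, which is much
simpler than what the real rebind does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in (assumption): a nodemask is just a 64-bit bitmap. */
typedef uint64_t mb_nodemask_t;

struct mb_cpuset {
	mb_nodemask_t mems;  /* models cpuset.mems */
	bool being_rebound;  /* models cpuset_being_rebound == this cpuset */
};

struct mb_task {
	mb_nodemask_t mems_allowed; /* models A->mems_allowed */
};

/*
 * Replays the mbind interleaving and returns the nodes that end up in
 * the VMA policy. The rebind is modelled as "requested & cpuset.mems".
 */
static mb_nodemask_t mb_replay_mbind(struct mb_cpuset *cs, struct mb_task *a,
				     mb_nodemask_t requested,
				     mb_nodemask_t new_mems)
{
	assert(!cs->being_rebound); /* precondition: no rebind in flight */

	/* B: cpuset.mems is already updated when the rebind starts. */
	cs->mems = new_mems;
	cs->being_rebound = true;

	/* A: mask = f(A->mems_allowed), read unlocked, still the old value. */
	mb_nodemask_t pol_nodes = requested & a->mems_allowed;

	/* B: update t->mems_allowed under task_lock(t). */
	a->mems_allowed = new_mems;

	/* A: the VMA loop notices the rebind and recomputes from cpuset.mems. */
	if (cs->being_rebound)
		pol_nodes = requested & cs->mems;

	/* B: the rebind is finished. */
	cs->being_rebound = false;

	return pol_nodes;
}
```

Even though A first computes its mask from the old mems_allowed, the
value that reaches the VMAs in this replay is derived from the updated
cpuset.mems.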

Best Regards,
Abel