On Tue, Nov 28, 2023 at 03:11:06PM +0100, Michal Hocko wrote: > On Wed 22-11-23 16:11:55, Gregory Price wrote: > [...] > > + * Like get_vma_policy and get_task_policy, must hold alloc/task_lock > > + * while calling this. > > + */ > > +static struct mempolicy *get_task_vma_policy(struct task_struct *task, > > + struct vm_area_struct *vma, > > + unsigned long addr, int order, > > + pgoff_t *ilx) > [...] > > You should add lockdep annotation for alloc_lock/task_lock here for clarity and > also... > > @@ -1844,16 +1899,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, > > struct mempolicy *get_vma_policy(struct vm_area_struct *vma, > > unsigned long addr, int order, pgoff_t *ilx) > > { > > - struct mempolicy *pol; > > - > > - pol = __get_vma_policy(vma, addr, ilx); > > - if (!pol) > > - pol = get_task_policy(current); > > - if (pol->mode == MPOL_INTERLEAVE) { > > - *ilx += vma->vm_pgoff >> order; > > - *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); > > - } > > - return pol; > > + return get_task_vma_policy(current, vma, addr, order, ilx); > > I do not think that all get_vma_policy take task_lock (just random check > dequeue_hugetlb_folio_vma->huge_node->get_vma_policy AFAICS) > > Also I do not see policy_nodemask to be handled anywhere. That one is > used along with get_vma_policy (sometimes hidden like in > alloc_pages_mpol). It has a dependency on > cpuset_nodemask_valid_mems_allowed. That means that e.g. mbind on a > remote task would be constrained by current task cpuset when allocating > migration targets for the target task. I am wondering how many other > dependencies like that are lurking there. So after further investigation, I'm going to have to back out the changes that make home_node and mbind modifiable by an external task and revisit it at a later time. Right now, there's a very nasty rats nest of entanglement between mempolicy and vma/shmem that hides a bunch of accesses to current. It only becomes apparently when you start chasing all the callers of mpol_dup, which had another silent access to current->cpusets. mpol_dup calls the following: current_cpuset_is_being_rebound cpuset_mems_allowed(current) So we would need to do the following 1) create mpol_dup_task and make current explicit, not implicit 2) chase down all callers to mpol_dup and make sure it isn't generated from any of the task interfaces 3) if it is generated from the task interfaces, plumb a reference to current down through... somehow... if possible... Here's a ~1 hour chase that lead me to the conclusion that this will take considerably more work, and is not to be taken lightly: do_mbind mbind_range vma_modify_policy split_vma __split_vma vma_dup_policy mpol_dup vma_replace_policy mpol_dup vma->vm_ops->set_policy - see below __set_mempolicy_home_node mbind_range ... same as above ... digging into vma->vm_ops->set_policy we end up in mm/shmem.c shmem_set_policy mpol_set_shared_policy sp_alloc mpol_dup current_cpuset_is_being_rebound() cpuset_mems_allowed(current) Who knows what else is burried in the vma stack, but making vma mempolicies externally modifiable looks to be a much more monumental task than just simply making the task policy modifiable. For now i'm going to submit a V2 with home_node and mbind removed from the proposal. Those will take far more investigation. This also means that process_set_mempolicy should not be extended to allow for vma policy replacements. ~Gregory