On Tue, Nov 28, 2023 at 03:11:06PM +0100, Michal Hocko wrote:
> On Wed 22-11-23 16:11:55, Gregory Price wrote:
> [...]
> > + * Like get_vma_policy and get_task_policy, must hold alloc/task_lock
> > + * while calling this.
> > + */
> > +static struct mempolicy *get_task_vma_policy(struct task_struct *task,
> > +                                             struct vm_area_struct *vma,
> > +                                             unsigned long addr, int order,
> > +                                             pgoff_t *ilx)
> [...]
>
> You should add lockdep annotation for alloc_lock/task_lock here for clarity and
> also...
> > @@ -1844,16 +1899,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> >  struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> >                                   unsigned long addr, int order, pgoff_t *ilx)
> >  {
> > -        struct mempolicy *pol;
> > -
> > -        pol = __get_vma_policy(vma, addr, ilx);
> > -        if (!pol)
> > -                pol = get_task_policy(current);
> > -        if (pol->mode == MPOL_INTERLEAVE) {
> > -                *ilx += vma->vm_pgoff >> order;
> > -                *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
> > -        }
> > -        return pol;
> > +        return get_task_vma_policy(current, vma, addr, order, ilx);
>
> I do not think that all get_vma_policy take task_lock (just random check
> dequeue_hugetlb_folio_vma->huge_node->get_vma_policy AFAICS)
>

hm, i might have gotten turned around on this one. Forgot to check for
external references to get_vma_policy. I thought I considered it, but
i clearly did not leave myself any notes if I did.

This pattern is troublesome: we're holding the task lock during the
callback stack in __get_vma_policy, just in case that returns NULL, so
we can return the task policy instead. If that vma is shared, it will
also take the vma shared policy lock (sp->lock).

I almost want to change this interface to return NULL if the VMA doesn't
have one, and change callers to fetch the task policy explicitly instead
of implicitly returning the task policy. At least then we'd only take
the task lock on an explicit access to the *Task* policy.

> Also I do not see policy_nodemask to be handled anywhere. That one is
> used along with get_vma_policy (sometimes hidden like in
> alloc_pages_mpol). It has a dependency on
> cpuset_nodemask_valid_mems_allowed. That means that e.g. mbind on a
> remote task would be constrained by current task cpuset when allocating
> migration targets for the target task. I am wondering how many other
> dependencies like that are lurking there.

bah! thought i dug all these out, but i missed
alloc_migration_target_by_mpol from do_mbind.

I'll need to take another look at the calls to cpusets interfaces to
make sure i dig this out. The number of hidden accesses to current is
really nasty :[

> --
> Michal Hocko
> SUSE Labs
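
A rough userspace sketch of the interface change floated above (return
NULL when the VMA has no policy, and let the caller fetch the task policy
explicitly). This is not kernel code and not a proposed patch: struct
task, struct vma, vma_policy_or_null(), mock_task_lock(),
mock_task_unlock() and effective_mode() are all hypothetical stand-ins
made up for illustration, and refcounting and the shared policy lock are
ignored entirely. It only shows that the task lock is taken on the
fallback path and nowhere else.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; NOT the kernel's definitions. */
enum mpol_mode { MPOL_DEFAULT, MPOL_INTERLEAVE };

struct mempolicy {
    enum mpol_mode mode;
};

struct task {
    struct mempolicy *mempolicy; /* task policy, may be NULL */
    int lock_held;               /* mock of task_lock state */
    int lock_acquisitions;       /* counts how often the lock was taken */
};

struct vma {
    struct mempolicy *policy;    /* NULL when the VMA has no policy */
};

static void mock_task_lock(struct task *t)
{
    assert(!t->lock_held);
    t->lock_held = 1;
    t->lock_acquisitions++;
}

static void mock_task_unlock(struct task *t)
{
    assert(t->lock_held);
    t->lock_held = 0;
}

/* Returns the VMA's own policy or NULL. Never looks at any task. */
static struct mempolicy *vma_policy_or_null(const struct vma *vma)
{
    return vma ? vma->policy : NULL;
}

/*
 * Caller-side fallback: the task lock is taken only when the VMA has
 * no policy, i.e. only on an explicit access to the task policy.
 */
static enum mpol_mode effective_mode(struct task *task, const struct vma *vma)
{
    struct mempolicy *pol = vma_policy_or_null(vma);
    enum mpol_mode mode;

    if (pol)
        return pol->mode; /* VMA policy wins, no task lock needed */

    mock_task_lock(task);
    mode = task->mempolicy ? task->mempolicy->mode : MPOL_DEFAULT;
    mock_task_unlock(task);
    return mode;
}
```

With this shape, a caller that passes a VMA carrying its own policy never
touches the task lock at all; whether that split is actually workable for
every in-tree caller of get_vma_policy is exactly the open question.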