On Fri, 11 Nov 2022 16:40:51 +0800 Zhongkun He <hezhongkun.hzk@xxxxxxxxxxxxx> wrote:

> Page allocation usage of task or vma policy occurs in the fault
> path where we hold the mmap_lock for read. because replacing the
> task or vma policy requires that the mmap_lock be held for write,
> the policy can't be freed out from under us while we're using
> it for page allocation. But there are some corner cases(e.g.
> alloc_pages()) which not acquire any lock for read during the
> page allocation. For this reason, task_work is used in
> mpol_put_async() to free mempolicy in pidfd_set_mempolicy().
> Thuse, it avoids into race conditions.

This sounds a bit suspicious.  Please share much more detail about
these races.  If we proceed with this design then mpol_put_async()
should have comments which fully describe the need for the async free.

How do we *know* that these races are fully prevented with this
approach?  How do we know that mpol_put_async() won't free the data
until the race window has fully passed?

Also, in some situations mpol_put_async() will free the data
synchronously anyway, so aren't these races still present?

Secondly, why was the `flags' argument added?  We might use it one day?
For what purpose?  I mean, every syscall could have a does-nothing
`flags' arg, but we don't do that.  What's the plan here?