On Mon 14-11-22 00:41:21, Zhongkun He wrote: > Hi Andrew, thanks for your replay. > > > This sounds a bit suspicious. Please share much more detail about > > these races. If we proced with this design then mpol_put_async() > > shouild have comments which fully describe the need for the async free. > > > > How do we *know* that these races are fully prevented with this > > approach? How do we know that mpol_put_async() won't free the data > > until the race window has fully passed? > > A mempolicy can be either associated with a process or with a VMA. > All vma manipulation is somewhat protected by a down_read on > mmap_lock.In process context there is no locking because only > the process accesses its own state before. We shouldn't really rely on mmap_sem for this IMO. There is alloc_lock (aka task lock) that makes sure the policy is stable so that caller can atomically take a reference and hold on the policy. And we do not do that consistently and this should be fixed. E.g. just looking at some random places like allowed_mems_nr (relying on get_task_policy) is completely lockless and some paths (like fadvise) do not use any of the explicit (alloc_lock) or implicit (mmap_lock) locking. That means that the task_work based approach cannot really work in this case, right? Playing more tricks will not really help long term. So while your patch tries to workaround the current state of the art I do not think we really want that. As stated previously, I would much rather see proper reference counting instead. I thought we have agreed this would be the first approach unless the resulting performance is really bad. Have you concluded that to be the case? I do not see any numbers or notes in the changelog. -- Michal Hocko SUSE Labs