Hi Andrew, thanks for your reply.
> This sounds a bit suspicious. Please share much more detail about
> these races. If we proceed with this design then mpol_put_async()
> should have comments which fully describe the need for the async free.
> How do we *know* that these races are fully prevented with this
> approach? How do we know that mpol_put_async() won't free the data
> until the race window has fully passed?
A mempolicy can be associated either with a process or with a VMA.
All VMA manipulation is somewhat protected by a down_read on
mmap_lock. In process context there is no locking, because only
the process itself accesses its own state.
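
For reference, the process-context lookup is essentially a plain
pointer read; simplified from get_task_policy() in mm/mempolicy.c
(boot-time fallback details omitted):

	/* A plain read of task state, with no lock held and no
	 * reference taken; safe today only because a task reads
	 * its own ->mempolicy. */
	struct mempolicy *get_task_policy(struct task_struct *p)
	{
		struct mempolicy *pol = p->mempolicy;

		return pol ? pol : &default_policy;
	}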
Now we need to change the process-context mempolicy of the task
specified by a pidfd. That mempolicy may be freed by
pidfd_set_mempolicy() while alloc_pages() is still using it,
so a race condition appears.
The process-context mempolicy is used in the following functions
(a sketch of the common caller pattern follows the list):
alloc_pages()
alloc_pages_bulk_array_mempolicy()
policy_mbind_nodemask()
mempolicy_slab_node()
.....
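
These callers all follow the same shape: the policy pointer is read
once and then dereferenced across several calls without taking a
reference. Simplified from alloc_pages() (interleave and
__GFP_THISNODE handling omitted):

	struct page *alloc_pages(gfp_t gfp, unsigned int order)
	{
		struct mempolicy *pol = get_task_policy(current);

		/* pol is dereferenced here with no refcount held,
		 * which is exactly the window the race below hits. */
		return __alloc_pages(gfp, order,
				     policy_node(gfp, pol, numa_node_id()),
				     policy_nodemask(gfp, pol));
	}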
The race looks like the following:
pidfd_set_mempolicy()            target task stack:
                                 alloc_pages:
                                     mpol = p->mempolicy;
task_lock(task);
old = task->mempolicy;
task->mempolicy = new;
task_unlock(task);
mpol_put(old);
/* old mpol has been freed. */
                                     policy_node(...., mpol)
                                     __alloc_pages(mpol);
To reduce the use of locks and atomic operations (mpol_get/put)
in the hot path, task_work is used in mpol_put_async(): the free
is deferred until the target task exits to user mode, at which
point the process-context mempolicy can no longer be in use, and
mpol_free_async() runs as a task_work in the target's context to
free the mempolicy.
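
A sketch of what mpol_put_async() looks like under this scheme (the
callback_head field name in struct mempolicy is illustrative):

	static void mpol_free_async(struct callback_head *work)
	{
		kmem_cache_free(policy_cache,
				container_of(work, struct mempolicy, cb));
	}

	void mpol_put_async(struct task_struct *task, struct mempolicy *pol)
	{
		if (!atomic_dec_and_test(&pol->refcnt))
			return;

		init_task_work(&pol->cb, mpol_free_async);
		/* TWA_RESUME: run the callback when @task returns to
		 * user mode, i.e. after any in-flight alloc_pages()
		 * has finished with @pol. */
		if (!task_work_add(task, &pol->cb, TWA_RESUME))
			return;

		/* @task is already past exit_task_work(): it cannot
		 * use its mempolicy anymore, so a direct free is safe. */
		kmem_cache_free(policy_cache, pol);
	}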
> Also, in some situations mpol_put_async() will free the data
> synchronously anyway, so aren't these races still present?
The synchronous free happens only when task_work_add() fails,
i.e. when the task has already run exit_task_work(). At that
point the task can no longer use its mempolicy, so we can free
it directly (the fallback path in the sketch above).
> Secondly, why was the `flags' argument added? We might use it one day?
> For what purpose? I mean, every syscall could have a does-nothing
> `flags' arg, but we don't do that. What's the plan here?
I found that some syscalls keep a 'flags' argument for
extensibility, such as process_madvise() and
set_mempolicy_home_node(). Back to our case: this operation has
per-thread rather than per-process semantics, and we could use a
flag to switch between the two as a future extension, if any.
But I'm not sure.
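
For illustration, the usual convention (as in process_madvise()) is
to accept the argument now but reject any non-zero value, so that
bits can be given meaning later without breaking old callers.
Assuming the proposed five-argument form:

	SYSCALL_DEFINE5(pidfd_set_mempolicy, int, pidfd, int, mode,
			const unsigned long __user *, nmask,
			unsigned long, maxnode, unsigned int, flags)
	{
		/* No flags are defined yet. */
		if (flags)
			return -EINVAL;

		/* ... look up the task from @pidfd and install
		 * the new policy ... */
		return 0;
	}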
Thanks.