On 2022/6/4 0:28, Eric W. Biederman wrote: > Miaohe Lin <linmiaohe@xxxxxxxxxx> writes: > >> On 2022/6/1 22:37, Eric W. Biederman wrote: >>> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes: >>> >>>> On 2022/6/1 0:09, Eric W. Biederman wrote: >>>>> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes: >>>> snip >>>>>> >>>>>> " >>>>>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct") >>>>>> extends the period of the rcu_read_lock until after the permissions checks >>>>>> are done because it suspects the permissions checks are not safe unless >>>>>> performed under both rcu_read_lock and task_lock to ensure the task<->mm >>>>>> association does not change on us while we are working [1]. But extended >>>>>> rcu read lock does not add much value. Because after permission checking >>>>>> the permission may still be changed. There's no much difference. So it's >>>>>> unnecessary to extend the period of the rcu_read_lock. Release the rcu >>>>>> lock after task refcount is successfully grabbed to reduce the rcu holding >>>>>> time. >>>>>> >>>>>> [1] https://lore.kernel.org/lkml/87sjhzun47.fsf@xxxxxxxxxxxx/ >>>>>> " >>>>> >>>>> It doesn't make sense to me. >>>>> >>>>> I don't see any sleeping functions called from find_mm_struct or >>>>> kernel_migrate_pages in the area kernel_migrate_pages in the area of the >>>>> code protected by get_task_struct. So at a very basic level I see a >>>>> justification for dirtying a cache line twice with get_task_struct and >>>>> put_task_struct to reduce rcu_read_lock hold times. >>>>> >>>>> I would contend that a reasonable cleanup based up on the current state >>>>> of the code would be to extend the rcu_read_lock over get_task_mm so >>>> >>>> If so, security_task_movememory will be called inside rcu lock. It might >>>> call sleeping functions, e.g. smack_log(). I think it's not a good >>>> idea. >>> >>> In general the security functions are not allowed to sleep. >>> The audit mechanism typically preallocates memory so it does >>> not need to sleep. From a quick skim through the code I don't >>> see smack_log sleeping. >>> >>> Certainly the security hooks are not supposed to be inserted into the >>> code that they prevent reasonable implementation decisions. >>> >>> Which is to say if there is a good (non-security hook reason) for >>> supporting sleeping deal with it. Otherwise the security hooks has a >>> bug and needs to get fixed/removed. >> >> I see. Many thanks for explanation. >> >>> >>>>> that a reference to task_struct does not need to be taken. That has >>>>> the potential to reduce contention and reduce lock hold times. >>>>> >>>>> >>>>> The code is missing a big fat comment with the assertion that it is ok >>>>> if the permission checks are racy because the race is small, and the >>>>> worst case thing that happens is the page is migrated to another >>>>> numa node. >>>>> >>>>> >>>>> Given that the get_mm_task takes task_lock the cost of dirtying the >>>>> cache line is already being paid. Perhaps not extending task_lock hold >>>>> times a little bit is justified, but I haven't seen that case made. >>>>> >>>>> This seems like code that is called little enough it would be better for >>>>> it to be correct, and not need big fat comments explaining why it >>>>> doesn't matter that they code is deliberately buggy. >>>>> >>>> >>>> Agree. A big fat comments will make code hard to follow. >>> >>> No. >>> >>> The code is impossible to follow currently. >>> >>> The code either requires a comment pointing out that it is deliberately >>> racy, or the code needs to be fixed. >>> >>> Clever and subtle code always requires a comment if for no other >>> reason then to alert the reader that something a typical is going on. >> >> Yes, clever and subtle code requires a comment but others might not. >> >>> >>>>> In short it does not make sense to me to justify a patch for performance >>>>> reasons when it appears that extending the rcu_read_lock hold time and >>>>> not touch the task reference count would stop dirtying a cache line and >>>>> likely have more impact. >>>> >>>> IMHO, incremented task refcount should make code works correctly. And extending >>>> the rcu_read_lock over get_task_mm will break the things because sleeping functions >>>> might be called while holding rcu lock. >>> >>> Which sleeping functions? >>> >>> I can't find any. In particular smack_task_movememory calls >>> smk_curacc_on_task which is the same function called by >>> security_task_getpgid. security_task_getpgid is called >>> under rcu_read_lock. So smack won't sleep. >> >> Sorry, I didn't take a close look at smack_log code. So I thought it could sleep. >> >>> >>>> Does the patch itself makes sense for you? Should I rephase the commit log further? >>>> I'm afraid I didn't get your point correctly. >>> >>> The patch makes no sense to me because I don't see it doing anything >>> worth doing. >>> >>> get/put_task_struct both dirty a cache line and are expensive especially >>> when contended. Dirtying a cache line that is contended is the pretty >>> much the most expensive native cpu operation. In pathological scenarios >>> I have seen it take up to 1s. Realistically in a cache cold scenario >>> (which is not as bad as a contended scenario) you are looking at 100ns >>> or more just to execute get_task_struct/put_task_struct. That is the >>> kind of cost it would be nice to avoid all together (even if the >>> pathological case never comes up). >>> >>> So I see two pieces of code where we could use the cheap operation >>> rcu_read_lock/rcu_read_unlock to remove the expensive operation >>> get_task_struct/put_task_struct. Instead I see people removing >>> rcu_read_lock/rcu_read_unlock. >>> >>> That makes no sense. Especially as your implied reason for making this >>> change is to make the code have better performance. Improving >>> performance is the reason for making critical sections smaller isn't it? >>> >> >> I think you're right. We should extend the rcu_read_lock over get_task_mm so we can >> remove the expensive operation get_task_struct/put_task_struct and thus avoid possible >> cacheline penalty. But is the extended rcu lock enough to ensure the task reference >> is stable when calling security check functions and performing cpuset node validation? >> Or this race is just OK and can be left alone with a comment? > > The extending the rcu_read_lock is just about removing the expense of > get_task_struct/put_task_struct. It can be handled separately from > other issues at play in this code. > > > The rcu_read_lock guarantees that task does not go away. The > rcu_read_lock does not guarantee that task->mm does not change. > It seems task_lock is needed in this case. > > > A separate but related issue is should the task_lock in get_task_mm > be extended to cover the security checks so that everything checks > against the same mm. > > Possibly the code could be refactored or reordered so that everything > is guaranteed to check against the same mm. IMHO, it might be overkill to do this. > > > If the checks are not made to guarantee they are all checking against > the same mm, the code needs a comment to say that there is a tiny race. > The comment should say we don't care about the tiny race because > the worst that can happen is that a page is migrated to a different > numa node. And so we don't care. > > I tend to do this one. To make sure, is the below code change what you suggest? diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 3c64f3cdac4b..943f58eff1fc 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1612,42 +1612,43 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode, rcu_read_lock(); task = pid ? find_task_by_vpid(pid) : current; if (!task) { - rcu_read_unlock(); err = -ESRCH; - goto out; + goto out_unlock; } - get_task_struct(task); - rcu_read_unlock(); err = -EINVAL; /* * Check if this process has the right to modify the specified process. * Use the regular "ptrace_may_access()" checks. + * The below checks are not made to guarantee they are all checking + * against the same mm. But we don't care about such tiny race because + * the worst that can happen is that a page is migrated to a different + * numa node. And so we don't care. */ if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) { err = -EPERM; - goto out_put; + goto out_unlock; } task_nodes = cpuset_mems_allowed(task); /* Is the user allowed to access the target nodes? */ if (!nodes_subset(*new, task_nodes) && !capable(CAP_SYS_NICE)) { err = -EPERM; - goto out_put; + goto out_unlock; } task_nodes = cpuset_mems_allowed(current); nodes_and(*new, *new, task_nodes); if (nodes_empty(*new)) - goto out_put; + goto out_unlock; err = security_task_movememory(task); if (err) - goto out_put; + goto out_unlock; mm = get_task_mm(task); - put_task_struct(task); + rcu_read_unlock(); if (!mm) { err = -EINVAL; @@ -1663,8 +1664,8 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode, return err; -out_put: - put_task_struct(task); +out_unlock: + rcu_read_unlock(); goto out; } > Eric Many thanks for your valuable suggestion! > . >