Re: [PATCH v4 1/4] mm: reduce the rcu lock duration

Miaohe Lin <linmiaohe@xxxxxxxxxx> · Tue, 7 Jun 2022 17:19:53 +0800

On 2022/6/4 0:28, Eric W. Biederman wrote:
> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:
> 
>> On 2022/6/1 22:37, Eric W. Biederman wrote:
>>> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:
>>>
>>>> On 2022/6/1 0:09, Eric W. Biederman wrote:
>>>>> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:
>>>> snip
>>>>>>
>>>>>> "
>>>>>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>>>>>> extends the period of the rcu_read_lock until after the permissions checks
>>>>>> are done because it suspects the permissions checks are not safe unless
>>>>>> performed under both rcu_read_lock and task_lock to ensure the task<->mm
>>>>>> association does not change on us while we are working [1]. But extended
>>>>>> rcu read lock does not add much value. Because after permission checking
>>>>>> the permission may still be changed. There's no much difference. So it's
>>>>>> unnecessary to extend the period of the rcu_read_lock. Release the rcu
>>>>>> lock after task refcount is successfully grabbed to reduce the rcu holding
>>>>>> time.
>>>>>>
>>>>>> [1] https://lore.kernel.org/lkml/87sjhzun47.fsf@xxxxxxxxxxxx/
>>>>>> "
>>>>>
>>>>> It doesn't make sense to me.
>>>>>
>>>>> I don't see any sleeping functions called from find_mm_struct or
>>>>> kernel_migrate_pages in the area kernel_migrate_pages in the area of the
>>>>> code protected by get_task_struct.  So at a very basic level I see a
>>>>> justification for dirtying a cache line twice with get_task_struct and
>>>>> put_task_struct to reduce rcu_read_lock hold times.
>>>>>
>>>>> I would contend that a reasonable cleanup based up on the current state
>>>>> of the code would be to extend the rcu_read_lock over get_task_mm so
>>>>
>>>> If so, security_task_movememory will be called inside rcu lock. It might
>>>> call sleeping functions, e.g. smack_log(). I think it's not a good
>>>> idea.
>>>
>>> In general the security functions are not allowed to sleep.
>>> The audit mechanism typically preallocates memory so it does
>>> not need to sleep.  From a quick skim through the code I don't
>>> see smack_log sleeping.
>>>
>>> Certainly the security hooks are not supposed to be inserted into the
>>> code that they prevent reasonable implementation decisions.
>>>
>>> Which is to say if there is a good (non-security hook reason) for
>>> supporting sleeping deal with it.  Otherwise the security hooks has a
>>> bug and needs to get fixed/removed.
>>
>> I see. Many thanks for explanation.
>>
>>>
>>>>> that a reference to task_struct does not need to be taken.  That has
>>>>> the potential to reduce contention and reduce lock hold times.
>>>>>
>>>>>
>>>>> The code is missing a big fat comment with the assertion that it is ok
>>>>> if the permission checks are racy because the race is small, and the
>>>>> worst case thing that happens is the page is migrated to another
>>>>> numa node.
>>>>>
>>>>>
>>>>> Given that the get_mm_task takes task_lock the cost of dirtying the
>>>>> cache line is already being paid.  Perhaps not extending task_lock hold
>>>>> times a little bit is justified, but I haven't seen that case made.
>>>>>
>>>>> This seems like code that is called little enough it would be better for
>>>>> it to be correct, and not need big fat comments explaining why it
>>>>> doesn't matter that they code is deliberately buggy.
>>>>>
>>>>
>>>> Agree. A big fat comments will make code hard to follow.
>>>
>>> No.
>>>
>>> The code is impossible to follow currently.
>>>
>>> The code either requires a comment pointing out that it is deliberately
>>> racy, or the code needs to be fixed.
>>>
>>> Clever and subtle code always requires a comment if for no other
>>> reason then to alert the reader that something a typical is going on.
>>
>> Yes, clever and subtle code requires a comment but others might not.
>>
>>>
>>>>> In short it does not make sense to me to justify a patch for performance
>>>>> reasons when it appears that extending the rcu_read_lock hold time and
>>>>> not touch the task reference count would stop dirtying a cache line and
>>>>> likely have more impact.
>>>>
>>>> IMHO, incremented task refcount should make code works correctly. And extending
>>>> the rcu_read_lock over get_task_mm will break the things because sleeping functions
>>>> might be called while holding rcu lock.
>>>
>>> Which sleeping functions?
>>>
>>> I can't find any.  In particular smack_task_movememory calls
>>> smk_curacc_on_task which is the same function called by
>>> security_task_getpgid.  security_task_getpgid is called
>>> under rcu_read_lock.  So smack won't sleep.
>>
>> Sorry, I didn't take a close look at smack_log code. So I thought it could sleep.
>>
>>>
>>>> Does the patch itself makes sense for you? Should I rephase the commit log further?
>>>> I'm afraid I didn't get your point correctly.
>>>
>>> The patch makes no sense to me because I don't see it doing anything
>>> worth doing.
>>>
>>> get/put_task_struct both dirty a cache line and are expensive especially
>>> when contended.  Dirtying a cache line that is contended is the pretty
>>> much the most expensive native cpu operation.  In pathological scenarios
>>> I have seen it take up to 1s.  Realistically in a cache cold scenario
>>> (which is not as bad as a contended scenario) you are looking at 100ns
>>> or more just to execute get_task_struct/put_task_struct.  That is the
>>> kind of cost it would be nice to avoid all together (even if the
>>> pathological case never comes up).
>>>
>>> So I see two pieces of code where we could use the cheap operation
>>> rcu_read_lock/rcu_read_unlock to remove the expensive operation
>>> get_task_struct/put_task_struct.  Instead I see people removing
>>> rcu_read_lock/rcu_read_unlock.
>>>
>>> That makes no sense.  Especially as your implied reason for making this
>>> change is to make the code have better performance.  Improving
>>> performance is the reason for making critical sections smaller isn't it?
>>>
>>
>> I think you're right. We should extend the rcu_read_lock over get_task_mm so we can
>> remove the expensive operation get_task_struct/put_task_struct and thus avoid possible
>> cacheline penalty. But is the extended rcu lock enough to ensure the task reference
>> is stable when calling security check functions and performing cpuset node validation?
>> Or this race is just OK and can be left alone with a comment?
> 
> The extending the rcu_read_lock is just about removing the expense of
> get_task_struct/put_task_struct.  It can be handled separately from
> other issues at play in this code.
> 
> 
> The rcu_read_lock guarantees that task does not go away.  The
> rcu_read_lock does not guarantee that task->mm does not change.
> 

It seems task_lock is needed in this case.

> 
> 
> A separate but related issue is should the task_lock in get_task_mm
> be extended to cover the security checks so that everything checks
> against the same mm.
> 
> Possibly the code could be refactored or reordered so that everything
> is guaranteed to check against the same mm.

IMHO, it might be overkill to do this.

> 
> 
> If the checks are not made to guarantee they are all checking against
> the same mm, the code needs a comment to say that there is a tiny race.
> The comment should say we don't care about the tiny race because
> the worst that can happen is that a page is migrated to a different
> numa node.  And so we don't care.
> 
> 

I tend to do this one. To make sure, is the below code change what you suggest?

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3c64f3cdac4b..943f58eff1fc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1612,42 +1612,43 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
        rcu_read_lock();
        task = pid ? find_task_by_vpid(pid) : current;
        if (!task) {
-               rcu_read_unlock();
                err = -ESRCH;
-               goto out;
+               goto out_unlock;
        }
-       get_task_struct(task);
-       rcu_read_unlock();

        err = -EINVAL;

        /*
         * Check if this process has the right to modify the specified process.
         * Use the regular "ptrace_may_access()" checks.
+        * The below checks are not made to guarantee they are all checking
+        * against the same mm. But we don't care about such tiny race because
+        * the worst that can happen is that a page is migrated to a different
+        * numa node. And so we don't care.
         */
        if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
                err = -EPERM;
-               goto out_put;
+               goto out_unlock;
        }

        task_nodes = cpuset_mems_allowed(task);
        /* Is the user allowed to access the target nodes? */
        if (!nodes_subset(*new, task_nodes) && !capable(CAP_SYS_NICE)) {
                err = -EPERM;
-               goto out_put;
+               goto out_unlock;
        }

        task_nodes = cpuset_mems_allowed(current);
        nodes_and(*new, *new, task_nodes);
        if (nodes_empty(*new))
-               goto out_put;
+               goto out_unlock;

        err = security_task_movememory(task);
        if (err)
-               goto out_put;
+               goto out_unlock;

        mm = get_task_mm(task);
-       put_task_struct(task);
+       rcu_read_unlock();

        if (!mm) {
                err = -EINVAL;
@@ -1663,8 +1664,8 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,

        return err;

-out_put:
-       put_task_struct(task);
+out_unlock:
+       rcu_read_unlock();
        goto out;

 }

> Eric

Many thanks for your valuable suggestion!

> .
>