On 8/20/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> On 8/20/23, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>> On Sun, Aug 20, 2023 at 12:43:03PM +0200, Mateusz Guzik wrote:
>>> Should the trylock succeed (and thus blocking was avoided), the routine
>>> wants to ensure blocking was still legal to do. However, the method
>>> used ends up calling __cond_resched injecting a voluntary preemption
>>> point with the freshly acquired lock.
>>>
>>> One can hack around it using __might_sleep instead of mere might_sleep,
>>> but since threads keep going off CPU here, I figured it is better to
>>> accommodate it.
>>
>> Except now we search the exception tables every time we call it.
>> The now-deleted comment (c2508ec5a58d) suggests this is slow:
>>
>
> I completely agree it's a rather unfortunate side-effect.
>
>> -        /*
>> -         * Kernel-mode access to the user address space should only occur
>> -         * on well-defined single instructions listed in the exception
>> -         * tables. But, an erroneous kernel fault occurring outside one of
>> -         * those areas which also holds mmap_lock might deadlock attempting
>> -         * to validate the fault against the address space.
>> -         *
>> -         * Only do the expensive exception table search when we might be at
>> -         * risk of a deadlock. This happens if we
>> -         * 1. Failed to acquire mmap_lock, and
>> -         * 2. The access did not originate in userspace.
>> -         */
>>
>> Now, this doesn't mean we're doing it on every page fault. We skip
>> all of this if we're able to handle the fault under the VMA lock,
>> so the effect is probably much smaller than it was. But I'm surprised
>> not to see you send any data quantifying the effect of this change!
>>
>
> Going off CPU *after* taking the lock sounds like an obviously bad
> thing to happen and as such I did not think it warrants any
> measurements.
>
> My first patch looked like this:
> diff --git a/mm/memory.c b/mm/memory.c
> index 1ec1ef3418bf..8662fd69eae8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5259,7 +5259,9 @@ static inline bool
> get_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs
>  {
>          /* Even if this succeeds, make it clear we *might* have slept */
>          if (likely(mmap_read_trylock(mm))) {
> -                might_sleep();
> +#if defined(CONFIG_DEBUG_ATOMIC_SLEEP)
> +                __might_sleep(__FILE__, __LINE__);
> +#endif
>                  return true;
>          }
>
> This keeps assertions while dodging __cond_resched.
>
> But then I figured someone may complain about scheduling latency which
> was not there prior to the patch.
>
> Between the 2 not so great choices I rolled with what I considered the
> safer one.
>
> However, now that I said it, I wonder if perhaps the search could be
> circumvented on x86-64? The idea would be to check if SMAP got
> disabled (and assuming the CPU supports it) -- if so and the faulting
> address belongs to userspace, assume it's all good. This is less
> precise in that SMAP can be disabled over the intended users access
> but would be fine as far as I'm concerned if the search is such a big
> deal.
>

Oof, hit send too fast.

This is less precise in that SMAP can be disabled over A LARGER AREA
THAN the intended users access but would be fine as far as I'm
concerned if the search is such a big deal. there :)

Anyhow I don't feel strongly about any of this, I was mostly
interested in what happens with VFS on the off-CPU front and this one
is just a random thing I needed to check.

Now that I elaborated on my $0,03 I'm happy to respin with the
__might_sleep variant.
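For concreteness, a rough, untested sketch of the SMAP idea above. The
helper name and its placement are made up for illustration only:
regs->flags / X86_EFLAGS_AC is x86-specific, so the real thing would
need an arch hook rather than sitting in generic mm code, and the fault
address would have to be plumbed down from lock_mm_and_find_vma().

/*
 * Hypothetical sketch, not an existing kernel helper: if the faulting
 * context had deliberately opened a user access window (SMAP disabled
 * via stac, i.e. EFLAGS.AC set) and the faulting address is a user
 * address, assume this is a legitimate uaccess and skip the expensive
 * exception table search.  Otherwise fall back to the existing search.
 */
static bool fault_is_probably_uaccess(struct pt_regs *regs, unsigned long addr)
{
        if (boot_cpu_has(X86_FEATURE_SMAP) &&
            (regs->flags & X86_EFLAGS_AC) &&
            addr < TASK_SIZE_MAX)
                return true;

        return search_exception_tables(instruction_pointer(regs)) != NULL;
}

As noted above, AC can be set over a wider region than the intended
user access, so this over-approximates -- it would permit blocking in
some spots where a precise exception table lookup would not, which is
the trade-off being floated here.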
If someone wants a different fix altogether they are welcome to ignore
these patches. I do claim the current state *is* a problem though -- it
can block down_write for no good reason.

-- 
Mateusz Guzik <mjguzik gmail.com>