On 8/20/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> On 8/20/23, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>> On Sun, Aug 20, 2023 at 12:43:03PM +0200, Mateusz Guzik wrote:
>>> Should the trylock succeed (and thus blocking was avoided), the routine
>>> wants to ensure blocking was still legal to do. However, the method
>>> used ends up calling __cond_resched injecting a voluntary preemption
>>> point with the freshly acquired lock.
>>>
>>> One can hack around it using __might_sleep instead of mere might_sleep,
>>> but since threads keep going off CPU here, I figured it is better to
>>> accommodate it.
>>
>> Except now we search the exception tables every time we call it.
>> The now-deleted comment (c2508ec5a58d) suggests this is slow:
>>
>
> I completely agree it's a rather unfortunate side-effect.
>
>> -        /*
>> -         * Kernel-mode access to the user address space should only occur
>> -         * on well-defined single instructions listed in the exception
>> -         * tables. But, an erroneous kernel fault occurring outside one of
>> -         * those areas which also holds mmap_lock might deadlock attempting
>> -         * to validate the fault against the address space.
>> -         *
>> -         * Only do the expensive exception table search when we might be at
>> -         * risk of a deadlock. This happens if we
>> -         * 1. Failed to acquire mmap_lock, and
>> -         * 2. The access did not originate in userspace.
>> -         */
>>
>> Now, this doesn't mean we're doing it on every page fault. We skip
>> all of this if we're able to handle the fault under the VMA lock,
>> so the effect is probably much smaller than it was. But I'm surprised
>> not to see you send any data quantifying the effect of this change!
>>
>
> Going off CPU *after* taking the lock sounds like an obviously bad
> thing to happen and as such I did not think it warrants any
> measurements.
>
> My first patch looked like this:
> diff --git a/mm/memory.c b/mm/memory.c
> index 1ec1ef3418bf..8662fd69eae8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5259,7 +5259,9 @@ static inline bool
> get_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs
>  {
>          /* Even if this succeeds, make it clear we *might* have slept */
>          if (likely(mmap_read_trylock(mm))) {
> -                might_sleep();
> +#if defined(CONFIG_DEBUG_ATOMIC_SLEEP)
> +                __might_sleep(__FILE__, __LINE__);
> +#endif
>                  return true;
>          }
>
> This keeps assertions while dodging __cond_resched.
>
> But then I figured someone may complain about scheduling latency which
> was not there prior to the patch.
>
> Between the 2 not so great choices I rolled with what I considered the
> safer one.
>
> However, now that I said it, I wonder if perhaps the search could be
> circumvented on x86-64? The idea would be to check if SMAP got
> disabled (and assuming the CPU supports it) -- if so and the faulting
> address belongs to userspace, assume it's all good. This is less
> precise in that SMAP can be disabled over the intended users access
> but would be fine as far as I'm concerned if the search is such a big
> deal.
>

Oof, hit send too fast.

This is less precise in that SMAP can be disabled over A LARGER AREA
THAN the intended users access but would be fine as far as I'm
concerned if the search is such a big deal. there :)

Anyhow I don't feel strongly about any of this, I was mostly
interested in what happens with VFS on the off-CPU front and this one
is just a random thing I needed to check.

Now that I elaborated on my $0,03 I'm happy to respin with the
__might_sleep variant.
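For concreteness, a rough, untested sketch of the SMAP idea above. The
helper name and its placement are made up for illustration only:
regs->flags / X86_EFLAGS_AC is x86-specific, so the real thing would
need an arch hook rather than sitting in generic mm code, and the fault
address would have to be plumbed down from lock_mm_and_find_vma().

/*
 * Hypothetical sketch, not an existing kernel helper: if the faulting
 * context had deliberately opened a user access window (SMAP disabled
 * via stac, i.e. EFLAGS.AC set) and the faulting address is a user
 * address, assume this is a legitimate uaccess and skip the expensive
 * exception table search.  Otherwise fall back to the existing search.
 */
static bool fault_is_probably_uaccess(struct pt_regs *regs, unsigned long addr)
{
        if (boot_cpu_has(X86_FEATURE_SMAP) &&
            (regs->flags & X86_EFLAGS_AC) &&
            addr < TASK_SIZE_MAX)
                return true;

        return search_exception_tables(instruction_pointer(regs)) != NULL;
}

As noted above, AC can be set over a wider region than the intended
user access, so this over-approximates -- it would permit blocking in
some spots where a precise exception table lookup would not, which is
the trade-off being floated here.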
If someone wants a different fix altogether they are welcome to ignore
these patches. I do claim the current state *is* a problem though -- it
can block down_write for no good reason.

-- 
Mateusz Guzik <mjguzik gmail.com>