Re: [RFC 1/1] mm: introduce mmap_lock_speculation_{start|end}

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 8 Aug 2024 15:05:05 -0700

On Thu, Aug 8, 2024 at 2:43 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> On Thu, Aug 8, 2024 at 11:11 PM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> > On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko
> > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > >
> > > > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > > >
> > > > > Add helper functions to speculatively perform operations without
> > > > > read-locking mmap_lock, expecting that mmap_lock will not be
> > > > > write-locked and mm is not modified from under us.
> > > > >
> > > > > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> > > > > Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > > > > Cc: Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx>
> > > > > ---
> > > >
> > > > This change makes sense and makes mm's seq a bit more useful and
> > > > meaningful. I've also tested it locally with uprobe stress-test, and
> > > > it seems to work great, I haven't run into any problems with a
> > > > multi-hour stress test run so far. Thanks!
> > >
> > > Thanks for testing and feel free to include this patch into your set.
> >
> > Will do!
> >
> > >
> > > I've been thinking about this some more and there is a very unlikely
> > > corner case if between mmap_lock_speculation_start() and
> > > mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many
> > > times that mm->mm_lock_seq (int) overflows and just happen to reach
> > > the same value as we recorded in mmap_lock_speculation_start(). This
> > > would generate a false positive, which would show up as if the
> > > mmap_lock was never touched. Such overflows are possible for vm_lock
> > > as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688)
> > > but they are not critical because a false result would simply lead to
> > > a retry under mmap_lock. However for your case this would be a
> > > critical issue. This is an extremely low probability scenario but
> > > should we still try to handle it?
> > >
> >
> > No, I think it's fine.
>
> Modern computers don't take *that* long to count to 2^32, even when
> every step involves one or more syscalls. I've seen bugs where, for
> example, a 32-bit refcount is not decremented where it should, making
> it possible to overflow the refcount with 2^32 operations of some
> kind, and those have taken something like 3 hours to trigger in one
> case (https://bugs.chromium.org/p/project-zero/issues/detail?id=2478),
> 14 hours in another case. Or even cases where, if you have enough RAM,
> you can create 2^32 legitimate references to an object and overflow a
> refcount that way
> (https://bugs.chromium.org/p/project-zero/issues/detail?id=809 if you
> had more than 32 GiB of RAM, taking only 25 minutes to overflow the
> 32-bit counter - and that is with every step allocating memory).
> So I'd expect 2^32 simple operations that take the mmap lock for
> writing to be faster than 25 minutes on a modern desktop machine.
>
> So for a reader of some kinda 32-bit sequence count, if it is
> conceivably possible for the reader to take at least maybe a couple
> minutes or so between the sequence count reads (also counting time
> during which the reader is preempted or something like that), there
> could be a problem. At that point in the analysis, if you wanted to
> know whether it's actually exploitable, I guess you'd have to look at
> what kinda context you're running in, and what kinda events can
> interrupt/preempt you (like whether someone can send a sufficiently
> dense flood of IPIs to completely prevent you making forward progress,
> like in https://www.vusec.net/projects/ghostrace/), and for how long
> those things can delay you (maybe including what the pessimal
> scheduler behavior looks like if you're in preemptible context, or how
> long clock interrupts can take to execute when processing a giant pile
> of epoll watches), and so on...
>

And here we are talking about *lockless* *speculative* VMA usage that
will last what, at most on the order of a few microseconds? So I stand
by "can never happen", because if it does, your system is so
overloaded that something like this uprobe issue is your least
concern.

> > Similar problems could happen with refcount_t,
> > for instance (it has a logic to have a sticky "has overflown" state,
> > which I believe relies on the fact that we'll never be able to
> > increment refcount 2bln+ times in between some resetting logic).
> > Anyways, I think it's utterly unrealistic and should be considered
> > impossible.
>
> IIRC refcount_t protects against this even in theoretical, fairly
> pessimal scenarios, because the maximum number of tasks you can have
> on Linux is smaller than the number of refcount decrements you'd have
> to do in parallel to bring a pinned refcount back down to 0.
>
> I know this is a weakness of seqcount_t (though last time I checked I
> couldn't find any examples where it seemed like you could actually
> abuse this).
>
> But if you want a counter, and something bad would happen if the
> counter wraps, and you don't have a really strong guarantee that the
> counter won't wrap, I think it's more robust to make it 64-bit. (Or an
> unsigned long and hope there aren't too many people who still run
> 32-bit kernels on anything important... though that's not very
> pretty.)