On Thu, Aug 8, 2024 at 3:36 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote: > > On Thu, Aug 8, 2024 at 3:16 PM Jann Horn <jannh@xxxxxxxxxx> wrote: > > > > On Fri, Aug 9, 2024 at 12:05 AM Andrii Nakryiko > > <andrii.nakryiko@xxxxxxxxx> wrote: > > > On Thu, Aug 8, 2024 at 2:43 PM Jann Horn <jannh@xxxxxxxxxx> wrote: > > > > > > > > On Thu, Aug 8, 2024 at 11:11 PM Andrii Nakryiko > > > > <andrii.nakryiko@xxxxxxxxx> wrote: > > > > > On Thu, Aug 8, 2024 at 2:02 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > > > > > > > > > > > On Thu, Aug 8, 2024 at 8:19 PM Andrii Nakryiko > > > > > > <andrii.nakryiko@xxxxxxxxx> wrote: > > > > > > > > > > > > > > On Wed, Aug 7, 2024 at 11:23 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > > > > > > > > > > > > > > > Add helper functions to speculatively perform operations without > > > > > > > > read-locking mmap_lock, expecting that mmap_lock will not be > > > > > > > > write-locked and mm is not modified from under us. > > > > > > > > > > > > > > > > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx> > > > > > > > > Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx> > > > > > > > > Cc: Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> > > > > > > > > --- > > > > > > > > > > > > > > This change makes sense and makes mm's seq a bit more useful and > > > > > > > meaningful. I've also tested it locally with uprobe stress-test, and > > > > > > > it seems to work great, I haven't run into any problems with a > > > > > > > multi-hour stress test run so far. Thanks! > > > > > > > > > > > > Thanks for testing and feel free to include this patch into your set. > > > > > > > > > > Will do! > > > > > > > > > > > > > > > > > I've been thinking about this some more and there is a very unlikely > > > > > > corner case if between mmap_lock_speculation_start() and > > > > > > mmap_lock_speculation_end() mmap_lock is write-locked/unlocked so many > > > > > > times that mm->mm_lock_seq (int) overflows and just happen to reach > > > > > > the same value as we recorded in mmap_lock_speculation_start(). This > > > > > > would generate a false positive, which would show up as if the > > > > > > mmap_lock was never touched. Such overflows are possible for vm_lock > > > > > > as well (see: https://elixir.bootlin.com/linux/v6.10.3/source/include/linux/mm_types.h#L688) > > > > > > but they are not critical because a false result would simply lead to > > > > > > a retry under mmap_lock. However for your case this would be a > > > > > > critical issue. This is an extremely low probability scenario but > > > > > > should we still try to handle it? > > > > > > > > > > > > > > > > No, I think it's fine. > > > > > > > > Modern computers don't take *that* long to count to 2^32, even when > > > > every step involves one or more syscalls. I've seen bugs where, for > > > > example, a 32-bit refcount is not decremented where it should, making > > > > it possible to overflow the refcount with 2^32 operations of some > > > > kind, and those have taken something like 3 hours to trigger in one > > > > case (https://bugs.chromium.org/p/project-zero/issues/detail?id=2478), > > > > 14 hours in another case. Or even cases where, if you have enough RAM, > > > > you can create 2^32 legitimate references to an object and overflow a > > > > refcount that way > > > > (https://bugs.chromium.org/p/project-zero/issues/detail?id=809 if you > > > > had more than 32 GiB of RAM, taking only 25 minutes to overflow the > > > > 32-bit counter - and that is with every step allocating memory). > > > > So I'd expect 2^32 simple operations that take the mmap lock for > > > > writing to be faster than 25 minutes on a modern desktop machine. > > > > > > > > So for a reader of some kinda 32-bit sequence count, if it is > > > > conceivably possible for the reader to take at least maybe a couple > > > > minutes or so between the sequence count reads (also counting time > > > > during which the reader is preempted or something like that), there > > > > could be a problem. At that point in the analysis, if you wanted to > > > > know whether it's actually exploitable, I guess you'd have to look at > > > > what kinda context you're running in, and what kinda events can > > > > interrupt/preempt you (like whether someone can send a sufficiently > > > > dense flood of IPIs to completely prevent you making forward progress, > > > > like in https://www.vusec.net/projects/ghostrace/), and for how long > > > > those things can delay you (maybe including what the pessimal > > > > scheduler behavior looks like if you're in preemptible context, or how > > > > long clock interrupts can take to execute when processing a giant pile > > > > of epoll watches), and so on... > > > > > > > > > > And here we are talking about *lockless* *speculative* VMA usage that > > > will last what, at most on the order of a few microseconds? > > > > Are you talking about time spent in task context, or time spent while > > the task is on the CPU (including time in interrupt context), or about > > wall clock time? > > We are doing, roughly: > > mmap_lock_speculation_start(); > rcu_read_lock(); > vma_lookup(); > rb_find(); > rcu_read_unlock(); > mmap_lock_speculation_end(); > > > On non-RT kernel this can be prolonged only by having an NMI somewhere > in the middle. On RT it can get preempted even within RCU locked > region, if I understand correctly. If you manage to make this part run > sufficiently long to overflow 31-bit counter, it's probably a bigger > problem than mmap's sequence wrapping over, no? I was also thinking that an easy way to strengthen the guarantee this overflow does not happen is to make mm->mm_lock_seq a 64-bit counter while keeping vma->vm_lock_seq as is. When comparing them we can use the lowest 32 bits of mm->mm_lock_seq without any loss of correctness. Seems easy enough (famous last words). If we decide to do that I will run performance tests to make sure performance does not suffer. > > > > > https://www.vusec.net/projects/ghostrace/ is pretty amazing - when you > > look at the paper > > https://download.vusec.net/papers/ghostrace_sec24.pdf you can see in > > Figure 4 how they managed to turn a race window that's 8 instructions > > wide into a window they can stretch "indefinitely", and they didn't > > even have to reschedule to pull it off. If I understand correctly, > > they stretched the race window to something like 35 seconds and could > > have stretched it even wider if they had wanted to? > > > > (And yes, Linux fixed the specific trick they used for doing that, but > > it still shows that this kinda thing is possible in principle.)