On Tue, Oct 22, 2024 at 09:57:39PM +0200, Jann Horn wrote:
> On Tue, Oct 22, 2024 at 9:35 PM Lorenzo Stoakes
> <lorenzo.stoakes@xxxxxxxxxx> wrote:
> > On Tue, Oct 22, 2024 at 09:08:53PM +0200, Jann Horn wrote:
> > > On Mon, Oct 21, 2024 at 10:46 PM Vlastimil Babka <vbabka@xxxxxxx> wrote:
> > > > On 10/21/24 22:27, Lorenzo Stoakes wrote:
> > > > > On Mon, Oct 21, 2024 at 10:11:29PM +0200, Vlastimil Babka wrote:
> > > > >> On 10/20/24 18:20, Lorenzo Stoakes wrote:
> > > > >> > +	while (true) {
> > > > >> > +		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
> > > > >> > +		err = walk_page_range_mm(vma->vm_mm, start, end,
> > > > >> > +					 &guard_poison_walk_ops, NULL);
> > > > >> > +		if (err <= 0)
> > > > >> > +			return err;
> > > > >> > +
> > > > >> > +		/*
> > > > >> > +		 * OK some of the range have non-guard pages mapped, zap
> > > > >> > +		 * them. This leaves existing guard pages in place.
> > > > >> > +		 */
> > > > >> > +		zap_page_range_single(vma, start, end - start, NULL);
> > > > >>
> > > > >> ... however the potentially endless loop doesn't seem great. Could a
> > > > >> malicious program keep refaulting the range (ignoring any segfaults if it
> > > > >> loses a race) with one thread while failing to make progress here with
> > > > >> another thread? Is that ok because it would only punish itself?
> > > > >
> > > > > Sigh. Again, I don't think you've read the previous series have you? Or
> > > > > even the changelog... I added this as Jann asked for it. Originally we'd
> > > > > -EAGAIN if we got raced. See the discussion over in v1 for details.
> > > > >
> > > > > I did it that way specifically to avoid such things, but Jann didn't appear
> > > > > to think it was a problem.
> > > > If Jann is fine with this then it must be secure enough.
> > > My thinking there was:
> > >
> > > We can legitimately race with adjacent faults populating the area
> > > we're operating on with THP pages; as long as the zapping and
> > > poison-marker-setting are separate, *someone* will have to do the
> > > retry. Either we do it in the kernel, or we tell userspace to handle
> > > it, but having the kernel take care of it is preferable because it
> > > makes the stable UAPI less messy.
> > >
> > > One easy way to do it in the kernel would be to return -ERESTARTNOINTR
> > > after the zap_page_range_single() instead of jumping back up, which in
> > > terms of locking and signal handling and such would be equivalent to
> > > looping in userspace (because really that's what -ERESTARTNOINTR does
> > > - it returns out to userspace and moves the instruction pointer back
> > > to restart the syscall). Though if we do that immediately, it might
> > > make MADV_POISON unnecessarily slow, so we should probably retry once
> > > before doing that. The other easy way is to just loop here.
> >
> > Yes we should definitely retry probably a few times to cover the rare
> > situation of a THP race as you describe under non-abusive circumstances.
> > >
> > > The cond_resched() and pending fatal signal check mean that (except on
> > > CONFIG_PREEMPT_NONE) the only differences between the current
> > > implementation and looping in userspace are that we don't handle
> > > non-fatal signals in between iterations and that we keep hogging the
> > > mmap_lock in read mode.
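
Just to make sure we're thinking of the same shape of fix - something
like the below is what I have in mind. Entirely untested sketch, and the
retry cap is an arbitrary number I'm assuming here, not anything from
the series:

	int i, err;

	for (i = 0; i < MAX_GUARD_RETRIES; i++) { /* cap assumed, e.g. 3 */
		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
		err = walk_page_range_mm(vma->vm_mm, start, end,
					 &guard_poison_walk_ops, NULL);
		if (err <= 0)
			return err;

		/*
		 * Part of the range has non-guard pages mapped - zap
		 * them, leaving any existing guard pages in place.
		 */
		zap_page_range_single(vma, start, end - start, NULL);
	}

	/*
	 * Still racing with faults after several attempts - return out
	 * to userspace and restart the syscall from there, which drops
	 * mmap_lock and allows signal delivery and rescheduling before
	 * we try again.
	 */
	return -ERESTARTNOINTR;

That keeps the common THP-race case cheap while bounding how long we
can hog the lock in-kernel.
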
> > > We do already have a bunch of codepaths that
> > > retry on concurrent page table changes, like when zap_pte_range()
> > > encounters a pte_offset_map_lock() failure; though I guess the
> > > difference is that the retry on those is just a couple instructions,
> > > which would be harder to race consistently, while here we redo walks
> > > across the entire range, which should be fairly easy to race
> > > repeatedly.
> > >
> > > So I guess you have a point that this might be the easiest way to
> > > stall other tasks that are trying to take mmap_lock for an extended
> > > amount of time, I did not fully consider that... and then I guess you
> > > could use that to slow down usercopy fault handling (once the lock
> > > switches to handoff mode because of a stalled writer?) or slow down
> > > other processes trying to read /proc/$pid/cmdline?
> >
> > Hm does that need a write lock?
>
> No, but if you have one reader that is hogging the rwsem, and then a
> writer is queued up on the rwsem afterwards, I think new readers will
> sometimes be queued up behind the writer. So even though the rwsem is
> only actually held by a reader, new readers can't immediately take the
> rwsem because the rwsem code thinks that would be unfair to a pending
> writer who just wants to make some quick change. I'm not super
> familiar with this code, but basically I think it works roughly like
> this:
>
> If the rwsem code notices that a bunch of readers are preventing a
> writer from taking the lock, the rwsem code will start queuing new
> readers behind the queued writer. You can see in rwsem_read_trylock()
> that the trylock fastpath is skipped if anyone is waiting on the rwsem
> or the handoff flag is set, and in rwsem_down_read_slowpath() the
> "reader optimistic lock stealing" path is skipped if the lock is
> currently held by multiple readers or if the handoff bit is set.
>
> The handoff bit can be set in rwsem_try_write_lock() by a writer "if
> it is an RT task or wait in the wait queue for too long". Basically I
> think it means something like "I think other users of the lock are
> hogging it more than they should, stop stealing the lock from me".
> And the timeout for triggering handoff mode is pretty short:
> RWSEM_WAIT_TIMEOUT is defined to something like 4ms, so I think
> that's how long writers tolerate the lock being hogged by readers
> before they prevent new readers from stealing the lock.

Ack, makes sense! -ERESTARTNOINTR should help resolve this, so unless
anybody has any objection I'll go ahead and do a respin taking that
approach (+ all other fixes) for v2.

Thanks for your input!
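
P.S. to double-check my own understanding of the handoff behaviour you
describe, here it is boiled down to a self-contained toy model. To be
clear, this is NOT the actual kernel/locking/rwsem.c code - all names
below are made up for illustration:

	#include <stdbool.h>

	/* Toy model of the rwsem state relevant to reader admission. */
	struct rwsem_model {
		int  reader_count;  /* readers currently holding the lock */
		bool have_waiters;  /* anyone queued on the wait list */
		bool handoff;       /* set by a writer that has waited
				     * longer than ~RWSEM_WAIT_TIMEOUT */
	};

	/*
	 * Models the rwsem_read_trylock() behaviour described above: the
	 * reader fastpath is skipped if anyone is waiting on the rwsem or
	 * the handoff flag is set.
	 */
	static bool read_fastpath_allowed(const struct rwsem_model *sem)
	{
		return !sem->have_waiters && !sem->handoff;
	}

	/*
	 * Models the "reader optimistic lock stealing" check in
	 * rwsem_down_read_slowpath(): skipped if the lock is held by
	 * multiple readers or the handoff bit is set, so new readers go
	 * to sleep behind the queued writer even though the lock is only
	 * read-held.
	 */
	static bool read_stealing_allowed(const struct rwsem_model *sem)
	{
		return sem->reader_count <= 1 && !sem->handoff;
	}

So once a writer has waited long enough (something like 4ms) to set the
handoff bit, both paths fail for new readers and they all sleep behind
that writer - which matches the stall you describe.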