On Tue, Oct 22, 2024 at 09:08:53PM +0200, Jann Horn wrote: > On Mon, Oct 21, 2024 at 10:46 PM Vlastimil Babka <vbabka@xxxxxxx> wrote: > > On 10/21/24 22:27, Lorenzo Stoakes wrote: > > > On Mon, Oct 21, 2024 at 10:11:29PM +0200, Vlastimil Babka wrote: > > >> On 10/20/24 18:20, Lorenzo Stoakes wrote: > > >> > + while (true) { > > >> > + /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */ > > >> > + err = walk_page_range_mm(vma->vm_mm, start, end, > > >> > + &guard_poison_walk_ops, NULL); > > >> > + if (err <= 0) > > >> > + return err; > > >> > + > > >> > + /* > > >> > + * OK some of the range have non-guard pages mapped, zap > > >> > + * them. This leaves existing guard pages in place. > > >> > + */ > > >> > + zap_page_range_single(vma, start, end - start, NULL); > > >> > > >> ... however the potentially endless loop doesn't seem great. Could a > > >> malicious program keep refaulting the range (ignoring any segfaults if it > > >> loses a race) with one thread while failing to make progress here with > > >> another thread? Is that ok because it would only punish itself? > > > > > > Sigh. Again, I don't think you've read the previous series have you? Or > > > even the changelog... I added this as Jann asked for it. Originally we'd > > > -EAGAIN if we got raced. See the discussion over in v1 for details. > > > > > > I did it that way specifically to avoid such things, but Jann didn't appear > > > to think it was a problem. > > > > If Jann is fine with this then it must be secure enough. > > My thinking there was: > > We can legitimately race with adjacent faults populating the area > we're operating on with THP pages; as long as the zapping and > poison-marker-setting are separate, *someone* will have to do the > retry. Either we do it in the kernel, or we tell userspace to handle > it, but having the kernel take care of it is preferable because it > makes the stable UAPI less messy. > > One easy way to do it in the kernel would be to return -ERESTARTNOINTR > after the zap_page_range_single() instead of jumping back up, which in > terms of locking and signal handling and such would be equivalent to > looping in userspace (because really that's what -ERESTARTNOINTR does > - it returns out to userspace and moves the instruction pointer back > to restart the syscall). Though if we do that immediately, it might > make MADV_POISON unnecessarily slow, so we should probably retry once > before doing that. The other easy way is to just loop here. Yes we should definitely retry probably a few times to cover the rare situation of a THP race as you describe under non-abusive circumstances. > > The cond_resched() and pending fatal signal check mean that (except on > CONFIG_PREEMPT_NONE) the only differences between the current > implementation and looping in userspace are that we don't handle > non-fatal signals in between iterations and that we keep hogging the > mmap_lock in read mode. We do already have a bunch of codepaths that > retry on concurrent page table changes, like when zap_pte_range() > encounters a pte_offset_map_lock() failure; though I guess the > difference is that the retry on those is just a couple instructions, > which would be harder to race consistently, while here we redo walks > across the entire range, which should be fairly easy to race > repeatedly. > > So I guess you have a point that this might be the easiest way to > stall other tasks that are trying to take mmap_lock for an extended > amount of time, I did not fully consider that... and then I guess you > could use that to slow down usercopy fault handling (once the lock > switches to handoff mode because of a stalled writer?) or slow down > other processes trying to read /proc/$pid/cmdline? Hm does that need a write lock? > > You can already indefinitely hog the mmap_lock with FUSE, though that > requires that you can mount a FUSE filesystem (which you wouldn't be > able in reasonably sandboxed code) and that you can find something > like a pin_user_pages() call that can't drop the mmap lock in between, > and there aren't actually that many of those... > > So I guess you have a point and the -ERESTARTNOINTR approach would be > a little bit nicer, as long as it's easy to implement. I can go ahead and do it that way if nobody objects, with a few loops before we do it... which hopefully covers off all the concerns? Thanks