Re: [PATCH 10/10] fsnotify: generate pre-content permission event on page fault

Jan Kara <jack@xxxxxxx> · Tue, 30 Jul 2024 14:18:37 +0200

On Mon 29-07-24 21:57:34, Amir Goldstein wrote:
> On Mon, Jul 29, 2024 at 8:11 PM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote:
> > > If I am reading correctly, iomap (i.e. xfs) write shared memory fault
> > > does not reach this code?
> > >
> > > Do we care about writable shared memory faults use case for HSM?
> > > It does not sound very relevant to HSM, but we cannot just ignore it..
> > >
> >
> > Sorry I realized I went off to try and solve this problem and never responded to
> > you.  I'm addressing the other comments, but this one is a little tricky.
> >
> > We're kind of stuck between a rock and a hard place with this.  I had originally
> > put this before the ->fault() callback, but purposefully moved it into
> > filemap_fault() because I want to be able to drop the mmap lock while we're
> > waiting for a response from the HSM.
> >
> > The reason to do this is because there are things that take the mmap lock for
> > simple things outside of the process, like /proc/$PID/smaps and other related
> > things, and this can cause high priority tasks to block behind possibly low
> > priority IO, creating a priority inversion.
> >
> > Now, I'm not sure how widespread of a problem this is anymore, I know there's
> > been work done to the kernel and tools to avoid this style of problem.  I'm ok
> > with a "try it and see" approach, but I don't love that.
> >
> 
> I defer this question to Jan.
> 
> > However I think putting fsnotify hooks into XFS itself for this particular path
> > is a good choice either.
> 
> I think you meant "not a good choice" and I agree -
> it is not only xfs, but could be any fs that will be converted to iomap
> Other fs have ->fault != filemap_fault, even if they do end up calling
> filemap_fault, IOW, there is no API guarantee that they will.
> 
> > What do you think?  Just move it to before ->fault(),
> > leave the mmap lock in place, and be done with it?
> 
> If Jan blesses the hook called with mmap lock, then yeh,
> putting the hook in the most generic "vfs" code would be
> the best choice for maintenance.

Well, I agree with Josef's comment about a rock and a hard place. For once,
regardless whether the hook will happen from before ->fault or from inside
the ->fault handler there will be fault callers where we cannot drop
mmap_lock (not all archs support dropping mmap_lock inside a fault AFAIR -
but a quick grep seems to show that these days maybe they do, also some
callers - most notably through GUP - don't allow dropping of mmap_lock
inside fault). So we have to have a way to handle a fault without
FAULT_FLAG_ALLOW_RETRY flag.

Now of course waiting for userspace reply to fanotify event with mmap_lock
held is ... dangerous. For example consider application with two threads:

T1					T2
page fault on inode I			write to inode I
  lock mm->mmap_lock			  inode_lock(I)
    send fanotify event			  ...
					  fault_in_iov_iter_readable()
					    lock mm->mmap_lock -> blocks
					      behind T1

now the HSM handler needs to fill in contents of inode I requested by the
page fault:

  inode_lock(I) -> deadlock

So conceptually I think the flow could look like (in __do_fault):

	if (!(vmf->flags & FAULT_FLAG_TRIED) &&
	    fsnotify_may_send_pre_content_event()) {
		if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
			return VM_FAULT_RETRY;
		fpin = maybe_unlock_mmap_for_io(vmf, NULL);
		if (!fpin)
			return ???VM_FAULT_SIGSEGV???;
		err = fsnotify_fault(...);
		if (err)
			return VM_FAULT_SIGBUS | VM_FAULT_RETRY;
		/*
		 * We are fine with proceeding with the fault. Retry the fault
		 * to let the filesystem handle it.
		 */
		return VM_FAULT_RETRY;
	}

The downside is that if we enter the page fault without ability to drop
mmap_lock on a file needing HSM handling, we get SIGSEGV. I'm not sure it
matters in practice because these are not that common paths e.g. stuff like
putting a breakpoint / uprobe on a file but maybe there are some surprises.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR