Re: [PATCH 10/10] fsnotify: generate pre-content permission event on page fault

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue 30-07-24 12:51:49, Josef Bacik wrote:
> On Tue, Jul 30, 2024 at 02:18:37PM +0200, Jan Kara wrote:
> > On Mon 29-07-24 21:57:34, Amir Goldstein wrote:
> > > On Mon, Jul 29, 2024 at 8:11 PM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote:
> > > > > If I am reading correctly, iomap (i.e. xfs) write shared memory fault
> > > > > does not reach this code?
> > > > >
> > > > > Do we care about writable shared memory faults use case for HSM?
> > > > > It does not sound very relevant to HSM, but we cannot just ignore it..
> > > > >
> > > >
> > > > Sorry I realized I went off to try and solve this problem and never responded to
> > > > you.  I'm addressing the other comments, but this one is a little tricky.
> > > >
> > > > We're kind of stuck between a rock and a hard place with this.  I had originally
> > > > put this before the ->fault() callback, but purposefully moved it into
> > > > filemap_fault() because I want to be able to drop the mmap lock while we're
> > > > waiting for a response from the HSM.
> > > >
> > > > The reason to do this is because there are things that take the mmap lock for
> > > > simple things outside of the process, like /proc/$PID/smaps and other related
> > > > things, and this can cause high priority tasks to block behind possibly low
> > > > priority IO, creating a priority inversion.
> > > >
> > > > Now, I'm not sure how widespread of a problem this is anymore, I know there's
> > > > been work done to the kernel and tools to avoid this style of problem.  I'm ok
> > > > with a "try it and see" approach, but I don't love that.
> > > >
> > > 
> > > I defer this question to Jan.
> > > 
> > > > However I think putting fsnotify hooks into XFS itself for this particular path
> > > > is a good choice either.
> > > 
> > > I think you meant "not a good choice" and I agree -
> > > it is not only xfs, but could be any fs that will be converted to iomap
> > > Other fs have ->fault != filemap_fault, even if they do end up calling
> > > filemap_fault, IOW, there is no API guarantee that they will.
> > > 
> > > > What do you think?  Just move it to before ->fault(),
> > > > leave the mmap lock in place, and be done with it?
> > > 
> > > If Jan blesses the hook called with mmap lock, then yeh,
> > > putting the hook in the most generic "vfs" code would be
> > > the best choice for maintenance.
> > 
> > Well, I agree with Josef's comment about a rock and a hard place. For once,
> > regardless whether the hook will happen from before ->fault or from inside
> > the ->fault handler there will be fault callers where we cannot drop
> > mmap_lock (not all archs support dropping mmap_lock inside a fault AFAIR -
> > but a quick grep seems to show that these days maybe they do, also some
> > callers - most notably through GUP - don't allow dropping of mmap_lock
> > inside fault). So we have to have a way to handle a fault without
> > FAULT_FLAG_ALLOW_RETRY flag.
> > 
> > Now of course waiting for userspace reply to fanotify event with mmap_lock
> > held is ... dangerous. For example consider application with two threads:
> > 
> > T1					T2
> > page fault on inode I			write to inode I
> >   lock mm->mmap_lock			  inode_lock(I)
> >     send fanotify event			  ...
> > 					  fault_in_iov_iter_readable()
> > 					    lock mm->mmap_lock -> blocks
> > 					      behind T1
> > 
> > now the HSM handler needs to fill in contents of inode I requested by the
> > page fault:
> > 
> >   inode_lock(I) -> deadlock
> > 
> > So conceptually I think the flow could look like (in __do_fault):
> > 
> > 	if (!(vmf->flags & FAULT_FLAG_TRIED) &&
> > 	    fsnotify_may_send_pre_content_event()) {
> > 		if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> > 			return VM_FAULT_RETRY;
> > 		fpin = maybe_unlock_mmap_for_io(vmf, NULL);
> > 		if (!fpin)
> > 			return ???VM_FAULT_SIGSEGV???;
> > 		err = fsnotify_fault(...);
> > 		if (err)
> > 			return VM_FAULT_SIGBUS | VM_FAULT_RETRY;
> > 		/*
> > 		 * We are fine with proceeding with the fault. Retry the fault
> > 		 * to let the filesystem handle it.
> > 		 */
> > 		return VM_FAULT_RETRY;
> > 	}
> > 
> > The downside is that if we enter the page fault without ability to drop
> > mmap_lock on a file needing HSM handling, we get SIGSEGV. I'm not sure it
> > matters in practice because these are not that common paths e.g. stuff like
> > putting a breakpoint / uprobe on a file but maybe there are some surprises.
> > 
> 
> The only thing I don't like about this is that now the fault handler
> loses the ability to drop the mmap sem.  I think in practice this doesn't
> really matter, because we're trying to avoid doing IO while under the
> mmap sem, and presumably this will have instantiated the pages to avoid
> the IO.
> 
> However if you use reflink this wouldn't be the case, and now we've
> re-introduced the priority inversion possiblity.

Hum, you're right. I was focused on handling the case when HSM needs to
pull in the page but the common case when the page is local, just not in
cache, is also very important.

> I'm leaning more towards just putting the fsnotify hook in the xfs code
> for the write case, and anybody else who implements their own ->fault
> without calling the generic one will just have to do the same.
> 
> That being said I'm not going to die on any particular hill here, if you
> still want to do the above before the ->fault() handler then I'll update
> my code to do that and we can move on.  Thanks,

I think for now issuing the pre content event from ->fault is OK. I'm not
thrilled about it (since it ultimately means duplication among page fault
handlers) but I don't see a cleaner solution that would perform reasonably
well.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux