On Mon 29-07-24 21:57:34, Amir Goldstein wrote: > On Mon, Jul 29, 2024 at 8:11 PM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote: > > > If I am reading correctly, iomap (i.e. xfs) write shared memory fault > > > does not reach this code? > > > > > > Do we care about writable shared memory faults use case for HSM? > > > It does not sound very relevant to HSM, but we cannot just ignore it.. > > > > > > > Sorry I realized I went off to try and solve this problem and never responded to > > you. I'm addressing the other comments, but this one is a little tricky. > > > > We're kind of stuck between a rock and a hard place with this. I had originally > > put this before the ->fault() callback, but purposefully moved it into > > filemap_fault() because I want to be able to drop the mmap lock while we're > > waiting for a response from the HSM. > > > > The reason to do this is because there are things that take the mmap lock for > > simple things outside of the process, like /proc/$PID/smaps and other related > > things, and this can cause high priority tasks to block behind possibly low > > priority IO, creating a priority inversion. > > > > Now, I'm not sure how widespread of a problem this is anymore, I know there's > > been work done to the kernel and tools to avoid this style of problem. I'm ok > > with a "try it and see" approach, but I don't love that. > > > > I defer this question to Jan. > > > However I think putting fsnotify hooks into XFS itself for this particular path > > is a good choice either. > > I think you meant "not a good choice" and I agree - > it is not only xfs, but could be any fs that will be converted to iomap > Other fs have ->fault != filemap_fault, even if they do end up calling > filemap_fault, IOW, there is no API guarantee that they will. > > > What do you think? Just move it to before ->fault(), > > leave the mmap lock in place, and be done with it? > > If Jan blesses the hook called with mmap lock, then yeh, > putting the hook in the most generic "vfs" code would be > the best choice for maintenance. Well, I agree with Josef's comment about a rock and a hard place. For once, regardless whether the hook will happen from before ->fault or from inside the ->fault handler there will be fault callers where we cannot drop mmap_lock (not all archs support dropping mmap_lock inside a fault AFAIR - but a quick grep seems to show that these days maybe they do, also some callers - most notably through GUP - don't allow dropping of mmap_lock inside fault). So we have to have a way to handle a fault without FAULT_FLAG_ALLOW_RETRY flag. Now of course waiting for userspace reply to fanotify event with mmap_lock held is ... dangerous. For example consider application with two threads: T1 T2 page fault on inode I write to inode I lock mm->mmap_lock inode_lock(I) send fanotify event ... fault_in_iov_iter_readable() lock mm->mmap_lock -> blocks behind T1 now the HSM handler needs to fill in contents of inode I requested by the page fault: inode_lock(I) -> deadlock So conceptually I think the flow could look like (in __do_fault): if (!(vmf->flags & FAULT_FLAG_TRIED) && fsnotify_may_send_pre_content_event()) { if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) return VM_FAULT_RETRY; fpin = maybe_unlock_mmap_for_io(vmf, NULL); if (!fpin) return ???VM_FAULT_SIGSEGV???; err = fsnotify_fault(...); if (err) return VM_FAULT_SIGBUS | VM_FAULT_RETRY; /* * We are fine with proceeding with the fault. Retry the fault * to let the filesystem handle it. */ return VM_FAULT_RETRY; } The downside is that if we enter the page fault without ability to drop mmap_lock on a file needing HSM handling, we get SIGSEGV. I'm not sure it matters in practice because these are not that common paths e.g. stuff like putting a breakpoint / uprobe on a file but maybe there are some surprises. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR