On Mon, Jul 29, 2024 at 8:11 PM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote: > > On Thu, Jul 25, 2024 at 11:19:33PM +0300, Amir Goldstein wrote: > > On Thu, Jul 25, 2024 at 9:20 PM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote: > > > > > > FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending > > > on the faulting method. > > > > > > This pre-content event is meant to be used by hierarchical storage > > > managers that want to fill in the file content on first read access. > > > > > > Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx> > > > --- > > > fs/notify/fsnotify.c | 13 +++++++++ > > > include/linux/fsnotify_backend.h | 14 +++++++++ > > > mm/filemap.c | 50 ++++++++++++++++++++++++++++---- > > > 3 files changed, 71 insertions(+), 6 deletions(-) > > > > > > diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c > > > index 1ca4a8da7f29..435232d46b4f 100644 > > > --- a/fs/notify/fsnotify.c > > > +++ b/fs/notify/fsnotify.c > > > @@ -28,6 +28,19 @@ void __fsnotify_vfsmount_delete(struct vfsmount *mnt) > > > fsnotify_clear_marks_by_mount(mnt); > > > } > > > > > > +bool fsnotify_file_has_content_watches(struct file *file) > > > > nit: has_pre_content_watches... > > > > > +{ > > > + struct inode *inode = file_inode(file); > > > + struct super_block *sb = inode->i_sb; > > > + struct mount *mnt = real_mount(file->f_path.mnt); > > > + u32 mask = inode->i_fsnotify_mask; > > > + > > > + mask |= mnt->mnt_fsnotify_mask; > > > + mask |= sb->s_fsnotify_mask; > > > + > > > + return !!(mask & FSNOTIFY_PRE_CONTENT_EVENTS); > > > > This can use the fsnotify_object_watched() helper, and it will need > > the READ_ONCE() that are just being added to avoid data races. > > > > > +} > > > + > > > /** > > > * fsnotify_unmount_inodes - an sb is unmounting. handle any watched inodes. > > > * @sb: superblock being unmounted. > > > diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h > > > index 36c3d18cc40a..6983fbf096b8 100644 > > > --- a/include/linux/fsnotify_backend.h > > > +++ b/include/linux/fsnotify_backend.h > > > @@ -900,6 +900,15 @@ static inline void fsnotify_init_event(struct fsnotify_event *event) > > > INIT_LIST_HEAD(&event->list); > > > } > > > > > > +#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS > > > +bool fsnotify_file_has_content_watches(struct file *file); > > > +#else > > > +static inline bool fsnotify_file_has_content_watches(struct file *file) > > > +{ > > > + return false; > > > +} > > > +#endif /* CONFIG_FANOTIFY_ACCESS_PERMISSIONS */ > > > + > > > #else > > > > > > static inline int fsnotify(__u32 mask, const void *data, int data_type, > > > @@ -938,6 +947,11 @@ static inline u32 fsnotify_get_cookie(void) > > > static inline void fsnotify_unmount_inodes(struct super_block *sb) > > > {} > > > > > > +static inline bool fsnotify_file_has_content_watches(struct file *file) > > > +{ > > > + return false; > > > +} > > > + > > > #endif /* CONFIG_FSNOTIFY */ > > > > > > #endif /* __KERNEL __ */ > > > diff --git a/mm/filemap.c b/mm/filemap.c > > > index ca8c8d889eef..cc9d7885bbe3 100644 > > > --- a/mm/filemap.c > > > +++ b/mm/filemap.c > > > @@ -46,6 +46,7 @@ > > > #include <linux/pipe_fs_i.h> > > > #include <linux/splice.h> > > > #include <linux/rcupdate_wait.h> > > > +#include <linux/fsnotify.h> > > > #include <asm/pgalloc.h> > > > #include <asm/tlbflush.h> > > > #include "internal.h" > > > @@ -3112,13 +3113,13 @@ static int lock_folio_maybe_drop_mmap(struct vm_fault *vmf, struct folio *folio, > > > * that. If we didn't pin a file then we return NULL. The file that is > > > * returned needs to be fput()'ed when we're done with it. > > > */ > > > -static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) > > > +static struct file *do_sync_mmap_readahead(struct vm_fault *vmf, > > > + struct file *fpin) > > > { > > > struct file *file = vmf->vma->vm_file; > > > struct file_ra_state *ra = &file->f_ra; > > > struct address_space *mapping = file->f_mapping; > > > DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff); > > > - struct file *fpin = NULL; > > > unsigned long vm_flags = vmf->vma->vm_flags; > > > unsigned int mmap_miss; > > > > > > @@ -3182,12 +3183,12 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) > > > * was pinned if we have to drop the mmap_lock in order to do IO. > > > */ > > > static struct file *do_async_mmap_readahead(struct vm_fault *vmf, > > > - struct folio *folio) > > > + struct folio *folio, > > > + struct file *fpin) > > > { > > > > If I am reading correctly, iomap (i.e. xfs) write shared memory fault > > does not reach this code? > > > > Do we care about writable shared memory faults use case for HSM? > > It does not sound very relevant to HSM, but we cannot just ignore it.. > > > > Sorry I realized I went off to try and solve this problem and never responded to > you. I'm addressing the other comments, but this one is a little tricky. > > We're kind of stuck between a rock and a hard place with this. I had originally > put this before the ->fault() callback, but purposefully moved it into > filemap_fault() because I want to be able to drop the mmap lock while we're > waiting for a response from the HSM. > > The reason to do this is because there are things that take the mmap lock for > simple things outside of the process, like /proc/$PID/smaps and other related > things, and this can cause high priority tasks to block behind possibly low > priority IO, creating a priority inversion. > > Now, I'm not sure how widespread of a problem this is anymore, I know there's > been work done to the kernel and tools to avoid this style of problem. I'm ok > with a "try it and see" approach, but I don't love that. > I defer this question to Jan. > However I think putting fsnotify hooks into XFS itself for this particular path > is a good choice either. I think you meant "not a good choice" and I agree - it is not only xfs, but could be any fs that will be converted to iomap Other fs have ->fault != filemap_fault, even if they do end up calling filemap_fault, IOW, there is no API guarantee that they will. > What do you think? Just move it to before ->fault(), > leave the mmap lock in place, and be done with it? If Jan blesses the hook called with mmap lock, then yeh, putting the hook in the most generic "vfs" code would be the best choice for maintenance. Thanks, Amir.