On Fri 02-08-24 12:03:57, Josef Bacik wrote: > On Thu, Aug 01, 2024 at 11:40:25PM +0200, Jan Kara wrote: > > On Thu 25-07-24 14:19:47, Josef Bacik wrote: > > > FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending > > > on the faulting method. > > > > > > This pre-content event is meant to be used by hierarchical storage > > > managers that want to fill in the file content on first read access. > > > > > > Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx> > > ... > > > @@ -3287,6 +3288,35 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) > > > if (unlikely(index >= max_idx)) > > > return VM_FAULT_SIGBUS; > > > > > > + /* > > > + * If we have pre-content watchers then we need to generate events on > > > + * page fault so that we can populate any data before the fault. > > > + * > > > + * We only do this on the first pass through, otherwise the populating > > > + * application could potentially deadlock on the mmap lock if it tries > > > + * to populate it with mmap. > > > + */ > > > + if (fault_flag_allow_retry_first(vmf->flags) && > > > + fsnotify_file_has_content_watches(file)) { > > > > I'm somewhat nervous that if ALLOW_RETRY isn't set, we'd silently jump into > > readpage code without ever sending pre-content event and thus we'd possibly > > expose invalid content to userspace? I think we should fail the fault if > > fsnotify_file_has_content_watches(file) && !(vmf->flags & > > FAULT_FLAG_ALLOW_RETRY). > > I was worried about this too but it seems to not be the case that we'll not ever > have ALLOW_RETRY. That being said I'm fine turning this into a sigbus. Do you mean that with your workloads we always have ALLOW_RETRY set? As I wrote, currently you'd have to try really hard to hit such paths but they are there - for example if you place uprobe on an address in a VMA that is not present, the page fault is going to happen without ALLOW_RETRY set. > > > + int mask = (vmf->flags & FAULT_FLAG_WRITE) ? MAY_WRITE : MAY_READ; > > > + loff_t pos = vmf->pgoff << PAGE_SHIFT; > > > + > > > + fpin = maybe_unlock_mmap_for_io(vmf, fpin); > > > + > > > + /* > > > + * We can only emit the event if we did actually release the > > > + * mmap lock. > > > + */ > > > + if (fpin) { > > > + error = fsnotify_file_area_perm(fpin, mask, &pos, > > > + PAGE_SIZE); > > > + if (error) { > > > + fput(fpin); > > > + return VM_FAULT_ERROR; > > > + } > > > + } > > > + } > > > + > > > /* > > > * Do we have something in the page cache already? > > > */ > > ... > > > @@ -3612,6 +3643,13 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, > > > unsigned long rss = 0; > > > unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type; > > > > > > + /* > > > + * We are under RCU, we can't emit events here, we need to force a > > > + * normal fault to make sure the events get sent. > > > + */ > > > + if (fsnotify_file_has_content_watches(file)) > > > + return ret; > > > + > > > > I don't think we need to do anything for filemap_map_pages(). The call just > > inserts page cache content into page tables and whatever is in the page > > cache and has folio_uptodate() set should be already valid file content, > > shouldn't it? > > I'll make this comment more clear. filemap_fault() will start readahead, > but we'll only emit the event for the page size that we're faulting. I > had looked at putting this at the readahead place and figuring out the > readahead size, but literally anything could trigger readahead so it's > better to just not allow filemap_map_pages() to happen, otherwise we'll > end up with empty pages (if the content hasn't been populated yet) and > never emit an event for those ranges. This seems like an interesting problem. Even ordinary read(2) will trigger readahead and as you say, we would be instantiating folios with wrong content (zeros) due to that. It seems as a fragile design to keep such folios in the page cache and place checks in all the places that could possibly make their content visible to the user. I'd rather make sure that if we pull folios into page cache (and set folio_uptodate() bit), their content is indeed valid. What we could do is to turn off readahead on the inode if fsnotify_file_has_content_watches() is true. Essentially the handler of the precontent event can do a much better job of prefilling the page cache with whatever content is needed in a range that makes sense. And then we could leave filemap_map_pages() intact. What do you think guys? Honza > Thanks, > > Josef > -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR