On Fri, Jan 31, 2025 at 08:19:22PM -0500, Peter Xu wrote: > On Fri, Jan 31, 2025 at 11:59:56AM -0800, Linus Torvalds wrote: > > On Fri, 31 Jan 2025 at 11:17, Alex Williamson > > <alex.williamson@xxxxxxxxxx> wrote: > > > > > > 20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches") > > > > > > This breaks huge_fault support for PFNMAPs that was recently added in > > > v6.12 and is used by vfio-pci to fault device memory using PMD and PUD > > > order mappings. > > > > Surely only for content watches? > > > > Which shouldn't be a valid situation *anyway*. > > > > IOW, there must be some unrelated bug somewhere: either somebody is > > allowed to set a pre-content match on a special device. > > > > That should be disabled by the whole > > > > /* > > * If there are permission event watchers but no pre-content event > > * watchers, set FMODE_NONOTIFY | FMODE_NONOTIFY_PERM to indicate that. > > */ > > > > thing in file_set_fsnotify_mode() which only allows regular files and > > directories to be notified on. > > > > Or, alternatively, that check for huge-fault disabling is just > > checking the wrong bits. > > > > Or - quite possibly - I am missing something obvious? > > Is it possible that we have some paths got overlooked in setting up the > fsnotify bits in f_mode? Meanwhile since the default is "no bit set" on > those bits, I think it means FMODE_FSNOTIFY_HSM() can always return true on > those if overlooked.. > > One thing to mention is, /dev/vfio/* are chardevs, however the PCI bars are > not mmap()ed from these fds - whatever under /dev/vfio/* represents IOMMU > groups rather than the device fd itself. > > The app normally needs to first open the IOMMU group fd under /dev/vfio/*, > then using VFIO ioctl(VFIO_GROUP_GET_DEVICE_FD) to get the device fd, which > will be the mmap() target, instead of the ones under /dev. Ok, but those "device fds" aren't really device fds in the sense that they are character fds. They are regular files afaict from: vfio_device_open_file(struct vfio_device *device) (Well, it's actually worse as anon_inode_getfile() files don't have any mode at all but that's beside the point.)? In any case, I think you're right that such files would (accidently?) qualify for content watches afaict. So at least that should probably get FMODE_NONOTIFY. > > I checked, those device fds were allocated from vfio_device_open_file() > within the ioctl, which internally uses anon_inode_getfile(). I don't see > anywhere in that path that will set the fanotify bits.. > > Further, I'm not sure whether some callers of alloc_file() can also suffer Sidenote, mm/memfd.c should pretty please rename alloc_file() to memfd_alloc_file() or something. That would be great because alloc_file() is a local fs/file_table.c helper and grepping for it is confusing as I first thought someone made alloc_file() available outside of fs/file_table.c > from similar issue, because at least memfd_create() syscall also uses the > API, which (hopefully?) would used to allow THPs for shmem backed memfds on > aligned mmap()s, but not sure whether it'll also wrongly trigger the > FALLBACK path similarly in create_huge_pmd() just like vfio's VMAs. I > didn't verify it though, nor did I yet check more users. > > So I wonder whether we should setup the fanotify bits in at least > alloc_file() too (to FMODE_NONOTIFY?). > > I'm totally not familiar with fanotify, and it's a bit late to try verify > anything (I cannot quickly find my previous huge pfnmap setup, so setup > those will also take time..). but maybe above can provide some clues for > others.. > > Thanks, > > -- > Peter Xu >