On Thu, 23 Jun 2022 20:07:14 +0200
David Hildenbrand <david@xxxxxxxxxx> wrote:

> On 15.06.22 17:56, Jason Gunthorpe wrote:
> > On Sat, Jun 11, 2022 at 08:29:47PM +0200, David Hildenbrand wrote:
> >> On 11.06.22 00:35, Alex Williamson wrote:
> >>> The commit referenced below subtly and inadvertently changed the
> >>> logic to disallow pinning of zero pfns. This breaks device
> >>> assignment with vfio and potentially various other users of gup.
> >>> Exclude the zero page test from the negation.
> >>
> >> I wonder which setups can reliably work with a long-term pin on a
> >> shared zeropage. In a MAP_PRIVATE mapping, any write access via the
> >> page tables will end up replacing the shared zeropage with an
> >> anonymous page. Something similar should apply in MAP_SHARED
> >> mappings, when lazily allocating disk blocks.
>
> ^ correction, shared zeropage is never used in MAP_SHARED mappings
> (fortunately).
>
> >>
> >> In the future, we might trigger unsharing when taking a R/O pin for
> >> the shared zeropage, just like we do as of now upstream for shared
> >> anonymous pages (!PageAnonExclusive). And something similar could
> >> then be done when finding a !anon page in a MAP_SHARED mapping.
> >
> > I'm also confused how qemu is hitting this and it isn't already a
> > bug?
>
> I assume it's just some random thingy mapped into the guest physical
> address space (by the bios? R/O?) that actually never ends up getting
> used by a device.
>
> So vfio simply needs this to keep working ... but we won't actually
> ever use that data.
>
> But this is just my best guess after thinking about it.

Good guess.

> > It is arising because vfio doesn't use FOLL_FORCE|FOLL_WRITE to
> > move away the zero page in most cases.
> >
> > And why does Yishai say it causes an infinite loop in the kernel?
>
> Good question. Maybe $something keeps retrying if pinning fails,
> either in the kernel (which would be bad) or in user space. At least
> QEMU seems to just fail if pinning fails, but maybe it's a different
> user space?

The loop is in __gup_longterm_locked():

	do {
		rc = __get_user_pages_locked(mm, start, nr_pages, pages,
					     vmas, NULL, gup_flags);
		if (rc <= 0)
			break;
		rc = check_and_migrate_movable_pages(rc, pages, gup_flags);
	} while (!rc);

It appears we're pinning a 32-page (128K) range:
__get_user_pages_locked() returns 32, but
check_and_migrate_movable_pages() perpetually returns zero. I believe
this is because folio_is_pinnable() previously returned true and now
returns false. We therefore drop down to fail at folio_isolate_lru(),
incrementing isolation_error_count. From there we do nothing more than
unpin the pages, return zero, and hope for better luck next time,
which obviously doesn't happen.

If I generate an errno here, QEMU reports failing on the pc.rom memory
region at 0xc0000. Thanks,

Alex
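
P.S. For anyone following along, a minimal sketch of the negation bug
as I read it. This assumes the underlying helper is is_pinnable_page()
in include/linux/mm.h (which folio_is_pinnable() wraps) and it
paraphrases the logic rather than quoting the actual commits; CMA and
other checks are omitted for brevity:

	/*
	 * Regressed form (sketch): the zero pfn test sits inside the
	 * negation, so the shared zeropage is reported as unpinnable
	 * and check_and_migrate_movable_pages() keeps trying, and
	 * failing, to isolate and migrate it.
	 */
	static inline bool is_pinnable_page(struct page *page)
	{
		return !(is_zone_movable_page(page) ||
			 is_zero_pfn(page_to_pfn(page)));
	}

	/*
	 * Fixed form (sketch): excluding the zero page test from the
	 * negation makes zero pfns pinnable again.
	 */
	static inline bool is_pinnable_page(struct page *page)
	{
		return !is_zone_movable_page(page) ||
			is_zero_pfn(page_to_pfn(page));
	}

With the fixed form, as I read it, check_and_migrate_movable_pages()
skips the zeropage folio, returns the pinned page count rather than
zero, and the do { } while (!rc) loop above terminates.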