David Hildenbrand <david@xxxxxxxxxx> writes: > On 07.07.22 21:03, Alex Sierra wrote: >> From: Alistair Popple <apopple@xxxxxxxxxx> >> >> Currently any attempts to pin a device coherent page will fail. This is >> because device coherent pages need to be managed by a device driver, and >> pinning them would prevent a driver from migrating them off the device. >> >> However this is no reason to fail pinning of these pages. These are >> coherent and accessible from the CPU so can be migrated just like >> pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin >> them first try migrating them out of ZONE_DEVICE. >> >> Signed-off-by: Alistair Popple <apopple@xxxxxxxxxx> >> Acked-by: Felix Kuehling <Felix.Kuehling@xxxxxxx> >> [hch: rebased to the split device memory checks, >> moved migrate_device_page to migrate_device.c] >> Signed-off-by: Christoph Hellwig <hch@xxxxxx> >> --- >> mm/gup.c | 47 +++++++++++++++++++++++++++++++++++----- >> mm/internal.h | 1 + >> mm/migrate_device.c | 53 +++++++++++++++++++++++++++++++++++++++++++++ >> 3 files changed, 96 insertions(+), 5 deletions(-) >> >> diff --git a/mm/gup.c b/mm/gup.c >> index b65fe8bf5af4..9b6b9923d22d 100644 >> --- a/mm/gup.c >> +++ b/mm/gup.c >> @@ -1891,9 +1891,43 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, >> continue; >> prev_folio = folio; >> >> - if (folio_is_longterm_pinnable(folio)) >> + /* >> + * Device private pages will get faulted in during gup so it >> + * shouldn't be possible to see one here. >> + */ >> + if (WARN_ON_ONCE(folio_is_device_private(folio))) { >> + ret = -EFAULT; >> + goto unpin_pages; >> + } > > I'd just drop that. Device private pages are never part of a present PTE. So if we > could actually get a grab of one via GUP we would be in bigger trouble ... > already before this patch. Fair. >> + >> + /* >> + * Device coherent pages are managed by a driver and should not >> + * be pinned indefinitely as it prevents the driver moving the >> + * page. So when trying to pin with FOLL_LONGTERM instead try >> + * to migrate the page out of device memory. >> + */ >> + if (folio_is_device_coherent(folio)) { >> + WARN_ON_ONCE(PageCompound(&folio->page)); > > Maybe that belongs into migrate_device_page()? Ok (noting Matthew's comment there as well). >> + >> + /* >> + * Migration will fail if the page is pinned, so convert > > [...] > >> /* >> * mm/gup.c >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c >> index cf9668376c5a..5decd26dd551 100644 >> --- a/mm/migrate_device.c >> +++ b/mm/migrate_device.c >> @@ -794,3 +794,56 @@ void migrate_vma_finalize(struct migrate_vma *migrate) >> } >> } >> EXPORT_SYMBOL(migrate_vma_finalize); >> + >> +/* >> + * Migrate a device coherent page back to normal memory. The caller should have >> + * a reference on page which will be copied to the new page if migration is >> + * successful or dropped on failure. >> + */ >> +struct page *migrate_device_page(struct page *page, unsigned int gup_flags) > > Function name should most probably indicate that we're dealing with coherent pages here? Ok. >> +{ >> + unsigned long src_pfn, dst_pfn = 0; >> + struct migrate_vma args; >> + struct page *dpage; >> + >> + lock_page(page); >> + src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE; >> + args.src = &src_pfn; >> + args.dst = &dst_pfn; >> + args.cpages = 1; >> + args.npages = 1; >> + args.vma = NULL; >> + migrate_vma_setup(&args); >> + if (!(src_pfn & MIGRATE_PFN_MIGRATE)) >> + return NULL; > > Wow, these refcount and page locking/unlocking rules with this migrate_* api are > confusing now. And the usage here of sometimes returning and sometimes falling > trough don't make it particularly easier to understand here. > > I'm not 100% happy about reusing migrate_vma_setup() usage if there *is no VMA*. > That's just absolutely confusing, because usually migrate_vma_setup() itself > would do the collection step and ref+lock pages. :/ > > In general, I can see why/how we're reusing the migrate_vma_* API here, but there > is absolutely no VMA ... not sure what to improve besides providing a second API > that does a simple single-page migration. But that can be changed later ... Yeah, as noted in your other response I think it should be ok to just call migrate_vma_unmap() directly from migrate_device_page() so I assume that would adequately deal with this. >> + >> + dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0); >> + > > alloc_page() > >> + /* >> + * get/pin the new page now so we don't have to retry gup after >> + * migrating. We already have a reference so this should never fail. >> + */ >> + if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) { >> + __free_pages(dpage, 0); > > __free_page() > >> + dpage = NULL; >> + } > > Hm, this means that we are not pinning via the PTE at hand, but via something > we expect migration to put into the PTE. I'm not really happy about this. > > Ideally, we'd make the pinning decision only on the actual GUP path, not in here. > Just like in the migrate_pages() case, where we end up dropping all refs/pins > and looking up again via GUP from the PTE. > > For example, I wonder if something nasty could happen if the PTE got mapped > R/O in the meantime and you're pinning R/W here ... > > TBH, all this special casing on gup_flags here is nasty. Please, let's just do > it like migrate_pages() and do another GUP walk. Absolutely no need to optimize. The only reason to pass gup_flags is to check FOLL_PIN vs. FOLL_GET so that we can do the right reference on the destination page. I did the optimisation because we already have the destination page with a reference and GUP/PUP does not make any guarantees about the current PTE state anyway. However I noticed there might be a race here - during migration we replace present PTEs with migration entries. On fork these get copied via copy_nonpresent_pte() and made read-only. However we don't check if the page a migration entry points to is pinned or not. For an ordinary PTE copy_present_pte() would copy the page for a COW mapping, but this won't happen if the page happens to be undergoing migration (even though the migration will ultimately fail due to the pin). Anyway I don't think this patch currently makes that any worse, but if we fix the above it will because there is a brief period during which the page we're pinning won't look like a pinned page. So I will go with the suggestion to do another GUP walk. > [...] > > > > I'd go with something like the following on top (which does not touch on the > general semantic issue with migrate_vma_* ). Note that I most probably messed > up some refcount/lock handling and that it's broken. > Just to give you an idea what I think could be cleaner. Thanks! At a glance it looks roughly right but I will check and respin it to incorporate the comments. > diff --git a/mm/gup.c b/mm/gup.c > index 9b6b9923d22d..17041b3e605e 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -1881,7 +1881,7 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, > unsigned long isolation_error_count = 0, i; > struct folio *prev_folio = NULL; > LIST_HEAD(movable_page_list); > - bool drain_allow = true; > + bool drain_allow = true, any_device_coherent = false; > int ret = 0; > > for (i = 0; i < nr_pages; i++) { > @@ -1891,15 +1891,6 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, > continue; > prev_folio = folio; > > - /* > - * Device private pages will get faulted in during gup so it > - * shouldn't be possible to see one here. > - */ > - if (WARN_ON_ONCE(folio_is_device_private(folio))) { > - ret = -EFAULT; > - goto unpin_pages; > - } > - > /* > * Device coherent pages are managed by a driver and should not > * be pinned indefinitely as it prevents the driver moving the > @@ -1907,7 +1898,12 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, > * to migrate the page out of device memory. > */ > if (folio_is_device_coherent(folio)) { > - WARN_ON_ONCE(PageCompound(&folio->page)); > + /* > + * We always want a new GUP lookup with device coherent > + * pages. > + */ > + any_device_coherent = true; > + pages[i] = 0; > > /* > * Migration will fail if the page is pinned, so convert > @@ -1918,11 +1914,12 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, > unpin_user_page(&folio->page); > } > > - pages[i] = migrate_device_page(&folio->page, gup_flags); > - if (!pages[i]) { > - ret = -EBUSY; > + ret = migrate_device_coherent_page(&folio->page); > + if (ret) > goto unpin_pages; > - } > + /* The reference to our folio is stale now. */ > + prev_folio = NULL; > + folio = NULL; > continue; > } > > @@ -1953,7 +1950,8 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, > folio_nr_pages(folio)); > } > > - if (!list_empty(&movable_page_list) || isolation_error_count) > + if (!list_empty(&movable_page_list) || isolation_error_count || > + any_device_coherent) > goto unpin_pages; > > /* > @@ -1963,14 +1961,19 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, > return nr_pages; > > unpin_pages: > - for (i = 0; i < nr_pages; i++) { > - if (!pages[i]) > - continue; > + /* We have to be careful if we stumbled over device coherent pages. */ > + if (unlikely(any_device_coherent || !(gup_flags & FOLL_PIN))) { > + for (i = 0; i < nr_pages; i++) { > + if (!pages[i]) > + continue; > > - if (gup_flags & FOLL_PIN) > - unpin_user_page(pages[i]); > - else > - put_page(pages[i]); > + if (gup_flags & FOLL_PIN) > + unpin_user_page(pages[i]); > + else > + put_page(pages[i]); > + } > + } else { > + unpin_user_pages(pages, nr_pages); > } > > if (!list_empty(&movable_page_list)) { > diff --git a/mm/internal.h b/mm/internal.h > index eeab4ee7a4a3..899dab512c5a 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -853,7 +853,7 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, > unsigned long addr, int page_nid, int *flags); > > void free_zone_device_page(struct page *page); > -struct page *migrate_device_page(struct page *page, unsigned int gup_flags); > +int migrate_device_coherent_page(struct page *page); > > /* > * mm/gup.c > diff --git a/mm/migrate_device.c b/mm/migrate_device.c > index 5decd26dd551..dfb78ea3d326 100644 > --- a/mm/migrate_device.c > +++ b/mm/migrate_device.c > @@ -797,53 +797,40 @@ EXPORT_SYMBOL(migrate_vma_finalize); > > /* > * Migrate a device coherent page back to normal memory. The caller should have > - * a reference on page which will be copied to the new page if migration is > - * successful or dropped on failure. > + * a reference on page, which will be dropped on return. > */ > -struct page *migrate_device_page(struct page *page, unsigned int gup_flags) > +int migrate_device_coherent_page(struct page *page) > { > unsigned long src_pfn, dst_pfn = 0; > - struct migrate_vma args; > + struct migrate_vma args = { > + .src = &src_pfn, > + .dst = &dst_pfn, > + .cpages = 1, > + .npages = 1, > + .vma = NULL, > + }; > struct page *dpage; > > + VM_WARN_ON_ONCE(PageCompound(page)); > + > lock_page(page); > src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE; > - args.src = &src_pfn; > - args.dst = &dst_pfn; > - args.cpages = 1; > - args.npages = 1; > - args.vma = NULL; > - migrate_vma_setup(&args); > - if (!(src_pfn & MIGRATE_PFN_MIGRATE)) > - return NULL; > - > - dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0); > - > - /* > - * get/pin the new page now so we don't have to retry gup after > - * migrating. We already have a reference so this should never fail. > - */ > - if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) { > - __free_pages(dpage, 0); > - dpage = NULL; > - } > > - if (dpage) { > - lock_page(dpage); > - dst_pfn = migrate_pfn(page_to_pfn(dpage)); > + migrate_vma_setup(&args); > + if (src_pfn & MIGRATE_PFN_MIGRATE) { > + dpage = alloc_page(GFP_USER | __GFP_NOWARN); > + if (dpage) { > + dst_pfn = migrate_pfn(page_to_pfn(dpage)); > + lock_page(dpage); > + } > } > > migrate_vma_pages(&args); > if (src_pfn & MIGRATE_PFN_MIGRATE) > copy_highpage(dpage, page); > migrate_vma_finalize(&args); > - if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) { > - if (gup_flags & FOLL_PIN) > - unpin_user_page(dpage); > - else > - put_page(dpage); > - dpage = NULL; > - } > > - return dpage; > + if (src_pfn & MIGRATE_PFN_MIGRATE) > + return 0; > + return -EBUSY; > } > -- > 2.35.3