Hi David, > > For drivers that would like to longterm-pin the pages associated > > with a file, the pin_user_pages_fd() API provides an option to > > not only FOLL_PIN the pages but also to check and migrate them > > if they reside in movable zone or CMA block. For now, this API > > can only work with files belonging to shmem or hugetlbfs given > > that the udmabuf driver is the only user. > > Maybe add "Other files are rejected.". Wasn't clear to me before I > looked into the code. Ok, will add it in v2. > > > > > It must be noted that the pages associated with hugetlbfs files > > are expected to be found in the page cache. An error is returned > > if they are not found. However, shmem pages can be swapped in or > > allocated if they are not present in the page cache. > > > > Cc: David Hildenbrand <david@xxxxxxxxxx> > > Cc: Daniel Vetter <daniel.vetter@xxxxxxxx> > > Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx> > > Cc: Hugh Dickins <hughd@xxxxxxxxxx> > > Cc: Peter Xu <peterx@xxxxxxxxxx> > > Cc: Gerd Hoffmann <kraxel@xxxxxxxxxx> > > Cc: Dongwon Kim <dongwon.kim@xxxxxxxxx> > > Cc: Junxiao Chang <junxiao.chang@xxxxxxxxx> > > Suggested-by: Jason Gunthorpe <jgg@xxxxxxxxxx> > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@xxxxxxxxx> > > --- > > include/linux/mm.h | 2 ++ > > mm/gup.c | 87 > ++++++++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 89 insertions(+) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index bf5d0b1b16f4..af2121fb8101 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -2457,6 +2457,8 @@ long get_user_pages_unlocked(unsigned long > start, unsigned long nr_pages, > > struct page **pages, unsigned int gup_flags); > > long pin_user_pages_unlocked(unsigned long start, unsigned long > nr_pages, > > struct page **pages, unsigned int gup_flags); > > +long pin_user_pages_fd(int fd, pgoff_t start, unsigned long nr_pages, > > + unsigned int gup_flags, struct page **pages); > > > > int get_user_pages_fast(unsigned long start, int nr_pages, > > unsigned int gup_flags, struct page **pages); > > diff --git a/mm/gup.c b/mm/gup.c > > index 2f8a2d89fde1..e34b77a15fa8 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -3400,3 +3400,90 @@ long pin_user_pages_unlocked(unsigned long > start, unsigned long nr_pages, > > &locked, gup_flags); > > } > > EXPORT_SYMBOL(pin_user_pages_unlocked); > > + > > This does look quite neat, nice! Let's take a closer look ... > > > +/** > > + * pin_user_pages_fd() - pin user pages associated with a file > > + * @fd: the fd whose pages are to be pinned > > + * @start: starting file offset > > + * @nr_pages: number of pages from start to pin > > + * @gup_flags: flags modifying pin behaviour > > ^ I assume we should drop that. At least for now the flags are > completely unused. And most likely we would want a different set of > flags later (GUPFD_ ...). Right now, FOLL_LONGTERM is the only accepted value for gup_flags but yes, as you suggest, this can be made implicit by dropping gup_flags. > > > + * @pages: array that receives pointers to the pages pinned. > > + * Should be at least nr_pages long. > > + * > > + * Attempt to pin (and migrate) pages associated with a file belonging to > > I'd drop the "and migrate" part, it's more of an implementation detail. > > > + * either shmem or hugetlbfs. An error is returned if pages associated with > > + * hugetlbfs files are not present in the page cache. However, shmem > pages > > + * are swapped in or allocated if they are not present in the page cache. > > Why don't we do the same for hugetlbfs? Would make the interface more > streamlined. I am going off of what Mike has stated previously: "It may not matter to your users, but the semantics for hugetlb and shmem pages is different. hugetlb requires the pages exist in the page cache while shmem will create/add pages to the cache if necessary." However, if we were to allocate a hugepage (assuming one is not present in the page cache at a given index), what needs to be done in addition to calling these APIs? folio = alloc_hugetlb_folio_nodemask(h, NUMA_NO_NODE, NULL, GFP_USER) hugetlb_add_to_page_cache(folio, mapping, idx) > > Certainly add that pinned pages have to be released using > unpin_user_pages(). Sure, will include that in v2. > > > + * > > + * Returns number of pages pinned. This would be equal to the number of > > + * pages requested. > > + * If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns > > + * -errno. > > + */ > > +long pin_user_pages_fd(int fd, pgoff_t start, unsigned long nr_pages, > > + unsigned int gup_flags, struct page **pages) > > +{ > > + struct page *page; > > + struct file *filep; > > + unsigned int flags, i; > > + long ret; > > + > > + if (nr_pages <= 0) > > + return 0; > > I think we should just forbid that and use a WARN_ON_ONCE() here / > return -EINVAL. So we'll never end up returning 0. I think I'll drop this check in v2 as Jason suggested. > > > + if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_PIN)) > > + return 0; > > + > > + if (start < 0) > > + return -EINVAL; > > + > > + filep = fget(fd); > > + if (!filep) > > + return -EINVAL; > > + > > + if (!shmem_file(filep) && !is_file_hugepages(filep)) > > + return -EINVAL; > > + > > + flags = memalloc_pin_save(); > > + do { > > + for (i = 0; i < nr_pages; i++) { > > + if (shmem_mapping(filep->f_mapping)) { > > + page = shmem_read_mapping_page(filep- > >f_mapping, > > + start + i); > > + if (IS_ERR(page)) { > > + ret = PTR_ERR(page); > > + goto err; > > + } > > + } else { > > + page = find_get_page_flags(filep->f_mapping, > > + start + i, > > + FGP_ACCESSED); > > + if (!page) { > > + ret = -EINVAL; > > + goto err; > > + } > > + } > > + ret = try_grab_page(page, FOLL_PIN); > > + if (unlikely(ret)) > > + goto err; > > + > > + pages[i] = page; > > + put_page(pages[i]); > > + } > > + > > + ret = check_and_migrate_movable_pages(nr_pages, pages); > > + } while (ret == -EAGAIN); > > + > > +err: > > + memalloc_pin_restore(flags); > > + fput(filep); > > + if (!ret) > > + return nr_pages; > > + > > + while (i > 0 && pages[--i]) { > > + unpin_user_page(pages[i]); > > + pages[i] = NULL; > > If migrate_longterm_unpinnable_pages() failed, say with -ENOMEM, the > pages were already unpinned, but pages[i] has not been cleared, no? You are right; the above while should not be executed in that case. I added this chunk to cleanup after any errors thrown in the for loop above. I guess I need to add a new error label to cleanup after errors thrown by check_and_migrate_movable_pages(). Thanks, Vivek > > > + } > > + return ret; > > +} > > +EXPORT_SYMBOL_GPL(pin_user_pages_fd); > > + > > -- > Cheers, > > David / dhildenb