Hi David, > On 05.12.23 06:35, Vivek Kasireddy wrote: > > For drivers that would like to longterm-pin the pages associated > > with a memfd, the pin_user_pages_fd() API provides an option to > > not only pin the pages via FOLL_PIN but also to check and migrate > > them if they reside in movable zone or CMA block. This API > > currently works with memfds but it should work with any files > > that belong to either shmemfs or hugetlbfs. Files belonging to > > other filesystems are rejected for now. > > > > The pages need to be located first before pinning them via FOLL_PIN. > > If they are found in the page cache, they can be immediately pinned. > > Otherwise, they need to be allocated using the filesystem specific > > APIs and then pinned. > > > > v2: > > - Drop gup_flags and improve comments and commit message (David) > > - Allocate a page if we cannot find in page cache for the hugetlbfs > > case as well (David) > > - Don't unpin pages if there is a migration related failure (David) > > - Drop the unnecessary nr_pages <= 0 check (Jason) > > - Have the caller of the API pass in file * instead of fd (Jason) > > > > v3: (David) > > - Enclose the huge page allocation code with #ifdef > CONFIG_HUGETLB_PAGE > > (Build error reported by kernel test robot <lkp@xxxxxxxxx>) > > - Don't forget memalloc_pin_restore() on non-migration related errors > > - Improve the readability of the cleanup code associated with > > non-migration related errors > > - Augment the comments by describing FOLL_LONGTERM like behavior > > - Include the R-b tag from Jason > > > > v4: > > - Remove the local variable "page" and instead use 3 return statements > > in alloc_file_page() (David) > > - Add the R-b tag from David > > > > v5: (David) > > - For hugetlb case, ensure that we only obtain head pages from the > > mapping by using __filemap_get_folio() instead of find_get_page_flags() > > - Handle -EEXIST when two or more potential users try to simultaneously > > add a huge page to the mapping by forcing them to retry on failure > > > > v6: (Christoph) > > - Rename this API to memfd_pin_user_pages() to make it clear that it > > is intended for memfds > > - Move the memfd page allocation helper from gup.c to memfd.c > > - Fix indentation errors in memfd_pin_user_pages() > > - For contiguous ranges of folios, use a helper such as > > filemap_get_folios_contig() to lookup the page cache in batches > > > > Cc: David Hildenbrand <david@xxxxxxxxxx> > > Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx> > > Cc: Daniel Vetter <daniel.vetter@xxxxxxxx> > > Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx> > > Cc: Hugh Dickins <hughd@xxxxxxxxxx> > > Cc: Peter Xu <peterx@xxxxxxxxxx> > > Cc: Gerd Hoffmann <kraxel@xxxxxxxxxx> > > Cc: Dongwon Kim <dongwon.kim@xxxxxxxxx> > > Cc: Junxiao Chang <junxiao.chang@xxxxxxxxx> > > Suggested-by: Jason Gunthorpe <jgg@xxxxxxxxxx> > > Reviewed-by: Jason Gunthorpe <jgg@xxxxxxxxxx> (v2) > > Reviewed-by: David Hildenbrand <david@xxxxxxxxxx> (v3) > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@xxxxxxxxx> > > --- > > include/linux/memfd.h | 5 +++ > > include/linux/mm.h | 2 + > > mm/gup.c | 102 ++++++++++++++++++++++++++++++++++++++++++ > > mm/memfd.c | 34 ++++++++++++++ > > 4 files changed, 143 insertions(+) > > > > diff --git a/include/linux/memfd.h b/include/linux/memfd.h > > index e7abf6fa4c52..6fc0d1282151 100644 > > --- a/include/linux/memfd.h > > +++ b/include/linux/memfd.h > > @@ -6,11 +6,16 @@ > > > > #ifdef CONFIG_MEMFD_CREATE > > extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int > arg); > > +extern struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx); > > #else > > static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a) > > { > > return -EINVAL; > > } > > +static inline struct page *memfd_alloc_page(struct file *memfd, pgoff_t > idx) > > +{ > > + return ERR_PTR(-EINVAL); > > +} > > #endif > > > > #endif /* __LINUX_MEMFD_H */ > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 418d26608ece..ac69db45509f 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long > start, unsigned long nr_pages, > > struct page **pages, unsigned int gup_flags); > > long pin_user_pages_unlocked(unsigned long start, unsigned long > nr_pages, > > struct page **pages, unsigned int gup_flags); > > +long memfd_pin_user_pages(struct file *file, pgoff_t start, > > + unsigned long nr_pages, struct page **pages); > > > > int get_user_pages_fast(unsigned long start, int nr_pages, > > unsigned int gup_flags, struct page **pages); > > diff --git a/mm/gup.c b/mm/gup.c > > index 231711efa390..eb93d1ec9dc6 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -5,6 +5,7 @@ > > #include <linux/spinlock.h> > > > > #include <linux/mm.h> > > +#include <linux/memfd.h> > > #include <linux/memremap.h> > > #include <linux/pagemap.h> > > #include <linux/rmap.h> > > @@ -17,6 +18,7 @@ > > #include <linux/hugetlb.h> > > #include <linux/migrate.h> > > #include <linux/mm_inline.h> > > +#include <linux/pagevec.h> > > #include <linux/sched/mm.h> > > #include <linux/shmem_fs.h> > > > > @@ -3410,3 +3412,103 @@ long pin_user_pages_unlocked(unsigned long > start, unsigned long nr_pages, > > &locked, gup_flags); > > } > > EXPORT_SYMBOL(pin_user_pages_unlocked); > > + > > +/** > > + * memfd_pin_user_pages() - pin user pages associated with a memfd > > + * @memfd: the memfd whose pages are to be pinned > > + * @start: starting memfd offset > > + * @nr_pages: number of pages from start to pin > > + * @pages: array that receives pointers to the pages pinned. > > + * Should be at-least nr_pages long. > > + * > > + * Attempt to pin pages associated with a memfd; given that a memfd is > either > > + * backed by shmem or hugetlb, the pages can either be found in the page > cache > > + * or need to be allocated if necessary. Once the pages are located, they > are > > + * all pinned via FOLL_PIN. And, these pinned pages need to be released > either > > + * using unpin_user_pages() or unpin_user_page(). > > + * > > + * It must be noted that the pages may be pinned for an indefinite amount > > + * of time. And, in most cases, the duration of time they may stay pinned > > + * would be controlled by the userspace. This behavior is effectively the > > + * same as using FOLL_LONGTERM with other GUP APIs. > > + * > > + * Returns number of pages pinned. This would be equal to the number of > > + * pages requested. If no pages were pinned, it returns -errno. > > + */ > > +long memfd_pin_user_pages(struct file *memfd, pgoff_t start, > > + unsigned long nr_pages, struct page **pages) > > +{ > > + pgoff_t start_idx, end_idx = start + nr_pages - 1; > > + unsigned int flags, nr_folios, i, j; > > + struct folio_batch fbatch; > > + struct page *page = NULL; > > + struct folio *folio; > > + long ret; > > + > > + if (!nr_pages) > > + return -EINVAL; > > + > > + if (!memfd) > > + return -EINVAL; > > + > > + if (!shmem_file(memfd) && !is_file_hugepages(memfd)) > > + return -EINVAL; > > + > > + flags = memalloc_pin_save(); > > + do { > > + folio_batch_init(&fbatch); > > + start_idx = start; > > + i = 0; > > + > > + while (start_idx <= end_idx) { > > + /* > > + * In most cases, we should be able to find the page > > + * in the page cache. If we cannot find it for some > > + * reason, we try to allocate one and add it to the > > + * page cache. > > + */ > > + nr_folios = filemap_get_folios_contig(memfd- > >f_mapping, > > + &start_idx, > > + end_idx, > > + &fbatch); > > + if (page) { > > + put_page(page); > > + page = NULL; > > + } > > + for (j = 0; j < nr_folios; j++) { > > + folio = fbatch.folios[j]; > > + ret = try_grab_page(&folio->page, FOLL_PIN); > > + if (unlikely(ret)) { > > + folio_batch_release(&fbatch); > > + goto err; > > + } > > + > > + pages[i++] = &folio->page; > > + } > > I might be wrong, but that interface is still inconsistent. I think your > intention is to always return folios (head pages), but why are we > returning pages from this interface then? > > It would be more consistent regarding the other GUP interfaces to return > the actual tail pages that fit the given "pgoff_t start". So if you > punch in "nr_pages" you expect to get "nr_pages" pages, and not some > other number of folios. > > Otherwise, this interface is highly confusing. > > If you always want to return folios, then better name it > "memfd_pin_user_folios" (or just "memfd_pin_folios") and pass in a range > (instead of a nr_pages parameter), and somehow indicate to the caller > how many folio were in that range, and if that range was fully covered. I think it makes sense to return folios from this interface; and considering my use-case, I'd like have this API return an error if it cannot pin (or allocate) the exact number of folios the caller requested. > > Or am I missing something? I can make the udmabuf driver use folios instead of pages too but the function check_and_migrate_movable_pages() in GUP still takes a list of pages. Do you think it is ok to use a local variable to collect all the head pages for this? Thanks, Vivek > > -- > Cheers, > > David / dhildenb >