On Sun, Nov 03, 2019 at 01:18:00PM -0800, John Hubbard wrote: > Introduce pin_user_pages*() variations of get_user_pages*() calls, > and also pin_longterm_pages*() variations. > > These variants all set FOLL_PIN, which is also introduced, and > thoroughly documented. > > The pin_longterm*() variants also set FOLL_LONGTERM, in addition > to FOLL_PIN: > > pin_user_pages() > pin_user_pages_remote() > pin_user_pages_fast() > > pin_longterm_pages() > pin_longterm_pages_remote() > pin_longterm_pages_fast() > > All pages that are pinned via the above calls, must be unpinned via > put_user_page(). > > The underlying rules are: > > * These are gup-internal flags, so the call sites should not directly > set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with > assertions, for the new FOLL_PIN flag. However, for the pre-existing > FOLL_LONGTERM flag, which has some call sites that still directly > set FOLL_LONGTERM, there is no assertion yet. > > * Call sites that want to indicate that they are going to do DirectIO > ("DIO") or something with similar characteristics, should call a > get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers > will: > * Start with "pin_user_pages" instead of "get_user_pages". That > makes it easy to find and audit the call sites. > * Set FOLL_PIN > > * For pages that are received via FOLL_PIN, those pages must be returned > via put_user_page(). > > Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases > in this documentation. (I've reworded it and expanded on it slightly.) > > Cc: Jonathan Corbet <corbet@xxxxxxx> > Cc: Ira Weiny <ira.weiny@xxxxxxxxx> > Signed-off-by: John Hubbard <jhubbard@xxxxxxxxxx> Few nitpick belows, nonetheless: Reviewed-by: Jérôme Glisse <jglisse@xxxxxxxxxx> > --- > Documentation/vm/index.rst | 1 + > Documentation/vm/pin_user_pages.rst | 212 ++++++++++++++++++++++ > include/linux/mm.h | 62 ++++++- > mm/gup.c | 265 +++++++++++++++++++++++++--- > 4 files changed, 514 insertions(+), 26 deletions(-) > create mode 100644 Documentation/vm/pin_user_pages.rst > [...] > diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst > new file mode 100644 > index 000000000000..3910f49ca98c > --- /dev/null > +++ b/Documentation/vm/pin_user_pages.rst [...] > + > +FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags > +========================================================== > + > +Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing > +these categories: > + > +CASE 1: Direct IO (DIO) > +----------------------- > +There are GUP references to pages that are serving > +as DIO buffers. These buffers are needed for a relatively short time (so they > +are not "long term"). No special synchronization with page_mkclean() or > +munmap() is provided. Therefore, flags to set at the call site are: :: > + > + FOLL_PIN > + > +...but rather than setting FOLL_PIN directly, call sites should use one of > +the pin_user_pages*() routines that set FOLL_PIN. > + > +CASE 2: RDMA > +------------ > +There are GUP references to pages that are serving as DMA > +buffers. These buffers are needed for a long time ("long term"). No special > +synchronization with page_mkclean() or munmap() is provided. Therefore, flags > +to set at the call site are: :: > + > + FOLL_PIN | FOLL_LONGTERM > + > +NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's > +because DAX pages do not have a separate page cache, and so "pinning" implies > +locking down file system blocks, which is not (yet) supported in that way. > + > +CASE 3: ODP > +----------- > +(Mellanox/Infiniband On Demand Paging: the hardware supports > +replayable page faulting). There are GUP references to pages serving as DMA > +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean() > +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag > +needs to be set. I would not include ODP or anything like it here, they do not use GUP anymore and i believe it is more confusing here. I would how- ever include some text in this documentation explaining that hard- ware that support page fault is superior as it does not incur any of the issues described here. > + > +CASE 4: Pinning for struct page manipulation only > +------------------------------------------------- > +Here, normal GUP calls are sufficient, so neither flag needs to be set. > + [...] > diff --git a/mm/gup.c b/mm/gup.c > index 199da99e8ffc..1aea48427879 100644 > --- a/mm/gup.c > +++ b/mm/gup.c [...] > @@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, > BUG_ON(*locked != 1); > } > > - if (pages) > + /* > + * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior > + * is to set FOLL_GET if the caller wants pages[] filled in (but has > + * carelessly failed to specify FOLL_GET), so keep doing that, but only > + * for FOLL_GET, not for the newer FOLL_PIN. > + * > + * FOLL_PIN always expects pages to be non-null, but no need to assert > + * that here, as any failures will be obvious enough. > + */ > + if (pages && !(flags & FOLL_PIN)) > flags |= FOLL_GET; Did you look at user that have pages and not FOLL_GET set ? I believe it would be better to first fix them to end up with FOLL_GET set and then error out if pages is != NULL but nor FOLL_GET or FOLL_PIN is set. > > pages_done = 0; > @@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages, > return ret; > } > > -/** > - * get_user_pages_fast() - pin user pages in memory > - * @start: starting user address > - * @nr_pages: number of pages from start to pin > - * @gup_flags: flags modifying pin behaviour > - * @pages: array that receives pointers to the pages pinned. > - * Should be at least nr_pages long. > - * > - * Attempt to pin user pages in memory without taking mm->mmap_sem. > - * If not successful, it will fall back to taking the lock and > - * calling get_user_pages(). > - * > - * Returns number of pages pinned. This may be fewer than the number > - * requested. If nr_pages is 0 or negative, returns 0. If no pages > - * were pinned, returns -errno. > - */ > -int get_user_pages_fast(unsigned long start, int nr_pages, > - unsigned int gup_flags, struct page **pages) > +static int internal_get_user_pages_fast(unsigned long start, int nr_pages, > + unsigned int gup_flags, > + struct page **pages) Usualy function are rename to _old_func_name ie add _ in front. So here it would become _get_user_pages_fast but i know some people don't like that as sometimes we endup with ___function_overloaded :) > { > unsigned long addr, len, end; > int nr = 0, ret = 0; > @@ -2435,4 +2449,215 @@ int get_user_pages_fast(unsigned long start, int nr_pages, [...] > +/** > + * pin_user_pages_remote() - pin pages for (typically) use by Direct IO, and > + * return the pages to the user. Not a fan of (typically) maybe: pin_user_pages_remote() - pin pages of a remote process (task != current) I think here the remote part if more important that DIO. Remote is use by other thing that DIO. > + * > + * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See > + * get_user_pages_remote() for documentation on the function arguments, because > + * the arguments here are identical. > + * > + * FOLL_PIN means that the pages must be released via put_user_page(). Please > + * see Documentation/vm/pin_user_pages.rst for details. > + * > + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It > + * is NOT intended for Case 2 (RDMA: long-term pins). > + */ > +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, > + unsigned long start, unsigned long nr_pages, > + unsigned int gup_flags, struct page **pages, > + struct vm_area_struct **vmas, int *locked) > +{ > + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ > + if (WARN_ON_ONCE(gup_flags & FOLL_GET)) > + return -EINVAL; > + > + gup_flags |= FOLL_TOUCH | FOLL_REMOTE | FOLL_PIN; > + > + return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas, > + locked, gup_flags); > +} > +EXPORT_SYMBOL(pin_user_pages_remote); > + > +/** > + * pin_longterm_pages_remote() - pin pages for (typically) use by Direct IO, and > + * return the pages to the user. I think you copy pasted this from pin_user_pages_remote() :) > + * > + * Nearly the same as get_user_pages_remote(), but note that FOLL_TOUCH is not > + * set, and FOLL_PIN and FOLL_LONGTERM are set. See get_user_pages_remote() for > + * documentation on the function arguments, because the arguments here are > + * identical. > + * > + * FOLL_PIN means that the pages must be released via put_user_page(). Please > + * see Documentation/vm/pin_user_pages.rst for further details. > + * > + * FOLL_LONGTERM means that the pages are being pinned for "long term" use, > + * typically by a non-CPU device, and we cannot be sure that waiting for a > + * pinned page to become unpin will be effective. > + * > + * This is intended for Case 2 (RDMA: long-term pins) in > + * Documentation/vm/pin_user_pages.rst. > + */ > +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm, > + unsigned long start, unsigned long nr_pages, > + unsigned int gup_flags, struct page **pages, > + struct vm_area_struct **vmas, int *locked) > +{ > + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ > + if (WARN_ON_ONCE(gup_flags & FOLL_GET)) > + return -EINVAL; > + > + /* > + * FIXME: as noted in the get_user_pages_remote() implementation, it > + * is not yet possible to safely set FOLL_LONGTERM here. FOLL_LONGTERM > + * needs to be set, but for now the best we can do is a "TODO" item. > + */ > + gup_flags |= FOLL_REMOTE | FOLL_PIN; Wouldn't it be better to not add pin_longterm_pages_remote() until it can be properly implemented ?