Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

Christian Brauner <christian@xxxxxxxxxx> · Fri, 28 Aug 2020 20:25:34 +0200

On Fri, Aug 28, 2020 at 8:24 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>
> On 8/28/20 11:40 AM, Arnd Bergmann wrote:
> > On Mon, Jun 22, 2020 at 9:29 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >> So finally, the API is as follows,
> >>
> >>      ssize_t process_madvise(int pidfd, const struct iovec *iovec,
> >>                unsigned long vlen, int advice, unsigned int flags);
> >
> > I had not followed the discussion earlier and only now came across
> > the syscall in linux-next, sorry for stirring things up this late.
> >
> >> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> >> index 94bf4958d114..8f959d90338a 100644
> >> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> >> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> >> @@ -364,6 +364,7 @@
> >>  440    common  watch_mount             sys_watch_mount
> >>  441    common  watch_sb                sys_watch_sb
> >>  442    common  fsinfo                  sys_fsinfo
> >> +443    64      process_madvise         sys_process_madvise
> >>
> >>  #
> >>  # x32-specific system call numbers start at 512 to avoid cache impact
> >> @@ -407,3 +408,4 @@
> >>  545    x32     execveat                compat_sys_execveat
> >>  546    x32     preadv2                 compat_sys_preadv64v2
> >>  547    x32     pwritev2                compat_sys_pwritev64v2
> >> +548    x32     process_madvise         compat_sys_process_madvise
> >
> > I think we should not add any new x32-specific syscalls. Instead I think
> > the compat_sys_process_madvise/sys_process_madvise can be
> > merged into one.
> >
> >> +       mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> >> +       if (IS_ERR_OR_NULL(mm)) {
> >> +               ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> >> +               goto release_task;
> >> +       }
> >
> > Minor point: Having to use IS_ERR_OR_NULL() tends to be fragile,
> > and I would try to avoid that. Can mm_access() be changed to
> > itself return PTR_ERR(-ESRCH) instead of NULL to improve its
> > calling conventions? I see there are only three other callers.
> >
> >
> >> +       ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
> >> +       if (ret >= 0) {
> >> +               ret = do_process_madvise(pidfd, &iter, behavior, flags);
> >> +               kfree(iov);
> >> +       }
> >> +       return ret;
> >> +}
> >> +
> >> +#ifdef CONFIG_COMPAT
> > ...
> >> +
> >> +       ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
> >> +                               &iov, &iter);
> >> +       if (ret >= 0) {
> >> +               ret = do_process_madvise(pidfd, &iter, behavior, flags);
> >> +               kfree(iov);
> >> +       }
> >
> > Every syscall that passes an iovec seems to do this. If we make import_iovec()
> > handle both cases directly, this syscall and a number of others can
> > be simplified, and you avoid the x32 entry point I mentioned above
> >
> > Something like (untested)
> >
> > index dad8d0cfaaf7..0de4ddff24c1 100644
> > --- a/lib/iov_iter.c
> > +++ b/lib/iov_iter.c
> > @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct
> > iovec __user * uvector,
> >  {
> >         ssize_t n;
> >         struct iovec *p;
> > -       n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs,
> > -                                 *iov, &p);
> > +
> > +       if (in_compat_syscall())

I suggested the exact same solutions roughly 1.5 weeks ago. :)
Fun when I saw you mentioning this in BBB I knew exactly what you were
referring too. :)

Christian