On 8/28/20 11:40 AM, Arnd Bergmann wrote: > On Mon, Jun 22, 2020 at 9:29 PM Minchan Kim <minchan@xxxxxxxxxx> wrote: >> So finally, the API is as follows, >> >> ssize_t process_madvise(int pidfd, const struct iovec *iovec, >> unsigned long vlen, int advice, unsigned int flags); > > I had not followed the discussion earlier and only now came across > the syscall in linux-next, sorry for stirring things up this late. > >> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl >> index 94bf4958d114..8f959d90338a 100644 >> --- a/arch/x86/entry/syscalls/syscall_64.tbl >> +++ b/arch/x86/entry/syscalls/syscall_64.tbl >> @@ -364,6 +364,7 @@ >> 440 common watch_mount sys_watch_mount >> 441 common watch_sb sys_watch_sb >> 442 common fsinfo sys_fsinfo >> +443 64 process_madvise sys_process_madvise >> >> # >> # x32-specific system call numbers start at 512 to avoid cache impact >> @@ -407,3 +408,4 @@ >> 545 x32 execveat compat_sys_execveat >> 546 x32 preadv2 compat_sys_preadv64v2 >> 547 x32 pwritev2 compat_sys_pwritev64v2 >> +548 x32 process_madvise compat_sys_process_madvise > > I think we should not add any new x32-specific syscalls. Instead I think > the compat_sys_process_madvise/sys_process_madvise can be > merged into one. > >> + mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS); >> + if (IS_ERR_OR_NULL(mm)) { >> + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; >> + goto release_task; >> + } > > Minor point: Having to use IS_ERR_OR_NULL() tends to be fragile, > and I would try to avoid that. Can mm_access() be changed to > itself return PTR_ERR(-ESRCH) instead of NULL to improve its > calling conventions? I see there are only three other callers. > > >> + ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter); >> + if (ret >= 0) { >> + ret = do_process_madvise(pidfd, &iter, behavior, flags); >> + kfree(iov); >> + } >> + return ret; >> +} >> + >> +#ifdef CONFIG_COMPAT > ... >> + >> + ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), >> + &iov, &iter); >> + if (ret >= 0) { >> + ret = do_process_madvise(pidfd, &iter, behavior, flags); >> + kfree(iov); >> + } > > Every syscall that passes an iovec seems to do this. If we make import_iovec() > handle both cases directly, this syscall and a number of others can > be simplified, and you avoid the x32 entry point I mentioned above > > Something like (untested) > > index dad8d0cfaaf7..0de4ddff24c1 100644 > --- a/lib/iov_iter.c > +++ b/lib/iov_iter.c > @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct > iovec __user * uvector, > { > ssize_t n; > struct iovec *p; > - n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs, > - *iov, &p); > + > + if (in_compat_syscall()) > + n = compat_rw_copy_check_uvector(type, uvector, nr_segs, > + fast_segs, *iov, &p); > + else > + n = rw_copy_check_uvector(type, uvector, nr_segs, > + fast_segs, *iov, &p); > if (n < 0) { > if (p != *iov) > kfree(p); Doesn't work for the async case, where you want to be holding on to the allocated iovec. But in general I think it's a good helper for the sync case, which is by far the majority. -- Jens Axboe