On 8/28/20 12:24 PM, Jens Axboe wrote: > On 8/28/20 11:40 AM, Arnd Bergmann wrote: >> On Mon, Jun 22, 2020 at 9:29 PM Minchan Kim <minchan@xxxxxxxxxx> wrote: >>> So finally, the API is as follows, >>> >>> ssize_t process_madvise(int pidfd, const struct iovec *iovec, >>> unsigned long vlen, int advice, unsigned int flags); >> >> I had not followed the discussion earlier and only now came across >> the syscall in linux-next, sorry for stirring things up this late. >> >>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl >>> index 94bf4958d114..8f959d90338a 100644 >>> --- a/arch/x86/entry/syscalls/syscall_64.tbl >>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl >>> @@ -364,6 +364,7 @@ >>> 440 common watch_mount sys_watch_mount >>> 441 common watch_sb sys_watch_sb >>> 442 common fsinfo sys_fsinfo >>> +443 64 process_madvise sys_process_madvise >>> >>> # >>> # x32-specific system call numbers start at 512 to avoid cache impact >>> @@ -407,3 +408,4 @@ >>> 545 x32 execveat compat_sys_execveat >>> 546 x32 preadv2 compat_sys_preadv64v2 >>> 547 x32 pwritev2 compat_sys_pwritev64v2 >>> +548 x32 process_madvise compat_sys_process_madvise >> >> I think we should not add any new x32-specific syscalls. Instead I think >> the compat_sys_process_madvise/sys_process_madvise can be >> merged into one. >> >>> + mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS); >>> + if (IS_ERR_OR_NULL(mm)) { >>> + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; >>> + goto release_task; >>> + } >> >> Minor point: Having to use IS_ERR_OR_NULL() tends to be fragile, >> and I would try to avoid that. Can mm_access() be changed to >> itself return PTR_ERR(-ESRCH) instead of NULL to improve its >> calling conventions? I see there are only three other callers. >> >> >>> + ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter); >>> + if (ret >= 0) { >>> + ret = do_process_madvise(pidfd, &iter, behavior, flags); >>> + kfree(iov); >>> + } >>> + return ret; >>> +} >>> + >>> +#ifdef CONFIG_COMPAT >> ... >>> + >>> + ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), >>> + &iov, &iter); >>> + if (ret >= 0) { >>> + ret = do_process_madvise(pidfd, &iter, behavior, flags); >>> + kfree(iov); >>> + } >> >> Every syscall that passes an iovec seems to do this. If we make import_iovec() >> handle both cases directly, this syscall and a number of others can >> be simplified, and you avoid the x32 entry point I mentioned above >> >> Something like (untested) >> >> index dad8d0cfaaf7..0de4ddff24c1 100644 >> --- a/lib/iov_iter.c >> +++ b/lib/iov_iter.c >> @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct >> iovec __user * uvector, >> { >> ssize_t n; >> struct iovec *p; >> - n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs, >> - *iov, &p); >> + >> + if (in_compat_syscall()) >> + n = compat_rw_copy_check_uvector(type, uvector, nr_segs, >> + fast_segs, *iov, &p); >> + else >> + n = rw_copy_check_uvector(type, uvector, nr_segs, >> + fast_segs, *iov, &p); >> if (n < 0) { >> if (p != *iov) >> kfree(p); > > Doesn't work for the async case, where you want to be holding on to the > allocated iovec. But in general I think it's a good helper for the sync > case, which is by far the majority. Nevermind, I'm an idiot for reading this totally wrong. -- Jens Axboe