Hi Mathieu

On 14 November 2017 at 21:03, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> This new cpu_opv system call executes a vector of operations on behalf
> of user-space on a specific CPU with preemption disabled. It is inspired
> from readv() and writev() system calls which take a "struct iovec" array
> as argument.

Do you have a man page for this syscall already?

Thanks,

Michael

> The operations available are: comparison, memcpy, add, or, and, xor,
> left shift, right shift, and mb. The system call receives a CPU number
> from user-space as argument, which is the CPU on which those operations
> need to be performed. All preparation steps such as loading pointers,
> and applying offsets to arrays, need to be performed by user-space
> before invoking the system call. The "comparison" operation can be used
> to check that the data used in the preparation step did not change
> between preparation of system call inputs and operation execution within
> the preempt-off critical section.
>
> The reason why we require all pointer offsets to be calculated by
> user-space beforehand is because we need to use get_user_pages_fast() to
> first pin all pages touched by each operation. This takes care of
> faulting-in the pages. Then, preemption is disabled, and the operations
> are performed atomically with respect to other thread execution on that
> CPU, without generating any page fault.
>
> A maximum limit of 16 operations per cpu_opv syscall invocation is
> enforced, so user-space cannot generate a too long preempt-off critical
> section. Each operation is also limited to a length of PAGE_SIZE bytes,
> meaning that an operation can touch a maximum of 4 pages (memcpy: 2
> pages for source, 2 pages for destination if addresses are not aligned
> on page boundaries). Moreover, a total limit of 4216 bytes is applied
> to operation lengths.
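The limits described above can be sketched in C as follows. This is only a minimal illustration, assuming 4096-byte pages; every sketch_/SKETCH_ name is hypothetical and merely mirrors the arithmetic in the description (the patch itself implements these checks in cpu_op_range_nr_pages() and cpu_opv_check(), which also skip the length of barrier operations).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; values mirror the limits quoted above. */
#define SKETCH_PAGE_SHIFT	12	/* assumption: 4096-byte pages */
#define SKETCH_PAGE_SIZE	(1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_VEC_LEN_MAX	16	/* max operations per vector */
#define SKETCH_VEC_DATA_LEN_MAX	(4096 + 15 * 8)	/* 4216 bytes in total */

/* Number of pages spanned by the byte range [addr, addr + len). */
static unsigned long sketch_range_nr_pages(unsigned long addr,
					   unsigned long len)
{
	assert(len > 0);
	return ((addr + len - 1) >> SKETCH_PAGE_SHIFT)
		- (addr >> SKETCH_PAGE_SHIFT) + 1;
}

/* Return 1 if the operation lengths respect all three limits, 0 otherwise. */
static int sketch_vec_len_ok(const uint32_t *lens, int nr)
{
	uint32_t sum = 0;
	int i;

	if (nr < 0 || nr > SKETCH_VEC_LEN_MAX)
		return 0;	/* too many operations */
	for (i = 0; i < nr; i++) {
		if (lens[i] > SKETCH_PAGE_SIZE)
			return 0;	/* per-operation limit */
		sum += lens[i];
	}
	return sum <= SKETCH_VEC_DATA_LEN_MAX;	/* whole-vector limit */
}
```

A PAGE_SIZE-long range that does not start on a page boundary spans 2 pages, so an unaligned memcpy pins 2 pages for the source and 2 for the destination, which is where the 4-page maximum comes from. One 4096-byte operation plus fifteen 8-byte operations adds up to exactly 4216 bytes.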
>
> If the thread is not running on the requested CPU, a new
> push_task_to_cpu() is invoked to migrate the task to the requested CPU.
> If the requested CPU is not part of the cpus allowed mask of the thread,
> the system call fails with EINVAL. After the migration has been
> performed, preemption is disabled, and the current CPU number is checked
> again and compared to the requested CPU number. If it still differs, it
> means the scheduler migrated us away from that CPU. Return EAGAIN to
> user-space in that case, and let user-space retry (either requesting the
> same CPU number, or a different one, depending on the user-space
> algorithm constraints).
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> CC: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
> CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> CC: Paul Turner <pjt@xxxxxxxxxx>
> CC: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> CC: Andrew Hunter <ahh@xxxxxxxxxx>
> CC: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
> CC: Andi Kleen <andi@xxxxxxxxxxxxxx>
> CC: Dave Watson <davejwatson@xxxxxx>
> CC: Chris Lameter <cl@xxxxxxxxx>
> CC: Ingo Molnar <mingo@xxxxxxxxxx>
> CC: "H. Peter Anvin" <hpa@xxxxxxxxx>
> CC: Ben Maurer <bmaurer@xxxxxx>
> CC: Steven Rostedt <rostedt@xxxxxxxxxxx>
> CC: Josh Triplett <josh@xxxxxxxxxxxxxxxx>
> CC: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> CC: Russell King <linux@xxxxxxxxxxxxxxxx>
> CC: Catalin Marinas <catalin.marinas@xxxxxxx>
> CC: Will Deacon <will.deacon@xxxxxxx>
> CC: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
> CC: Boqun Feng <boqun.feng@xxxxxxxxx>
> CC: linux-api@xxxxxxxxxxxxxxx
> ---
>
> Changes since v1:
> - handle CPU hotplug,
> - cleanup implementation using function pointers: We can use function
>   pointers to implement the operations rather than duplicating all the
>   user-access code.
> - refuse device pages: Performing cpu_opv operations on io map'd pages
>   with preemption disabled could generate long preempt-off critical
>   sections, which leads to unwanted scheduler latency. Return EFAULT if
>   a device page is received as parameter
> - restrict op vector to 4216 bytes length sum: Restrict the operation
>   vector to length sum of:
>   - 4096 bytes (typical page size on most architectures, should be
>     enough for a string, or structures)
>   - 15 * 8 bytes (typical operations on integers or pointers).
>   The goal here is to keep the duration of preempt off critical section
>   short, so we don't add significant scheduler latency.
> - Add INIT_ONSTACK macro: Introduce the
>   CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users
>   correctly initialize the upper bits of CPU_OP_FIELD_u32_u64() on their
>   stack to 0 on 32-bit architectures.
> - Add CPU_MB_OP operation:
>   Use-cases with:
>   - two consecutive stores,
>   - a memcpy followed by a store,
>   require a memory barrier before the final store operation. A typical
>   use-case is a store-release on the final store. Given that this is a
>   slow path, just providing an explicit full barrier instruction should
>   be sufficient.
> - Add expect fault field:
>   The use-case of list_pop brings interesting challenges. With rseq, we
>   can use rseq_cmpnev_storeoffp_load(), and therefore load a pointer,
>   compare it against NULL, add an offset, and load the target "next"
>   pointer from the object, all within a single rseq critical section.
>
>   Life is not so easy for cpu_opv in this use-case, mainly because we
>   need to pin all pages we are going to touch in the preempt-off
>   critical section beforehand. So we need to know the target object (in
>   which we apply an offset to fetch the next pointer) when we pin pages
>   before disabling preemption.
>
>   So the approach is to load the head pointer and compare it against
>   NULL in user-space, before doing the cpu_opv syscall. User-space can
>   then compute the address of the head->next field, *without loading it*.
>
>   The cpu_opv system call will first need to pin all pages associated
>   with input data. This includes the page backing the head->next object,
>   which may have been concurrently deallocated and unmapped. Therefore,
>   in this case, getting -EFAULT when trying to pin those pages may
>   happen: it just means they have been concurrently unmapped. This is
>   an expected situation, and should just return -EAGAIN to user-space,
>   so user-space can distinguish between "should retry" type of
>   situations and actual errors that should be handled with extreme
>   prejudice to the program (e.g. abort()).
>
>   Therefore, add "expect_fault" fields along with op input address
>   pointers, so user-space can identify whether a fault when getting a
>   field should return EAGAIN rather than EFAULT.
> - Add compiler barrier between operations: Adding a compiler barrier
>   between store operations in a cpu_opv sequence can be useful when
>   paired with membarrier system call.
>
>   An algorithm with a paired slow path and fast path can use
>   sys_membarrier on the slow path to replace fast-path memory barriers
>   by compiler barrier.
>
>   Adding an explicit compiler barrier between operations allows
>   cpu_opv to be used as fallback for operations meant to match
>   the membarrier system call.
>
> Changes since v2:
>
> - Fix memory leak by introducing struct cpu_opv_pinned_pages.
>   Suggested by Boqun Feng.
> - Cast argument 1 passed to access_ok from integer to void __user *,
>   fixing sparse warning.
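The retry policy that the changelog describes for expected faults could be handled on the caller side roughly as below. This is a hedged sketch, not part of the patch: the sketch_ names are hypothetical, and it assumes the caller already holds the kernel-style result (0 on success, "op index + 1" when a comparison stops the vector, a negative errno value on error) rather than the -1/errno pair a libc wrapper would produce.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical outcome categories for a cpu_opv invocation. */
enum sketch_action {
	SKETCH_DONE,		/* every operation was executed */
	SKETCH_CMP_FAILED,	/* a comparison stopped the vector */
	SKETCH_RETRY,		/* migrated away, or an expected fault */
	SKETCH_FATAL,		/* unexpected error, e.g. -EFAULT or -EINVAL */
};

/*
 * Classify a kernel-style return value. When a comparison stopped the
 * vector and failed_op is non-NULL, store the 0-based index of that
 * operation in *failed_op.
 */
static enum sketch_action sketch_classify(long ret, int *failed_op)
{
	if (ret == 0)
		return SKETCH_DONE;
	if (ret > 0) {
		if (failed_op)
			*failed_op = (int)ret - 1;
		return SKETCH_CMP_FAILED;
	}
	if (ret == -EAGAIN)
		return SKETCH_RETRY;
	return SKETCH_FATAL;
}
```

With expect_fault set on the head->next address, a concurrent unmap surfaces as SKETCH_RETRY, and the caller would reload the head pointer before retrying; any other negative value stays in the "handle with extreme prejudice" category.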
> --- > MAINTAINERS | 7 + > include/uapi/linux/cpu_opv.h | 117 ++++++ > init/Kconfig | 14 + > kernel/Makefile | 1 + > kernel/cpu_opv.c | 968 +++++++++++++++++++++++++++++++++++++++++++ > kernel/sched/core.c | 37 ++ > kernel/sched/sched.h | 2 + > kernel/sys_ni.c | 1 + > 8 files changed, 1147 insertions(+) > create mode 100644 include/uapi/linux/cpu_opv.h > create mode 100644 kernel/cpu_opv.c > > diff --git a/MAINTAINERS b/MAINTAINERS > index c9f95f8b07ed..45a1bbdaa287 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -3675,6 +3675,13 @@ B: https://bugzilla.kernel.org > F: drivers/cpuidle/* > F: include/linux/cpuidle.h > > +CPU NON-PREEMPTIBLE OPERATION VECTOR SUPPORT > +M: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> > +L: linux-kernel@xxxxxxxxxxxxxxx > +S: Supported > +F: kernel/cpu_opv.c > +F: include/uapi/linux/cpu_opv.h > + > CRAMFS FILESYSTEM > W: http://sourceforge.net/projects/cramfs/ > S: Orphan / Obsolete > diff --git a/include/uapi/linux/cpu_opv.h b/include/uapi/linux/cpu_opv.h > new file mode 100644 > index 000000000000..17f7d46e053b > --- /dev/null > +++ b/include/uapi/linux/cpu_opv.h > @@ -0,0 +1,117 @@ > +#ifndef _UAPI_LINUX_CPU_OPV_H > +#define _UAPI_LINUX_CPU_OPV_H > + > +/* > + * linux/cpu_opv.h > + * > + * CPU preempt-off operation vector system call API > + * > + * Copyright (c) 2017 Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> > + * > + * Permission is hereby granted, free of charge, to any person obtaining a copy > + * of this software and associated documentation files (the "Software"), to deal > + * in the Software without restriction, including without limitation the rights > + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell > + * copies of the Software, and to permit persons to whom the Software is > + * furnished to do so, subject to the following conditions: > + * > + * The above copyright notice and this permission notice shall be included in > + * all copies or substantial portions of the 
Software. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE > + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, > + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > + > +#ifdef __KERNEL__ > +# include <linux/types.h> > +#else /* #ifdef __KERNEL__ */ > +# include <stdint.h> > +#endif /* #else #ifdef __KERNEL__ */ > + > +#include <asm/byteorder.h> > + > +#ifdef __LP64__ > +# define CPU_OP_FIELD_u32_u64(field) uint64_t field > +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) field = (intptr_t)v > +#elif defined(__BYTE_ORDER) ? \ > + __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN) > +# define CPU_OP_FIELD_u32_u64(field) uint32_t field ## _padding, field > +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) \ > + field ## _padding = 0, field = (intptr_t)v > +#else > +# define CPU_OP_FIELD_u32_u64(field) uint32_t field, field ## _padding > +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) \ > + field = (intptr_t)v, field ## _padding = 0 > +#endif > + > +#define CPU_OP_VEC_LEN_MAX 16 > +#define CPU_OP_ARG_LEN_MAX 24 > +/* Max. data len per operation. */ > +#define CPU_OP_DATA_LEN_MAX PAGE_SIZE > +/* > + * Max. data len for overall vector. We to restrict the amount of > + * user-space data touched by the kernel in non-preemptible context so > + * we do not introduce long scheduler latencies. > + * This allows one copy of up to 4096 bytes, and 15 operations touching > + * 8 bytes each. > + * This limit is applied to the sum of length specified for all > + * operations in a vector. > + */ > +#define CPU_OP_VEC_DATA_LEN_MAX (4096 + 15*8) > +#define CPU_OP_MAX_PAGES 4 /* Max. pages per op. 
*/ > + > +enum cpu_op_type { > + CPU_COMPARE_EQ_OP, /* compare */ > + CPU_COMPARE_NE_OP, /* compare */ > + CPU_MEMCPY_OP, /* memcpy */ > + CPU_ADD_OP, /* arithmetic */ > + CPU_OR_OP, /* bitwise */ > + CPU_AND_OP, /* bitwise */ > + CPU_XOR_OP, /* bitwise */ > + CPU_LSHIFT_OP, /* shift */ > + CPU_RSHIFT_OP, /* shift */ > + CPU_MB_OP, /* memory barrier */ > +}; > + > +/* Vector of operations to perform. Limited to 16. */ > +struct cpu_op { > + int32_t op; /* enum cpu_op_type. */ > + uint32_t len; /* data length, in bytes. */ > + union { > + struct { > + CPU_OP_FIELD_u32_u64(a); > + CPU_OP_FIELD_u32_u64(b); > + uint8_t expect_fault_a; > + uint8_t expect_fault_b; > + } compare_op; > + struct { > + CPU_OP_FIELD_u32_u64(dst); > + CPU_OP_FIELD_u32_u64(src); > + uint8_t expect_fault_dst; > + uint8_t expect_fault_src; > + } memcpy_op; > + struct { > + CPU_OP_FIELD_u32_u64(p); > + int64_t count; > + uint8_t expect_fault_p; > + } arithmetic_op; > + struct { > + CPU_OP_FIELD_u32_u64(p); > + uint64_t mask; > + uint8_t expect_fault_p; > + } bitwise_op; > + struct { > + CPU_OP_FIELD_u32_u64(p); > + uint32_t bits; > + uint8_t expect_fault_p; > + } shift_op; > + char __padding[CPU_OP_ARG_LEN_MAX]; > + } u; > +}; > + > +#endif /* _UAPI_LINUX_CPU_OPV_H */ > diff --git a/init/Kconfig b/init/Kconfig > index cbedfb91b40a..e4fbb5dd6a24 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -1404,6 +1404,7 @@ config RSEQ > bool "Enable rseq() system call" if EXPERT > default y > depends on HAVE_RSEQ > + select CPU_OPV > select MEMBARRIER > help > Enable the restartable sequences system call. It provides a > @@ -1414,6 +1415,19 @@ config RSEQ > > If unsure, say Y. > > +config CPU_OPV > + bool "Enable cpu_opv() system call" if EXPERT > + default y > + help > + Enable the CPU preempt-off operation vector system call. > + It allows user-space to perform a sequence of operations on > + per-cpu data with preemption disabled. 
Useful as > + single-stepping fall-back for restartable sequences, and for > + performing more complex operations on per-cpu data that would > + not be otherwise possible to do with restartable sequences. > + > + If unsure, say Y. > + > config EMBEDDED > bool "Embedded system" > option allnoconfig_y > diff --git a/kernel/Makefile b/kernel/Makefile > index 3574669dafd9..cac8855196ff 100644 > --- a/kernel/Makefile > +++ b/kernel/Makefile > @@ -113,6 +113,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o > > obj-$(CONFIG_HAS_IOMEM) += memremap.o > obj-$(CONFIG_RSEQ) += rseq.o > +obj-$(CONFIG_CPU_OPV) += cpu_opv.o > > $(obj)/configs.o: $(obj)/config_data.h > > diff --git a/kernel/cpu_opv.c b/kernel/cpu_opv.c > new file mode 100644 > index 000000000000..a81837a14b17 > --- /dev/null > +++ b/kernel/cpu_opv.c > @@ -0,0 +1,968 @@ > +/* > + * CPU preempt-off operation vector system call > + * > + * It allows user-space to perform a sequence of operations on per-cpu > + * data with preemption disabled. Useful as single-stepping fall-back > + * for restartable sequences, and for performing more complex operations > + * on per-cpu data that would not be otherwise possible to do with > + * restartable sequences. > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or > + * (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. 
> + * > + * Copyright (C) 2017, EfficiOS Inc., > + * Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> > + */ > + > +#include <linux/sched.h> > +#include <linux/uaccess.h> > +#include <linux/syscalls.h> > +#include <linux/cpu_opv.h> > +#include <linux/types.h> > +#include <linux/mutex.h> > +#include <linux/pagemap.h> > +#include <asm/ptrace.h> > +#include <asm/byteorder.h> > + > +#include "sched/sched.h" > + > +#define TMP_BUFLEN 64 > +#define NR_PINNED_PAGES_ON_STACK 8 > + > +union op_fn_data { > + uint8_t _u8; > + uint16_t _u16; > + uint32_t _u32; > + uint64_t _u64; > +#if (BITS_PER_LONG < 64) > + uint32_t _u64_split[2]; > +#endif > +}; > + > +struct cpu_opv_pinned_pages { > + struct page **pages; > + size_t nr; > + bool is_kmalloc; > +}; > + > +typedef int (*op_fn_t)(union op_fn_data *data, uint64_t v, uint32_t len); > + > +static DEFINE_MUTEX(cpu_opv_offline_lock); > + > +/* > + * The cpu_opv system call executes a vector of operations on behalf of > + * user-space on a specific CPU with preemption disabled. It is inspired > + * from readv() and writev() system calls which take a "struct iovec" > + * array as argument. > + * > + * The operations available are: comparison, memcpy, add, or, and, xor, > + * left shift, and right shift. The system call receives a CPU number > + * from user-space as argument, which is the CPU on which those > + * operations need to be performed. All preparation steps such as > + * loading pointers, and applying offsets to arrays, need to be > + * performed by user-space before invoking the system call. The > + * "comparison" operation can be used to check that the data used in the > + * preparation step did not change between preparation of system call > + * inputs and operation execution within the preempt-off critical > + * section. 
> + * > + * The reason why we require all pointer offsets to be calculated by > + * user-space beforehand is because we need to use get_user_pages_fast() > + * to first pin all pages touched by each operation. This takes care of > + * faulting-in the pages. Then, preemption is disabled, and the > + * operations are performed atomically with respect to other thread > + * execution on that CPU, without generating any page fault. > + * > + * A maximum limit of 16 operations per cpu_opv syscall invocation is > + * enforced, and a overall maximum length sum, so user-space cannot > + * generate a too long preempt-off critical section. Each operation is > + * also limited a length of PAGE_SIZE bytes, meaning that an operation > + * can touch a maximum of 4 pages (memcpy: 2 pages for source, 2 pages > + * for destination if addresses are not aligned on page boundaries). > + * > + * If the thread is not running on the requested CPU, a new > + * push_task_to_cpu() is invoked to migrate the task to the requested > + * CPU. If the requested CPU is not part of the cpus allowed mask of > + * the thread, the system call fails with EINVAL. After the migration > + * has been performed, preemption is disabled, and the current CPU > + * number is checked again and compared to the requested CPU number. If > + * it still differs, it means the scheduler migrated us away from that > + * CPU. Return EAGAIN to user-space in that case, and let user-space > + * retry (either requesting the same CPU number, or a different one, > + * depending on the user-space algorithm constraints). > + */ > + > +/* > + * Check operation types and length parameters. 
> + */ > +static int cpu_opv_check(struct cpu_op *cpuop, int cpuopcnt) > +{ > + int i; > + uint32_t sum = 0; > + > + for (i = 0; i < cpuopcnt; i++) { > + struct cpu_op *op = &cpuop[i]; > + > + switch (op->op) { > + case CPU_MB_OP: > + break; > + default: > + sum += op->len; > + } > + switch (op->op) { > + case CPU_COMPARE_EQ_OP: > + case CPU_COMPARE_NE_OP: > + case CPU_MEMCPY_OP: > + if (op->len > CPU_OP_DATA_LEN_MAX) > + return -EINVAL; > + break; > + case CPU_ADD_OP: > + case CPU_OR_OP: > + case CPU_AND_OP: > + case CPU_XOR_OP: > + switch (op->len) { > + case 1: > + case 2: > + case 4: > + case 8: > + break; > + default: > + return -EINVAL; > + } > + break; > + case CPU_LSHIFT_OP: > + case CPU_RSHIFT_OP: > + switch (op->len) { > + case 1: > + if (op->u.shift_op.bits > 7) > + return -EINVAL; > + break; > + case 2: > + if (op->u.shift_op.bits > 15) > + return -EINVAL; > + break; > + case 4: > + if (op->u.shift_op.bits > 31) > + return -EINVAL; > + break; > + case 8: > + if (op->u.shift_op.bits > 63) > + return -EINVAL; > + break; > + default: > + return -EINVAL; > + } > + break; > + case CPU_MB_OP: > + break; > + default: > + return -EINVAL; > + } > + } > + if (sum > CPU_OP_VEC_DATA_LEN_MAX) > + return -EINVAL; > + return 0; > +} > + > +static unsigned long cpu_op_range_nr_pages(unsigned long addr, > + unsigned long len) > +{ > + return ((addr + len - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1; > +} > + > +static int cpu_op_check_page(struct page *page) > +{ > + struct address_space *mapping; > + > + if (is_zone_device_page(page)) > + return -EFAULT; > + page = compound_head(page); > + mapping = READ_ONCE(page->mapping); > + if (!mapping) { > + int shmem_swizzled; > + > + /* > + * Check again with page lock held to guard against > + * memory pressure making shmem_writepage move the page > + * from filecache to swapcache. 
> + */ > + lock_page(page); > + shmem_swizzled = PageSwapCache(page) || page->mapping; > + unlock_page(page); > + if (shmem_swizzled) > + return -EAGAIN; > + return -EFAULT; > + } > + return 0; > +} > + > +/* > + * Refusing device pages, the zero page, pages in the gate area, and > + * special mappings. Inspired from futex.c checks. > + */ > +static int cpu_op_check_pages(struct page **pages, > + unsigned long nr_pages) > +{ > + unsigned long i; > + > + for (i = 0; i < nr_pages; i++) { > + int ret; > + > + ret = cpu_op_check_page(pages[i]); > + if (ret) > + return ret; > + } > + return 0; > +} > + > +static int cpu_op_pin_pages(unsigned long addr, unsigned long len, > + struct cpu_opv_pinned_pages *pin_pages, int write) > +{ > + struct page *pages[2]; > + int ret, nr_pages; > + > + if (!len) > + return 0; > + nr_pages = cpu_op_range_nr_pages(addr, len); > + BUG_ON(nr_pages > 2); > + if (!pin_pages->is_kmalloc && pin_pages->nr + nr_pages > + > NR_PINNED_PAGES_ON_STACK) { > + struct page **pinned_pages = > + kzalloc(CPU_OP_VEC_LEN_MAX * CPU_OP_MAX_PAGES > + * sizeof(struct page *), GFP_KERNEL); > + if (!pinned_pages) > + return -ENOMEM; > + memcpy(pinned_pages, pin_pages->pages, > + pin_pages->nr * sizeof(struct page *)); > + pin_pages->pages = pinned_pages; > + pin_pages->is_kmalloc = true; > + } > +again: > + ret = get_user_pages_fast(addr, nr_pages, write, pages); > + if (ret < nr_pages) { > + if (ret > 0) > + put_page(pages[0]); > + return -EFAULT; > + } > + /* > + * Refuse device pages, the zero page, pages in the gate area, > + * and special mappings. 
> + */ > + ret = cpu_op_check_pages(pages, nr_pages); > + if (ret == -EAGAIN) { > + put_page(pages[0]); > + if (nr_pages > 1) > + put_page(pages[1]); > + goto again; > + } > + if (ret) > + goto error; > + pin_pages->pages[pin_pages->nr++] = pages[0]; > + if (nr_pages > 1) > + pin_pages->pages[pin_pages->nr++] = pages[1]; > + return 0; > + > +error: > + put_page(pages[0]); > + if (nr_pages > 1) > + put_page(pages[1]); > + return -EFAULT; > +} > + > +static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt, > + struct cpu_opv_pinned_pages *pin_pages) > +{ > + int ret, i; > + bool expect_fault = false; > + > + /* Check access, pin pages. */ > + for (i = 0; i < cpuopcnt; i++) { > + struct cpu_op *op = &cpuop[i]; > + > + switch (op->op) { > + case CPU_COMPARE_EQ_OP: > + case CPU_COMPARE_NE_OP: > + ret = -EFAULT; > + expect_fault = op->u.compare_op.expect_fault_a; > + if (!access_ok(VERIFY_READ, > + (void __user *)op->u.compare_op.a, > + op->len)) > + goto error; > + ret = cpu_op_pin_pages( > + (unsigned long)op->u.compare_op.a, > + op->len, pin_pages, 0); > + if (ret) > + goto error; > + ret = -EFAULT; > + expect_fault = op->u.compare_op.expect_fault_b; > + if (!access_ok(VERIFY_READ, > + (void __user *)op->u.compare_op.b, > + op->len)) > + goto error; > + ret = cpu_op_pin_pages( > + (unsigned long)op->u.compare_op.b, > + op->len, pin_pages, 0); > + if (ret) > + goto error; > + break; > + case CPU_MEMCPY_OP: > + ret = -EFAULT; > + expect_fault = op->u.memcpy_op.expect_fault_dst; > + if (!access_ok(VERIFY_WRITE, > + (void __user *)op->u.memcpy_op.dst, > + op->len)) > + goto error; > + ret = cpu_op_pin_pages( > + (unsigned long)op->u.memcpy_op.dst, > + op->len, pin_pages, 1); > + if (ret) > + goto error; > + ret = -EFAULT; > + expect_fault = op->u.memcpy_op.expect_fault_src; > + if (!access_ok(VERIFY_READ, > + (void __user *)op->u.memcpy_op.src, > + op->len)) > + goto error; > + ret = cpu_op_pin_pages( > + (unsigned long)op->u.memcpy_op.src, > + op->len, pin_pages, 
0); > + if (ret) > + goto error; > + break; > + case CPU_ADD_OP: > + ret = -EFAULT; > + expect_fault = op->u.arithmetic_op.expect_fault_p; > + if (!access_ok(VERIFY_WRITE, > + (void __user *)op->u.arithmetic_op.p, > + op->len)) > + goto error; > + ret = cpu_op_pin_pages( > + (unsigned long)op->u.arithmetic_op.p, > + op->len, pin_pages, 1); > + if (ret) > + goto error; > + break; > + case CPU_OR_OP: > + case CPU_AND_OP: > + case CPU_XOR_OP: > + ret = -EFAULT; > + expect_fault = op->u.bitwise_op.expect_fault_p; > + if (!access_ok(VERIFY_WRITE, > + (void __user *)op->u.bitwise_op.p, > + op->len)) > + goto error; > + ret = cpu_op_pin_pages( > + (unsigned long)op->u.bitwise_op.p, > + op->len, pin_pages, 1); > + if (ret) > + goto error; > + break; > + case CPU_LSHIFT_OP: > + case CPU_RSHIFT_OP: > + ret = -EFAULT; > + expect_fault = op->u.shift_op.expect_fault_p; > + if (!access_ok(VERIFY_WRITE, > + (void __user *)op->u.shift_op.p, > + op->len)) > + goto error; > + ret = cpu_op_pin_pages( > + (unsigned long)op->u.shift_op.p, > + op->len, pin_pages, 1); > + if (ret) > + goto error; > + break; > + case CPU_MB_OP: > + break; > + default: > + return -EINVAL; > + } > + } > + return 0; > + > +error: > + for (i = 0; i < pin_pages->nr; i++) > + put_page(pin_pages->pages[i]); > + pin_pages->nr = 0; > + /* > + * If faulting access is expected, return EAGAIN to user-space. > + * It allows user-space to distinguish between a fault caused by > + * an access which is expect to fault (e.g. due to concurrent > + * unmapping of underlying memory) from an unexpected fault from > + * which a retry would not recover. > + */ > + if (ret == -EFAULT && expect_fault) > + return -EAGAIN; > + return ret; > +} > + > +/* Return 0 if same, > 0 if different, < 0 on error. 
*/ > +static int do_cpu_op_compare_iter(void __user *a, void __user *b, uint32_t len) > +{ > + char bufa[TMP_BUFLEN], bufb[TMP_BUFLEN]; > + uint32_t compared = 0; > + > + while (compared != len) { > + unsigned long to_compare; > + > + to_compare = min_t(uint32_t, TMP_BUFLEN, len - compared); > + if (__copy_from_user_inatomic(bufa, a + compared, to_compare)) > + return -EFAULT; > + if (__copy_from_user_inatomic(bufb, b + compared, to_compare)) > + return -EFAULT; > + if (memcmp(bufa, bufb, to_compare)) > + return 1; /* different */ > + compared += to_compare; > + } > + return 0; /* same */ > +} > + > +/* Return 0 if same, > 0 if different, < 0 on error. */ > +static int do_cpu_op_compare(void __user *a, void __user *b, uint32_t len) > +{ > + int ret = -EFAULT; > + union { > + uint8_t _u8; > + uint16_t _u16; > + uint32_t _u32; > + uint64_t _u64; > +#if (BITS_PER_LONG < 64) > + uint32_t _u64_split[2]; > +#endif > + } tmp[2]; > + > + pagefault_disable(); > + switch (len) { > + case 1: > + if (__get_user(tmp[0]._u8, (uint8_t __user *)a)) > + goto end; > + if (__get_user(tmp[1]._u8, (uint8_t __user *)b)) > + goto end; > + ret = !!(tmp[0]._u8 != tmp[1]._u8); > + break; > + case 2: > + if (__get_user(tmp[0]._u16, (uint16_t __user *)a)) > + goto end; > + if (__get_user(tmp[1]._u16, (uint16_t __user *)b)) > + goto end; > + ret = !!(tmp[0]._u16 != tmp[1]._u16); > + break; > + case 4: > + if (__get_user(tmp[0]._u32, (uint32_t __user *)a)) > + goto end; > + if (__get_user(tmp[1]._u32, (uint32_t __user *)b)) > + goto end; > + ret = !!(tmp[0]._u32 != tmp[1]._u32); > + break; > + case 8: > +#if (BITS_PER_LONG >= 64) > + if (__get_user(tmp[0]._u64, (uint64_t __user *)a)) > + goto end; > + if (__get_user(tmp[1]._u64, (uint64_t __user *)b)) > + goto end; > +#else > + if (__get_user(tmp[0]._u64_split[0], (uint32_t __user *)a)) > + goto end; > + if (__get_user(tmp[0]._u64_split[1], (uint32_t __user *)a + 1)) > + goto end; > + if (__get_user(tmp[1]._u64_split[0], (uint32_t __user *)b)) 
> + goto end; > + if (__get_user(tmp[1]._u64_split[1], (uint32_t __user *)b + 1)) > + goto end; > +#endif > + ret = !!(tmp[0]._u64 != tmp[1]._u64); > + break; > + default: > + pagefault_enable(); > + return do_cpu_op_compare_iter(a, b, len); > + } > +end: > + pagefault_enable(); > + return ret; > +} > + > +/* Return 0 on success, < 0 on error. */ > +static int do_cpu_op_memcpy_iter(void __user *dst, void __user *src, > + uint32_t len) > +{ > + char buf[TMP_BUFLEN]; > + uint32_t copied = 0; > + > + while (copied != len) { > + unsigned long to_copy; > + > + to_copy = min_t(uint32_t, TMP_BUFLEN, len - copied); > + if (__copy_from_user_inatomic(buf, src + copied, to_copy)) > + return -EFAULT; > + if (__copy_to_user_inatomic(dst + copied, buf, to_copy)) > + return -EFAULT; > + copied += to_copy; > + } > + return 0; > +} > + > +/* Return 0 on success, < 0 on error. */ > +static int do_cpu_op_memcpy(void __user *dst, void __user *src, uint32_t len) > +{ > + int ret = -EFAULT; > + union { > + uint8_t _u8; > + uint16_t _u16; > + uint32_t _u32; > + uint64_t _u64; > +#if (BITS_PER_LONG < 64) > + uint32_t _u64_split[2]; > +#endif > + } tmp; > + > + pagefault_disable(); > + switch (len) { > + case 1: > + if (__get_user(tmp._u8, (uint8_t __user *)src)) > + goto end; > + if (__put_user(tmp._u8, (uint8_t __user *)dst)) > + goto end; > + break; > + case 2: > + if (__get_user(tmp._u16, (uint16_t __user *)src)) > + goto end; > + if (__put_user(tmp._u16, (uint16_t __user *)dst)) > + goto end; > + break; > + case 4: > + if (__get_user(tmp._u32, (uint32_t __user *)src)) > + goto end; > + if (__put_user(tmp._u32, (uint32_t __user *)dst)) > + goto end; > + break; > + case 8: > +#if (BITS_PER_LONG >= 64) > + if (__get_user(tmp._u64, (uint64_t __user *)src)) > + goto end; > + if (__put_user(tmp._u64, (uint64_t __user *)dst)) > + goto end; > +#else > + if (__get_user(tmp._u64_split[0], (uint32_t __user *)src)) > + goto end; > + if (__get_user(tmp._u64_split[1], (uint32_t __user *)src + 1)) > 
+			goto end;
> +		if (__put_user(tmp._u64_split[0], (uint32_t __user *)dst))
> +			goto end;
> +		if (__put_user(tmp._u64_split[1], (uint32_t __user *)dst + 1))
> +			goto end;
> +#endif
> +		break;
> +	default:
> +		pagefault_enable();
> +		return do_cpu_op_memcpy_iter(dst, src, len);
> +	}
> +	ret = 0;
> +end:
> +	pagefault_enable();
> +	return ret;
> +}
> +
> +static int op_add_fn(union op_fn_data *data, uint64_t count, uint32_t len)
> +{
> +	int ret = 0;
> +
> +	switch (len) {
> +	case 1:
> +		data->_u8 += (uint8_t)count;
> +		break;
> +	case 2:
> +		data->_u16 += (uint16_t)count;
> +		break;
> +	case 4:
> +		data->_u32 += (uint32_t)count;
> +		break;
> +	case 8:
> +		data->_u64 += (uint64_t)count;
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static int op_or_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
> +{
> +	int ret = 0;
> +
> +	switch (len) {
> +	case 1:
> +		data->_u8 |= (uint8_t)mask;
> +		break;
> +	case 2:
> +		data->_u16 |= (uint16_t)mask;
> +		break;
> +	case 4:
> +		data->_u32 |= (uint32_t)mask;
> +		break;
> +	case 8:
> +		data->_u64 |= (uint64_t)mask;
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static int op_and_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
> +{
> +	int ret = 0;
> +
> +	switch (len) {
> +	case 1:
> +		data->_u8 &= (uint8_t)mask;
> +		break;
> +	case 2:
> +		data->_u16 &= (uint16_t)mask;
> +		break;
> +	case 4:
> +		data->_u32 &= (uint32_t)mask;
> +		break;
> +	case 8:
> +		data->_u64 &= (uint64_t)mask;
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static int op_xor_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
> +{
> +	int ret = 0;
> +
> +	switch (len) {
> +	case 1:
> +		data->_u8 ^= (uint8_t)mask;
> +		break;
> +	case 2:
> +		data->_u16 ^= (uint16_t)mask;
> +		break;
> +	case 4:
> +		data->_u32 ^= (uint32_t)mask;
> +		break;
> +	case 8:
> +		data->_u64 ^= (uint64_t)mask;
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static int op_lshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len)
> +{
> +	int ret = 0;
> +
> +	switch (len) {
> +	case 1:
> +		data->_u8 <<= (uint8_t)bits;
> +		break;
> +	case 2:
> +		data->_u16 <<= (uint16_t)bits;
> +		break;
> +	case 4:
> +		data->_u32 <<= (uint32_t)bits;
> +		break;
> +	case 8:
> +		data->_u64 <<= (uint64_t)bits;
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static int op_rshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len)
> +{
> +	int ret = 0;
> +
> +	switch (len) {
> +	case 1:
> +		data->_u8 >>= (uint8_t)bits;
> +		break;
> +	case 2:
> +		data->_u16 >>= (uint16_t)bits;
> +		break;
> +	case 4:
> +		data->_u32 >>= (uint32_t)bits;
> +		break;
> +	case 8:
> +		data->_u64 >>= (uint64_t)bits;
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	return ret;
> +}
> +
> +/* Return 0 on success, < 0 on error. */
> +static int do_cpu_op_fn(op_fn_t op_fn, void __user *p, uint64_t v,
> +		uint32_t len)
> +{
> +	int ret = -EFAULT;
> +	union op_fn_data tmp;
> +
> +	pagefault_disable();
> +	switch (len) {
> +	case 1:
> +		if (__get_user(tmp._u8, (uint8_t __user *)p))
> +			goto end;
> +		if (op_fn(&tmp, v, len))
> +			goto end;
> +		if (__put_user(tmp._u8, (uint8_t __user *)p))
> +			goto end;
> +		break;
> +	case 2:
> +		if (__get_user(tmp._u16, (uint16_t __user *)p))
> +			goto end;
> +		if (op_fn(&tmp, v, len))
> +			goto end;
> +		if (__put_user(tmp._u16, (uint16_t __user *)p))
> +			goto end;
> +		break;
> +	case 4:
> +		if (__get_user(tmp._u32, (uint32_t __user *)p))
> +			goto end;
> +		if (op_fn(&tmp, v, len))
> +			goto end;
> +		if (__put_user(tmp._u32, (uint32_t __user *)p))
> +			goto end;
> +		break;
> +	case 8:
> +#if (BITS_PER_LONG >= 64)
> +		if (__get_user(tmp._u64, (uint64_t __user *)p))
> +			goto end;
> +#else
> +		if (__get_user(tmp._u64_split[0], (uint32_t __user *)p))
> +			goto end;
> +		if (__get_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
> +			goto end;
> +#endif
> +		if (op_fn(&tmp, v, len))
> +			goto end;
> +#if (BITS_PER_LONG >= 64)
> +		if (__put_user(tmp._u64, (uint64_t __user *)p))
> +			goto end;
> +#else
> +		if (__put_user(tmp._u64_split[0], (uint32_t __user *)p))
> +			goto end;
> +		if (__put_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
> +			goto end;
> +#endif
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		goto end;
> +	}
> +	ret = 0;
> +end:
> +	pagefault_enable();
> +	return ret;
> +}
> +
> +static int __do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt)
> +{
> +	int i, ret;
> +
> +	for (i = 0; i < cpuopcnt; i++) {
> +		struct cpu_op *op = &cpuop[i];
> +
> +		/* Guarantee a compiler barrier between each operation. */
> +		barrier();
> +
> +		switch (op->op) {
> +		case CPU_COMPARE_EQ_OP:
> +			ret = do_cpu_op_compare(
> +					(void __user *)op->u.compare_op.a,
> +					(void __user *)op->u.compare_op.b,
> +					op->len);
> +			/* Stop execution on error. */
> +			if (ret < 0)
> +				return ret;
> +			/*
> +			 * Stop execution, return op index + 1 if comparison
> +			 * differs.
> +			 */
> +			if (ret > 0)
> +				return i + 1;
> +			break;
> +		case CPU_COMPARE_NE_OP:
> +			ret = do_cpu_op_compare(
> +					(void __user *)op->u.compare_op.a,
> +					(void __user *)op->u.compare_op.b,
> +					op->len);
> +			/* Stop execution on error. */
> +			if (ret < 0)
> +				return ret;
> +			/*
> +			 * Stop execution, return op index + 1 if comparison
> +			 * is identical.
> +			 */
> +			if (ret == 0)
> +				return i + 1;
> +			break;
> +		case CPU_MEMCPY_OP:
> +			ret = do_cpu_op_memcpy(
> +					(void __user *)op->u.memcpy_op.dst,
> +					(void __user *)op->u.memcpy_op.src,
> +					op->len);
> +			/* Stop execution on error. */
> +			if (ret)
> +				return ret;
> +			break;
> +		case CPU_ADD_OP:
> +			ret = do_cpu_op_fn(op_add_fn,
> +					(void __user *)op->u.arithmetic_op.p,
> +					op->u.arithmetic_op.count, op->len);
> +			/* Stop execution on error. */
> +			if (ret)
> +				return ret;
> +			break;
> +		case CPU_OR_OP:
> +			ret = do_cpu_op_fn(op_or_fn,
> +					(void __user *)op->u.bitwise_op.p,
> +					op->u.bitwise_op.mask, op->len);
> +			/* Stop execution on error. */
> +			if (ret)
> +				return ret;
> +			break;
> +		case CPU_AND_OP:
> +			ret = do_cpu_op_fn(op_and_fn,
> +					(void __user *)op->u.bitwise_op.p,
> +					op->u.bitwise_op.mask, op->len);
> +			/* Stop execution on error. */
> +			if (ret)
> +				return ret;
> +			break;
> +		case CPU_XOR_OP:
> +			ret = do_cpu_op_fn(op_xor_fn,
> +					(void __user *)op->u.bitwise_op.p,
> +					op->u.bitwise_op.mask, op->len);
> +			/* Stop execution on error. */
> +			if (ret)
> +				return ret;
> +			break;
> +		case CPU_LSHIFT_OP:
> +			ret = do_cpu_op_fn(op_lshift_fn,
> +					(void __user *)op->u.shift_op.p,
> +					op->u.shift_op.bits, op->len);
> +			/* Stop execution on error. */
> +			if (ret)
> +				return ret;
> +			break;
> +		case CPU_RSHIFT_OP:
> +			ret = do_cpu_op_fn(op_rshift_fn,
> +					(void __user *)op->u.shift_op.p,
> +					op->u.shift_op.bits, op->len);
> +			/* Stop execution on error. */
> +			if (ret)
> +				return ret;
> +			break;
> +		case CPU_MB_OP:
> +			smp_mb();
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +	return 0;
> +}
> +
> +static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt, int cpu)
> +{
> +	int ret;
> +
> +	if (cpu != raw_smp_processor_id()) {
> +		ret = push_task_to_cpu(current, cpu);
> +		if (ret)
> +			goto check_online;
> +	}
> +	preempt_disable();
> +	if (cpu != smp_processor_id()) {
> +		ret = -EAGAIN;
> +		goto end;
> +	}
> +	ret = __do_cpu_opv(cpuop, cpuopcnt);
> +end:
> +	preempt_enable();
> +	return ret;
> +
> +check_online:
> +	if (!cpu_possible(cpu))
> +		return -EINVAL;
> +	get_online_cpus();
> +	if (cpu_online(cpu)) {
> +		ret = -EAGAIN;
> +		goto put_online_cpus;
> +	}
> +	/*
> +	 * CPU is offline. Perform operation from the current CPU with
> +	 * cpu_online read lock held, preventing that CPU from coming online,
> +	 * and with mutex held, providing mutual exclusion against other
> +	 * CPUs also finding out about an offline CPU.
> +	 */
> +	mutex_lock(&cpu_opv_offline_lock);
> +	ret = __do_cpu_opv(cpuop, cpuopcnt);
> +	mutex_unlock(&cpu_opv_offline_lock);
> +put_online_cpus:
> +	put_online_cpus();
> +	return ret;
> +}
> +
> +/*
> + * cpu_opv - execute operation vector on a given CPU with preempt off.
> + *
> + * Userspace should pass current CPU number as parameter. May fail with
> + * -EAGAIN if currently executing on the wrong CPU.
> + */
> +SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
> +		int, cpu, int, flags)
> +{
> +	struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX];
> +	struct page *pinned_pages_on_stack[NR_PINNED_PAGES_ON_STACK];
> +	struct cpu_opv_pinned_pages pin_pages = {
> +		.pages = pinned_pages_on_stack,
> +		.nr = 0,
> +		.is_kmalloc = false,
> +	};
> +	int ret, i;
> +
> +	if (unlikely(flags))
> +		return -EINVAL;
> +	if (unlikely(cpu < 0))
> +		return -EINVAL;
> +	if (cpuopcnt < 0 || cpuopcnt > CPU_OP_VEC_LEN_MAX)
> +		return -EINVAL;
> +	if (copy_from_user(cpuopv, ucpuopv, cpuopcnt * sizeof(struct cpu_op)))
> +		return -EFAULT;
> +	ret = cpu_opv_check(cpuopv, cpuopcnt);
> +	if (ret)
> +		return ret;
> +	ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &pin_pages);
> +	if (ret)
> +		goto end;
> +	ret = do_cpu_opv(cpuopv, cpuopcnt, cpu);
> +	for (i = 0; i < pin_pages.nr; i++)
> +		put_page(pin_pages.pages[i]);
> +end:
> +	if (pin_pages.is_kmalloc)
> +		kfree(pin_pages.pages);
> +	return ret;
> +}
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6bba05f47e51..e547f93a46c2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1052,6 +1052,43 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
>  		set_curr_task(rq, p);
>  }
>
> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
> +{
> +	struct rq_flags rf;
> +	struct rq *rq;
> +	int ret = 0;
> +
> +	rq = task_rq_lock(p, &rf);
> +	update_rq_clock(rq);
> +
> +	if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (task_cpu(p) == dest_cpu)
> +		goto out;
> +
> +	if (task_running(rq, p) || p->state == TASK_WAKING) {
> +		struct migration_arg arg = { p, dest_cpu };
> +		/* Need help from migration thread: drop lock and wait. */
> +		task_rq_unlock(rq, p, &rf);
> +		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
> +		tlb_migrate_finish(p->mm);
> +		return 0;
> +	} else if (task_on_rq_queued(p)) {
> +		/*
> +		 * OK, since we're going to drop the lock immediately
> +		 * afterwards anyway.
> +		 */
> +		rq = move_queued_task(rq, &rf, p, dest_cpu);
> +	}
> +out:
> +	task_rq_unlock(rq, p, &rf);
> +
> +	return ret;
> +}
> +
>  /*
>   * Change a given task's CPU affinity. Migrate the thread to a
>   * proper CPU and schedule it away if the CPU it's executing on
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3b448ba82225..cab256c1720a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1209,6 +1209,8 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
>  #endif
>  }
>
> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);
> +
>  /*
>   * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
>   */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index bfa1ee1bf669..59e622296dc3 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -262,3 +262,4 @@ cond_syscall(sys_pkey_free);
>
>  /* restartable sequence */
>  cond_syscall(sys_rseq);
> +cond_syscall(sys_cpu_opv);
> --
> 2.11.0
>
>
>

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
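
P.S. For anyone reading the quoted __do_cpu_opv() loop with a man page in mind: as far as I can tell from the patch, the return convention is 0 when every operation ran, "op index + 1" when a comparison stopped the vector, and a negative errno on error. Below is a minimal user-space sketch of that convention only. It is not the kernel code: the names model_op, model_op_type and model_run are hypothetical, it handles just three operation kinds on 64-bit values, and it does no page pinning, CPU migration, or preemption control.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; these names are not part of the patch. */
enum model_op_type { MODEL_COMPARE_EQ, MODEL_COMPARE_NE, MODEL_ADD };

struct model_op {
	enum model_op_type type;
	uint64_t *a;		/* target, or left operand of a comparison */
	const uint64_t *b;	/* right operand of a comparison, else unused */
	uint64_t count;		/* addend for MODEL_ADD, else unused */
};

/*
 * Run the vector in order. Returns 0 if every operation ran, i + 1 if
 * the comparison at index i stopped the vector, or -EINVAL for an
 * unknown operation type. Operations after the stopping point are not
 * executed.
 */
static int model_run(struct model_op *ops, int n)
{
	assert(ops != NULL || n == 0);

	for (int i = 0; i < n; i++) {
		struct model_op *op = &ops[i];

		switch (op->type) {
		case MODEL_COMPARE_EQ:
			/* Stop if the two values differ. */
			if (*op->a != *op->b)
				return i + 1;
			break;
		case MODEL_COMPARE_NE:
			/* Stop if the two values are identical. */
			if (*op->a == *op->b)
				return i + 1;
			break;
		case MODEL_ADD:
			*op->a += op->count;
			break;
		default:
			return -EINVAL;
		}
	}
	return 0;
}
```

A caller would typically put a compare-equal first to check that a value read during preparation has not changed, followed by the update; a positive return then tells it which comparison failed so it can reload and retry.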