----- On Nov 15, 2017, at 2:44 AM, Michael Kerrisk mtk.manpages@xxxxxxxxx wrote: > Hi Mathieu > > On 14 November 2017 at 21:03, Mathieu Desnoyers > <mathieu.desnoyers@xxxxxxxxxxxx> wrote: >> This new cpu_opv system call executes a vector of operations on behalf >> of user-space on a specific CPU with preemption disabled. It is inspired >> by the readv() and writev() system calls, which take a "struct iovec" array >> as argument. > > Do you have a man page for this syscall already? Hi Michael, It's the next thing on my roadmap when the syscall reaches mainline. That, and updates to the membarrier commands man pages. Thanks, Mathieu > > Thanks, > > Michael > > >> The operations available are: comparison, memcpy, add, or, and, xor, >> left shift, right shift, and mb. The system call receives a CPU number >> from user-space as argument, which is the CPU on which those operations >> need to be performed. All preparation steps such as loading pointers, >> and applying offsets to arrays, need to be performed by user-space >> before invoking the system call. The "comparison" operation can be used >> to check that the data used in the preparation step did not change >> between preparation of system call inputs and operation execution within >> the preempt-off critical section. >> >> The reason why we require all pointer offsets to be calculated by >> user-space beforehand is because we need to use get_user_pages_fast() to >> first pin all pages touched by each operation. This takes care of >> faulting-in the pages. Then, preemption is disabled, and the operations >> are performed atomically with respect to other thread execution on that >> CPU, without generating any page fault. >> >> A maximum limit of 16 operations per cpu_opv syscall invocation is >> enforced, so that user-space cannot generate an overly long preempt-off critical >> section. Each operation is also limited to a length of PAGE_SIZE bytes, >> meaning that an operation can touch a maximum of 4 pages (memcpy: 2 >> pages for source, 2 pages for destination if addresses are not aligned >> on page boundaries). Moreover, a total limit of 4216 bytes is applied >> to the sum of all operation lengths. >> >> If the thread is not running on the requested CPU, a new >> push_task_to_cpu() is invoked to migrate the task to the requested CPU. >> If the requested CPU is not part of the cpus allowed mask of the thread, >> the system call fails with EINVAL. After the migration has been >> performed, preemption is disabled, and the current CPU number is checked >> again and compared to the requested CPU number. If it still differs, it >> means the scheduler migrated us away from that CPU. Return EAGAIN to >> user-space in that case, and let user-space retry (either requesting the >> same CPU number, or a different one, depending on the user-space >> algorithm constraints). >> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> >> CC: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx> >> CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx> >> CC: Paul Turner <pjt@xxxxxxxxxx> >> CC: Thomas Gleixner <tglx@xxxxxxxxxxxxx> >> CC: Andrew Hunter <ahh@xxxxxxxxxx> >> CC: Andy Lutomirski <luto@xxxxxxxxxxxxxx> >> CC: Andi Kleen <andi@xxxxxxxxxxxxxx> >> CC: Dave Watson <davejwatson@xxxxxx> >> CC: Chris Lameter <cl@xxxxxxxxx> >> CC: Ingo Molnar <mingo@xxxxxxxxxx> >> CC: "H.
Peter Anvin" <hpa@xxxxxxxxx> >> CC: Ben Maurer <bmaurer@xxxxxx> >> CC: Steven Rostedt <rostedt@xxxxxxxxxxx> >> CC: Josh Triplett <josh@xxxxxxxxxxxxxxxx> >> CC: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> >> CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> >> CC: Russell King <linux@xxxxxxxxxxxxxxxx> >> CC: Catalin Marinas <catalin.marinas@xxxxxxx> >> CC: Will Deacon <will.deacon@xxxxxxx> >> CC: Michael Kerrisk <mtk.manpages@xxxxxxxxx> >> CC: Boqun Feng <boqun.feng@xxxxxxxxx> >> CC: linux-api@xxxxxxxxxxxxxxx >> --- >> >> Changes since v1: >> - handle CPU hotplug, >> - cleanup implementation using function pointers: We can use function >> pointers to implement the operations rather than duplicating all the >> user-access code. >> - refuse device pages: Performing cpu_opv operations on io map'd pages >> with preemption disabled could generate long preempt-off critical >> sections, which leads to unwanted scheduler latency. Return EFAULT if >> a device page is received as parameter >> - restrict op vector to 4216 bytes length sum: Restrict the operation >> vector to length sum of: >> - 4096 bytes (typical page size on most architectures, should be >> enough for a string, or structures) >> - 15 * 8 bytes (typical operations on integers or pointers). >> The goal here is to keep the duration of preempt off critical section >> short, so we don't add significant scheduler latency. >> - Add INIT_ONSTACK macro: Introduce the >> CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users >> correctly initialize the upper bits of CPU_OP_FIELD_u32_u64() on their >> stack to 0 on 32-bit architectures. >> - Add CPU_MB_OP operation: >> Use-cases with: >> - two consecutive stores, >> - a mempcy followed by a store, >> require a memory barrier before the final store operation. A typical >> use-case is a store-release on the final store. Given that this is a >> slow path, just providing an explicit full barrier instruction should >> be sufficient. >> - Add expect fault field: >> The use-case of list_pop brings interesting challenges. With rseq, we >> can use rseq_cmpnev_storeoffp_load(), and therefore load a pointer, >> compare it against NULL, add an offset, and load the target "next" >> pointer from the object, all within a single req critical section. >> >> Life is not so easy for cpu_opv in this use-case, mainly because we >> need to pin all pages we are going to touch in the preempt-off >> critical section beforehand. So we need to know the target object (in >> which we apply an offset to fetch the next pointer) when we pin pages >> before disabling preemption. >> >> So the approach is to load the head pointer and compare it against >> NULL in user-space, before doing the cpu_opv syscall. User-space can >> then compute the address of the head->next field, *without loading it*. >> >> The cpu_opv system call will first need to pin all pages associated >> with input data. This includes the page backing the head->next object, >> which may have been concurrently deallocated and unmapped. Therefore, >> in this case, getting -EFAULT when trying to pin those pages may >> happen: it just means they have been concurrently unmapped. This is >> an expected situation, and should just return -EAGAIN to user-space, >> to user-space can distinguish between "should retry" type of >> situations and actual errors that should be handled with extreme >> prejudice to the program (e.g. abort()). 
>> >> Therefore, add "expect_fault" fields along with op input address >> pointers, so user-space can identify whether a fault when getting a >> field should return EAGAIN rather than EFAULT. >> - Add compiler barrier between operations: Adding a compiler barrier >> between store operations in a cpu_opv sequence can be useful when >> paired with membarrier system call. >> >> An algorithm with a paired slow path and fast path can use >> sys_membarrier on the slow path to replace fast-path memory barriers >> by compiler barrier. >> >> Adding an explicit compiler barrier between operations allows >> cpu_opv to be used as fallback for operations meant to match >> the membarrier system call. >> >> Changes since v2: >> >> - Fix memory leak by introducing struct cpu_opv_pinned_pages. >> Suggested by Boqun Feng. >> - Cast argument 1 passed to access_ok from integer to void __user *, >> fixing sparse warning. >> --- >> MAINTAINERS | 7 + >> include/uapi/linux/cpu_opv.h | 117 ++++++ >> init/Kconfig | 14 + >> kernel/Makefile | 1 + >> kernel/cpu_opv.c | 968 +++++++++++++++++++++++++++++++++++++++++++ >> kernel/sched/core.c | 37 ++ >> kernel/sched/sched.h | 2 + >> kernel/sys_ni.c | 1 + >> 8 files changed, 1147 insertions(+) >> create mode 100644 include/uapi/linux/cpu_opv.h >> create mode 100644 kernel/cpu_opv.c >> >> diff --git a/MAINTAINERS b/MAINTAINERS >> index c9f95f8b07ed..45a1bbdaa287 100644 >> --- a/MAINTAINERS >> +++ b/MAINTAINERS >> @@ -3675,6 +3675,13 @@ B: https://bugzilla.kernel.org >> F: drivers/cpuidle/* >> F: include/linux/cpuidle.h >> >> +CPU NON-PREEMPTIBLE OPERATION VECTOR SUPPORT >> +M: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> >> +L: linux-kernel@xxxxxxxxxxxxxxx >> +S: Supported >> +F: kernel/cpu_opv.c >> +F: include/uapi/linux/cpu_opv.h >> + >> CRAMFS FILESYSTEM >> W: http://sourceforge.net/projects/cramfs/ >> S: Orphan / Obsolete >> diff --git a/include/uapi/linux/cpu_opv.h b/include/uapi/linux/cpu_opv.h >> new file mode 100644 >> index 000000000000..17f7d46e053b >> --- /dev/null >> +++ b/include/uapi/linux/cpu_opv.h >> @@ -0,0 +1,117 @@ >> +#ifndef _UAPI_LINUX_CPU_OPV_H >> +#define _UAPI_LINUX_CPU_OPV_H >> + >> +/* >> + * linux/cpu_opv.h >> + * >> + * CPU preempt-off operation vector system call API >> + * >> + * Copyright (c) 2017 Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> >> + * >> + * Permission is hereby granted, free of charge, to any person obtaining a copy >> + * of this software and associated documentation files (the "Software"), to >> deal >> + * in the Software without restriction, including without limitation the rights >> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell >> + * copies of the Software, and to permit persons to whom the Software is >> + * furnished to do so, subject to the following conditions: >> + * >> + * The above copyright notice and this permission notice shall be included in >> + * all copies or substantial portions of the Software. >> + * >> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR >> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, >> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE >> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER >> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING >> FROM, >> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN >> THE >> + * SOFTWARE. 
>> + */ >> + >> +#ifdef __KERNEL__ >> +# include <linux/types.h> >> +#else /* #ifdef __KERNEL__ */ >> +# include <stdint.h> >> +#endif /* #else #ifdef __KERNEL__ */ >> + >> +#include <asm/byteorder.h> >> + >> +#ifdef __LP64__ >> +# define CPU_OP_FIELD_u32_u64(field) uint64_t field >> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) field = (intptr_t)v >> +#elif defined(__BYTE_ORDER) ? \ >> + __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN) >> +# define CPU_OP_FIELD_u32_u64(field) uint32_t field ## _padding, field >> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) \ >> + field ## _padding = 0, field = (intptr_t)v >> +#else >> +# define CPU_OP_FIELD_u32_u64(field) uint32_t field, field ## _padding >> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) \ >> + field = (intptr_t)v, field ## _padding = 0 >> +#endif >> + >> +#define CPU_OP_VEC_LEN_MAX 16 >> +#define CPU_OP_ARG_LEN_MAX 24 >> +/* Max. data len per operation. */ >> +#define CPU_OP_DATA_LEN_MAX PAGE_SIZE >> +/* >> + * Max. data len for overall vector. We to restrict the amount of >> + * user-space data touched by the kernel in non-preemptible context so >> + * we do not introduce long scheduler latencies. >> + * This allows one copy of up to 4096 bytes, and 15 operations touching >> + * 8 bytes each. >> + * This limit is applied to the sum of length specified for all >> + * operations in a vector. >> + */ >> +#define CPU_OP_VEC_DATA_LEN_MAX (4096 + 15*8) >> +#define CPU_OP_MAX_PAGES 4 /* Max. pages per op. */ >> + >> +enum cpu_op_type { >> + CPU_COMPARE_EQ_OP, /* compare */ >> + CPU_COMPARE_NE_OP, /* compare */ >> + CPU_MEMCPY_OP, /* memcpy */ >> + CPU_ADD_OP, /* arithmetic */ >> + CPU_OR_OP, /* bitwise */ >> + CPU_AND_OP, /* bitwise */ >> + CPU_XOR_OP, /* bitwise */ >> + CPU_LSHIFT_OP, /* shift */ >> + CPU_RSHIFT_OP, /* shift */ >> + CPU_MB_OP, /* memory barrier */ >> +}; >> + >> +/* Vector of operations to perform. Limited to 16. */ >> +struct cpu_op { >> + int32_t op; /* enum cpu_op_type. */ >> + uint32_t len; /* data length, in bytes. */ >> + union { >> + struct { >> + CPU_OP_FIELD_u32_u64(a); >> + CPU_OP_FIELD_u32_u64(b); >> + uint8_t expect_fault_a; >> + uint8_t expect_fault_b; >> + } compare_op; >> + struct { >> + CPU_OP_FIELD_u32_u64(dst); >> + CPU_OP_FIELD_u32_u64(src); >> + uint8_t expect_fault_dst; >> + uint8_t expect_fault_src; >> + } memcpy_op; >> + struct { >> + CPU_OP_FIELD_u32_u64(p); >> + int64_t count; >> + uint8_t expect_fault_p; >> + } arithmetic_op; >> + struct { >> + CPU_OP_FIELD_u32_u64(p); >> + uint64_t mask; >> + uint8_t expect_fault_p; >> + } bitwise_op; >> + struct { >> + CPU_OP_FIELD_u32_u64(p); >> + uint32_t bits; >> + uint8_t expect_fault_p; >> + } shift_op; >> + char __padding[CPU_OP_ARG_LEN_MAX]; >> + } u; >> +}; >> + >> +#endif /* _UAPI_LINUX_CPU_OPV_H */ >> diff --git a/init/Kconfig b/init/Kconfig >> index cbedfb91b40a..e4fbb5dd6a24 100644 >> --- a/init/Kconfig >> +++ b/init/Kconfig >> @@ -1404,6 +1404,7 @@ config RSEQ >> bool "Enable rseq() system call" if EXPERT >> default y >> depends on HAVE_RSEQ >> + select CPU_OPV >> select MEMBARRIER >> help >> Enable the restartable sequences system call. It provides a >> @@ -1414,6 +1415,19 @@ config RSEQ >> >> If unsure, say Y. >> >> +config CPU_OPV >> + bool "Enable cpu_opv() system call" if EXPERT >> + default y >> + help >> + Enable the CPU preempt-off operation vector system call. >> + It allows user-space to perform a sequence of operations on >> + per-cpu data with preemption disabled. 
Useful as >> + single-stepping fall-back for restartable sequences, and for >> + performing more complex operations on per-cpu data that would >> + not be otherwise possible to do with restartable sequences. >> + >> + If unsure, say Y. >> + >> config EMBEDDED >> bool "Embedded system" >> option allnoconfig_y >> diff --git a/kernel/Makefile b/kernel/Makefile >> index 3574669dafd9..cac8855196ff 100644 >> --- a/kernel/Makefile >> +++ b/kernel/Makefile >> @@ -113,6 +113,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o >> >> obj-$(CONFIG_HAS_IOMEM) += memremap.o >> obj-$(CONFIG_RSEQ) += rseq.o >> +obj-$(CONFIG_CPU_OPV) += cpu_opv.o >> >> $(obj)/configs.o: $(obj)/config_data.h >> >> diff --git a/kernel/cpu_opv.c b/kernel/cpu_opv.c >> new file mode 100644 >> index 000000000000..a81837a14b17 >> --- /dev/null >> +++ b/kernel/cpu_opv.c >> @@ -0,0 +1,968 @@ >> +/* >> + * CPU preempt-off operation vector system call >> + * >> + * It allows user-space to perform a sequence of operations on per-cpu >> + * data with preemption disabled. Useful as single-stepping fall-back >> + * for restartable sequences, and for performing more complex operations >> + * on per-cpu data that would not be otherwise possible to do with >> + * restartable sequences. >> + * >> + * This program is free software; you can redistribute it and/or modify >> + * it under the terms of the GNU General Public License as published by >> + * the Free Software Foundation; either version 2 of the License, or >> + * (at your option) any later version. >> + * >> + * This program is distributed in the hope that it will be useful, >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the >> + * GNU General Public License for more details. >> + * >> + * Copyright (C) 2017, EfficiOS Inc., >> + * Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> >> + */ >> + >> +#include <linux/sched.h> >> +#include <linux/uaccess.h> >> +#include <linux/syscalls.h> >> +#include <linux/cpu_opv.h> >> +#include <linux/types.h> >> +#include <linux/mutex.h> >> +#include <linux/pagemap.h> >> +#include <asm/ptrace.h> >> +#include <asm/byteorder.h> >> + >> +#include "sched/sched.h" >> + >> +#define TMP_BUFLEN 64 >> +#define NR_PINNED_PAGES_ON_STACK 8 >> + >> +union op_fn_data { >> + uint8_t _u8; >> + uint16_t _u16; >> + uint32_t _u32; >> + uint64_t _u64; >> +#if (BITS_PER_LONG < 64) >> + uint32_t _u64_split[2]; >> +#endif >> +}; >> + >> +struct cpu_opv_pinned_pages { >> + struct page **pages; >> + size_t nr; >> + bool is_kmalloc; >> +}; >> + >> +typedef int (*op_fn_t)(union op_fn_data *data, uint64_t v, uint32_t len); >> + >> +static DEFINE_MUTEX(cpu_opv_offline_lock); >> + >> +/* >> + * The cpu_opv system call executes a vector of operations on behalf of >> + * user-space on a specific CPU with preemption disabled. It is inspired >> + * from readv() and writev() system calls which take a "struct iovec" >> + * array as argument. >> + * >> + * The operations available are: comparison, memcpy, add, or, and, xor, >> + * left shift, and right shift. The system call receives a CPU number >> + * from user-space as argument, which is the CPU on which those >> + * operations need to be performed. All preparation steps such as >> + * loading pointers, and applying offsets to arrays, need to be >> + * performed by user-space before invoking the system call. 
The >> + * "comparison" operation can be used to check that the data used in the >> + * preparation step did not change between preparation of system call >> + * inputs and operation execution within the preempt-off critical >> + * section. >> + * >> + * The reason why we require all pointer offsets to be calculated by >> + * user-space beforehand is because we need to use get_user_pages_fast() >> + * to first pin all pages touched by each operation. This takes care of >> + * faulting-in the pages. Then, preemption is disabled, and the >> + * operations are performed atomically with respect to other thread >> + * execution on that CPU, without generating any page fault. >> + * >> + * A maximum limit of 16 operations per cpu_opv syscall invocation is >> + * enforced, and a overall maximum length sum, so user-space cannot >> + * generate a too long preempt-off critical section. Each operation is >> + * also limited a length of PAGE_SIZE bytes, meaning that an operation >> + * can touch a maximum of 4 pages (memcpy: 2 pages for source, 2 pages >> + * for destination if addresses are not aligned on page boundaries). >> + * >> + * If the thread is not running on the requested CPU, a new >> + * push_task_to_cpu() is invoked to migrate the task to the requested >> + * CPU. If the requested CPU is not part of the cpus allowed mask of >> + * the thread, the system call fails with EINVAL. After the migration >> + * has been performed, preemption is disabled, and the current CPU >> + * number is checked again and compared to the requested CPU number. If >> + * it still differs, it means the scheduler migrated us away from that >> + * CPU. Return EAGAIN to user-space in that case, and let user-space >> + * retry (either requesting the same CPU number, or a different one, >> + * depending on the user-space algorithm constraints). >> + */ >> + >> +/* >> + * Check operation types and length parameters. 
>> + */ >> +static int cpu_opv_check(struct cpu_op *cpuop, int cpuopcnt) >> +{ >> + int i; >> + uint32_t sum = 0; >> + >> + for (i = 0; i < cpuopcnt; i++) { >> + struct cpu_op *op = &cpuop[i]; >> + >> + switch (op->op) { >> + case CPU_MB_OP: >> + break; >> + default: >> + sum += op->len; >> + } >> + switch (op->op) { >> + case CPU_COMPARE_EQ_OP: >> + case CPU_COMPARE_NE_OP: >> + case CPU_MEMCPY_OP: >> + if (op->len > CPU_OP_DATA_LEN_MAX) >> + return -EINVAL; >> + break; >> + case CPU_ADD_OP: >> + case CPU_OR_OP: >> + case CPU_AND_OP: >> + case CPU_XOR_OP: >> + switch (op->len) { >> + case 1: >> + case 2: >> + case 4: >> + case 8: >> + break; >> + default: >> + return -EINVAL; >> + } >> + break; >> + case CPU_LSHIFT_OP: >> + case CPU_RSHIFT_OP: >> + switch (op->len) { >> + case 1: >> + if (op->u.shift_op.bits > 7) >> + return -EINVAL; >> + break; >> + case 2: >> + if (op->u.shift_op.bits > 15) >> + return -EINVAL; >> + break; >> + case 4: >> + if (op->u.shift_op.bits > 31) >> + return -EINVAL; >> + break; >> + case 8: >> + if (op->u.shift_op.bits > 63) >> + return -EINVAL; >> + break; >> + default: >> + return -EINVAL; >> + } >> + break; >> + case CPU_MB_OP: >> + break; >> + default: >> + return -EINVAL; >> + } >> + } >> + if (sum > CPU_OP_VEC_DATA_LEN_MAX) >> + return -EINVAL; >> + return 0; >> +} >> + >> +static unsigned long cpu_op_range_nr_pages(unsigned long addr, >> + unsigned long len) >> +{ >> + return ((addr + len - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1; >> +} >> + >> +static int cpu_op_check_page(struct page *page) >> +{ >> + struct address_space *mapping; >> + >> + if (is_zone_device_page(page)) >> + return -EFAULT; >> + page = compound_head(page); >> + mapping = READ_ONCE(page->mapping); >> + if (!mapping) { >> + int shmem_swizzled; >> + >> + /* >> + * Check again with page lock held to guard against >> + * memory pressure making shmem_writepage move the page >> + * from filecache to swapcache. >> + */ >> + lock_page(page); >> + shmem_swizzled = PageSwapCache(page) || page->mapping; >> + unlock_page(page); >> + if (shmem_swizzled) >> + return -EAGAIN; >> + return -EFAULT; >> + } >> + return 0; >> +} >> + >> +/* >> + * Refusing device pages, the zero page, pages in the gate area, and >> + * special mappings. Inspired from futex.c checks. 
>> + */ >> +static int cpu_op_check_pages(struct page **pages, >> + unsigned long nr_pages) >> +{ >> + unsigned long i; >> + >> + for (i = 0; i < nr_pages; i++) { >> + int ret; >> + >> + ret = cpu_op_check_page(pages[i]); >> + if (ret) >> + return ret; >> + } >> + return 0; >> +} >> + >> +static int cpu_op_pin_pages(unsigned long addr, unsigned long len, >> + struct cpu_opv_pinned_pages *pin_pages, int write) >> +{ >> + struct page *pages[2]; >> + int ret, nr_pages; >> + >> + if (!len) >> + return 0; >> + nr_pages = cpu_op_range_nr_pages(addr, len); >> + BUG_ON(nr_pages > 2); >> + if (!pin_pages->is_kmalloc && pin_pages->nr + nr_pages >> + > NR_PINNED_PAGES_ON_STACK) { >> + struct page **pinned_pages = >> + kzalloc(CPU_OP_VEC_LEN_MAX * CPU_OP_MAX_PAGES >> + * sizeof(struct page *), GFP_KERNEL); >> + if (!pinned_pages) >> + return -ENOMEM; >> + memcpy(pinned_pages, pin_pages->pages, >> + pin_pages->nr * sizeof(struct page *)); >> + pin_pages->pages = pinned_pages; >> + pin_pages->is_kmalloc = true; >> + } >> +again: >> + ret = get_user_pages_fast(addr, nr_pages, write, pages); >> + if (ret < nr_pages) { >> + if (ret > 0) >> + put_page(pages[0]); >> + return -EFAULT; >> + } >> + /* >> + * Refuse device pages, the zero page, pages in the gate area, >> + * and special mappings. >> + */ >> + ret = cpu_op_check_pages(pages, nr_pages); >> + if (ret == -EAGAIN) { >> + put_page(pages[0]); >> + if (nr_pages > 1) >> + put_page(pages[1]); >> + goto again; >> + } >> + if (ret) >> + goto error; >> + pin_pages->pages[pin_pages->nr++] = pages[0]; >> + if (nr_pages > 1) >> + pin_pages->pages[pin_pages->nr++] = pages[1]; >> + return 0; >> + >> +error: >> + put_page(pages[0]); >> + if (nr_pages > 1) >> + put_page(pages[1]); >> + return -EFAULT; >> +} >> + >> +static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt, >> + struct cpu_opv_pinned_pages *pin_pages) >> +{ >> + int ret, i; >> + bool expect_fault = false; >> + >> + /* Check access, pin pages. 
*/ >> + for (i = 0; i < cpuopcnt; i++) { >> + struct cpu_op *op = &cpuop[i]; >> + >> + switch (op->op) { >> + case CPU_COMPARE_EQ_OP: >> + case CPU_COMPARE_NE_OP: >> + ret = -EFAULT; >> + expect_fault = op->u.compare_op.expect_fault_a; >> + if (!access_ok(VERIFY_READ, >> + (void __user *)op->u.compare_op.a, >> + op->len)) >> + goto error; >> + ret = cpu_op_pin_pages( >> + (unsigned long)op->u.compare_op.a, >> + op->len, pin_pages, 0); >> + if (ret) >> + goto error; >> + ret = -EFAULT; >> + expect_fault = op->u.compare_op.expect_fault_b; >> + if (!access_ok(VERIFY_READ, >> + (void __user *)op->u.compare_op.b, >> + op->len)) >> + goto error; >> + ret = cpu_op_pin_pages( >> + (unsigned long)op->u.compare_op.b, >> + op->len, pin_pages, 0); >> + if (ret) >> + goto error; >> + break; >> + case CPU_MEMCPY_OP: >> + ret = -EFAULT; >> + expect_fault = op->u.memcpy_op.expect_fault_dst; >> + if (!access_ok(VERIFY_WRITE, >> + (void __user *)op->u.memcpy_op.dst, >> + op->len)) >> + goto error; >> + ret = cpu_op_pin_pages( >> + (unsigned long)op->u.memcpy_op.dst, >> + op->len, pin_pages, 1); >> + if (ret) >> + goto error; >> + ret = -EFAULT; >> + expect_fault = op->u.memcpy_op.expect_fault_src; >> + if (!access_ok(VERIFY_READ, >> + (void __user *)op->u.memcpy_op.src, >> + op->len)) >> + goto error; >> + ret = cpu_op_pin_pages( >> + (unsigned long)op->u.memcpy_op.src, >> + op->len, pin_pages, 0); >> + if (ret) >> + goto error; >> + break; >> + case CPU_ADD_OP: >> + ret = -EFAULT; >> + expect_fault = op->u.arithmetic_op.expect_fault_p; >> + if (!access_ok(VERIFY_WRITE, >> + (void __user *)op->u.arithmetic_op.p, >> + op->len)) >> + goto error; >> + ret = cpu_op_pin_pages( >> + (unsigned long)op->u.arithmetic_op.p, >> + op->len, pin_pages, 1); >> + if (ret) >> + goto error; >> + break; >> + case CPU_OR_OP: >> + case CPU_AND_OP: >> + case CPU_XOR_OP: >> + ret = -EFAULT; >> + expect_fault = op->u.bitwise_op.expect_fault_p; >> + if (!access_ok(VERIFY_WRITE, >> + (void __user *)op->u.bitwise_op.p, >> + op->len)) >> + goto error; >> + ret = cpu_op_pin_pages( >> + (unsigned long)op->u.bitwise_op.p, >> + op->len, pin_pages, 1); >> + if (ret) >> + goto error; >> + break; >> + case CPU_LSHIFT_OP: >> + case CPU_RSHIFT_OP: >> + ret = -EFAULT; >> + expect_fault = op->u.shift_op.expect_fault_p; >> + if (!access_ok(VERIFY_WRITE, >> + (void __user *)op->u.shift_op.p, >> + op->len)) >> + goto error; >> + ret = cpu_op_pin_pages( >> + (unsigned long)op->u.shift_op.p, >> + op->len, pin_pages, 1); >> + if (ret) >> + goto error; >> + break; >> + case CPU_MB_OP: >> + break; >> + default: >> + return -EINVAL; >> + } >> + } >> + return 0; >> + >> +error: >> + for (i = 0; i < pin_pages->nr; i++) >> + put_page(pin_pages->pages[i]); >> + pin_pages->nr = 0; >> + /* >> + * If faulting access is expected, return EAGAIN to user-space. >> + * It allows user-space to distinguish between a fault caused by >> + * an access which is expect to fault (e.g. due to concurrent >> + * unmapping of underlying memory) from an unexpected fault from >> + * which a retry would not recover. >> + */ >> + if (ret == -EFAULT && expect_fault) >> + return -EAGAIN; >> + return ret; >> +} >> + >> +/* Return 0 if same, > 0 if different, < 0 on error. 
*/ >> +static int do_cpu_op_compare_iter(void __user *a, void __user *b, uint32_t len) >> +{ >> + char bufa[TMP_BUFLEN], bufb[TMP_BUFLEN]; >> + uint32_t compared = 0; >> + >> + while (compared != len) { >> + unsigned long to_compare; >> + >> + to_compare = min_t(uint32_t, TMP_BUFLEN, len - compared); >> + if (__copy_from_user_inatomic(bufa, a + compared, to_compare)) >> + return -EFAULT; >> + if (__copy_from_user_inatomic(bufb, b + compared, to_compare)) >> + return -EFAULT; >> + if (memcmp(bufa, bufb, to_compare)) >> + return 1; /* different */ >> + compared += to_compare; >> + } >> + return 0; /* same */ >> +} >> + >> +/* Return 0 if same, > 0 if different, < 0 on error. */ >> +static int do_cpu_op_compare(void __user *a, void __user *b, uint32_t len) >> +{ >> + int ret = -EFAULT; >> + union { >> + uint8_t _u8; >> + uint16_t _u16; >> + uint32_t _u32; >> + uint64_t _u64; >> +#if (BITS_PER_LONG < 64) >> + uint32_t _u64_split[2]; >> +#endif >> + } tmp[2]; >> + >> + pagefault_disable(); >> + switch (len) { >> + case 1: >> + if (__get_user(tmp[0]._u8, (uint8_t __user *)a)) >> + goto end; >> + if (__get_user(tmp[1]._u8, (uint8_t __user *)b)) >> + goto end; >> + ret = !!(tmp[0]._u8 != tmp[1]._u8); >> + break; >> + case 2: >> + if (__get_user(tmp[0]._u16, (uint16_t __user *)a)) >> + goto end; >> + if (__get_user(tmp[1]._u16, (uint16_t __user *)b)) >> + goto end; >> + ret = !!(tmp[0]._u16 != tmp[1]._u16); >> + break; >> + case 4: >> + if (__get_user(tmp[0]._u32, (uint32_t __user *)a)) >> + goto end; >> + if (__get_user(tmp[1]._u32, (uint32_t __user *)b)) >> + goto end; >> + ret = !!(tmp[0]._u32 != tmp[1]._u32); >> + break; >> + case 8: >> +#if (BITS_PER_LONG >= 64) >> + if (__get_user(tmp[0]._u64, (uint64_t __user *)a)) >> + goto end; >> + if (__get_user(tmp[1]._u64, (uint64_t __user *)b)) >> + goto end; >> +#else >> + if (__get_user(tmp[0]._u64_split[0], (uint32_t __user *)a)) >> + goto end; >> + if (__get_user(tmp[0]._u64_split[1], (uint32_t __user *)a + 1)) >> + goto end; >> + if (__get_user(tmp[1]._u64_split[0], (uint32_t __user *)b)) >> + goto end; >> + if (__get_user(tmp[1]._u64_split[1], (uint32_t __user *)b + 1)) >> + goto end; >> +#endif >> + ret = !!(tmp[0]._u64 != tmp[1]._u64); >> + break; >> + default: >> + pagefault_enable(); >> + return do_cpu_op_compare_iter(a, b, len); >> + } >> +end: >> + pagefault_enable(); >> + return ret; >> +} >> + >> +/* Return 0 on success, < 0 on error. */ >> +static int do_cpu_op_memcpy_iter(void __user *dst, void __user *src, >> + uint32_t len) >> +{ >> + char buf[TMP_BUFLEN]; >> + uint32_t copied = 0; >> + >> + while (copied != len) { >> + unsigned long to_copy; >> + >> + to_copy = min_t(uint32_t, TMP_BUFLEN, len - copied); >> + if (__copy_from_user_inatomic(buf, src + copied, to_copy)) >> + return -EFAULT; >> + if (__copy_to_user_inatomic(dst + copied, buf, to_copy)) >> + return -EFAULT; >> + copied += to_copy; >> + } >> + return 0; >> +} >> + >> +/* Return 0 on success, < 0 on error. 
*/ >> +static int do_cpu_op_memcpy(void __user *dst, void __user *src, uint32_t len) >> +{ >> + int ret = -EFAULT; >> + union { >> + uint8_t _u8; >> + uint16_t _u16; >> + uint32_t _u32; >> + uint64_t _u64; >> +#if (BITS_PER_LONG < 64) >> + uint32_t _u64_split[2]; >> +#endif >> + } tmp; >> + >> + pagefault_disable(); >> + switch (len) { >> + case 1: >> + if (__get_user(tmp._u8, (uint8_t __user *)src)) >> + goto end; >> + if (__put_user(tmp._u8, (uint8_t __user *)dst)) >> + goto end; >> + break; >> + case 2: >> + if (__get_user(tmp._u16, (uint16_t __user *)src)) >> + goto end; >> + if (__put_user(tmp._u16, (uint16_t __user *)dst)) >> + goto end; >> + break; >> + case 4: >> + if (__get_user(tmp._u32, (uint32_t __user *)src)) >> + goto end; >> + if (__put_user(tmp._u32, (uint32_t __user *)dst)) >> + goto end; >> + break; >> + case 8: >> +#if (BITS_PER_LONG >= 64) >> + if (__get_user(tmp._u64, (uint64_t __user *)src)) >> + goto end; >> + if (__put_user(tmp._u64, (uint64_t __user *)dst)) >> + goto end; >> +#else >> + if (__get_user(tmp._u64_split[0], (uint32_t __user *)src)) >> + goto end; >> + if (__get_user(tmp._u64_split[1], (uint32_t __user *)src + 1)) >> + goto end; >> + if (__put_user(tmp._u64_split[0], (uint32_t __user *)dst)) >> + goto end; >> + if (__put_user(tmp._u64_split[1], (uint32_t __user *)dst + 1)) >> + goto end; >> +#endif >> + break; >> + default: >> + pagefault_enable(); >> + return do_cpu_op_memcpy_iter(dst, src, len); >> + } >> + ret = 0; >> +end: >> + pagefault_enable(); >> + return ret; >> +} >> + >> +static int op_add_fn(union op_fn_data *data, uint64_t count, uint32_t len) >> +{ >> + int ret = 0; >> + >> + switch (len) { >> + case 1: >> + data->_u8 += (uint8_t)count; >> + break; >> + case 2: >> + data->_u16 += (uint16_t)count; >> + break; >> + case 4: >> + data->_u32 += (uint32_t)count; >> + break; >> + case 8: >> + data->_u64 += (uint64_t)count; >> + break; >> + default: >> + ret = -EINVAL; >> + break; >> + } >> + return ret; >> +} >> + >> +static int op_or_fn(union op_fn_data *data, uint64_t mask, uint32_t len) >> +{ >> + int ret = 0; >> + >> + switch (len) { >> + case 1: >> + data->_u8 |= (uint8_t)mask; >> + break; >> + case 2: >> + data->_u16 |= (uint16_t)mask; >> + break; >> + case 4: >> + data->_u32 |= (uint32_t)mask; >> + break; >> + case 8: >> + data->_u64 |= (uint64_t)mask; >> + break; >> + default: >> + ret = -EINVAL; >> + break; >> + } >> + return ret; >> +} >> + >> +static int op_and_fn(union op_fn_data *data, uint64_t mask, uint32_t len) >> +{ >> + int ret = 0; >> + >> + switch (len) { >> + case 1: >> + data->_u8 &= (uint8_t)mask; >> + break; >> + case 2: >> + data->_u16 &= (uint16_t)mask; >> + break; >> + case 4: >> + data->_u32 &= (uint32_t)mask; >> + break; >> + case 8: >> + data->_u64 &= (uint64_t)mask; >> + break; >> + default: >> + ret = -EINVAL; >> + break; >> + } >> + return ret; >> +} >> + >> +static int op_xor_fn(union op_fn_data *data, uint64_t mask, uint32_t len) >> +{ >> + int ret = 0; >> + >> + switch (len) { >> + case 1: >> + data->_u8 ^= (uint8_t)mask; >> + break; >> + case 2: >> + data->_u16 ^= (uint16_t)mask; >> + break; >> + case 4: >> + data->_u32 ^= (uint32_t)mask; >> + break; >> + case 8: >> + data->_u64 ^= (uint64_t)mask; >> + break; >> + default: >> + ret = -EINVAL; >> + break; >> + } >> + return ret; >> +} >> + >> +static int op_lshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len) >> +{ >> + int ret = 0; >> + >> + switch (len) { >> + case 1: >> + data->_u8 <<= (uint8_t)bits; >> + break; >> + case 2: >> + data->_u16 <<= 
(uint16_t)bits; >> + break; >> + case 4: >> + data->_u32 <<= (uint32_t)bits; >> + break; >> + case 8: >> + data->_u64 <<= (uint64_t)bits; >> + break; >> + default: >> + ret = -EINVAL; >> + break; >> + } >> + return ret; >> +} >> + >> +static int op_rshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len) >> +{ >> + int ret = 0; >> + >> + switch (len) { >> + case 1: >> + data->_u8 >>= (uint8_t)bits; >> + break; >> + case 2: >> + data->_u16 >>= (uint16_t)bits; >> + break; >> + case 4: >> + data->_u32 >>= (uint32_t)bits; >> + break; >> + case 8: >> + data->_u64 >>= (uint64_t)bits; >> + break; >> + default: >> + ret = -EINVAL; >> + break; >> + } >> + return ret; >> +} >> + >> +/* Return 0 on success, < 0 on error. */ >> +static int do_cpu_op_fn(op_fn_t op_fn, void __user *p, uint64_t v, >> + uint32_t len) >> +{ >> + int ret = -EFAULT; >> + union op_fn_data tmp; >> + >> + pagefault_disable(); >> + switch (len) { >> + case 1: >> + if (__get_user(tmp._u8, (uint8_t __user *)p)) >> + goto end; >> + if (op_fn(&tmp, v, len)) >> + goto end; >> + if (__put_user(tmp._u8, (uint8_t __user *)p)) >> + goto end; >> + break; >> + case 2: >> + if (__get_user(tmp._u16, (uint16_t __user *)p)) >> + goto end; >> + if (op_fn(&tmp, v, len)) >> + goto end; >> + if (__put_user(tmp._u16, (uint16_t __user *)p)) >> + goto end; >> + break; >> + case 4: >> + if (__get_user(tmp._u32, (uint32_t __user *)p)) >> + goto end; >> + if (op_fn(&tmp, v, len)) >> + goto end; >> + if (__put_user(tmp._u32, (uint32_t __user *)p)) >> + goto end; >> + break; >> + case 8: >> +#if (BITS_PER_LONG >= 64) >> + if (__get_user(tmp._u64, (uint64_t __user *)p)) >> + goto end; >> +#else >> + if (__get_user(tmp._u64_split[0], (uint32_t __user *)p)) >> + goto end; >> + if (__get_user(tmp._u64_split[1], (uint32_t __user *)p + 1)) >> + goto end; >> +#endif >> + if (op_fn(&tmp, v, len)) >> + goto end; >> +#if (BITS_PER_LONG >= 64) >> + if (__put_user(tmp._u64, (uint64_t __user *)p)) >> + goto end; >> +#else >> + if (__put_user(tmp._u64_split[0], (uint32_t __user *)p)) >> + goto end; >> + if (__put_user(tmp._u64_split[1], (uint32_t __user *)p + 1)) >> + goto end; >> +#endif >> + break; >> + default: >> + ret = -EINVAL; >> + goto end; >> + } >> + ret = 0; >> +end: >> + pagefault_enable(); >> + return ret; >> +} >> + >> +static int __do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt) >> +{ >> + int i, ret; >> + >> + for (i = 0; i < cpuopcnt; i++) { >> + struct cpu_op *op = &cpuop[i]; >> + >> + /* Guarantee a compiler barrier between each operation. */ >> + barrier(); >> + >> + switch (op->op) { >> + case CPU_COMPARE_EQ_OP: >> + ret = do_cpu_op_compare( >> + (void __user *)op->u.compare_op.a, >> + (void __user *)op->u.compare_op.b, >> + op->len); >> + /* Stop execution on error. */ >> + if (ret < 0) >> + return ret; >> + /* >> + * Stop execution, return op index + 1 if comparison >> + * differs. >> + */ >> + if (ret > 0) >> + return i + 1; >> + break; >> + case CPU_COMPARE_NE_OP: >> + ret = do_cpu_op_compare( >> + (void __user *)op->u.compare_op.a, >> + (void __user *)op->u.compare_op.b, >> + op->len); >> + /* Stop execution on error. */ >> + if (ret < 0) >> + return ret; >> + /* >> + * Stop execution, return op index + 1 if comparison >> + * is identical. >> + */ >> + if (ret == 0) >> + return i + 1; >> + break; >> + case CPU_MEMCPY_OP: >> + ret = do_cpu_op_memcpy( >> + (void __user *)op->u.memcpy_op.dst, >> + (void __user *)op->u.memcpy_op.src, >> + op->len); >> + /* Stop execution on error. 
*/ >> + if (ret) >> + return ret; >> + break; >> + case CPU_ADD_OP: >> + ret = do_cpu_op_fn(op_add_fn, >> + (void __user *)op->u.arithmetic_op.p, >> + op->u.arithmetic_op.count, op->len); >> + /* Stop execution on error. */ >> + if (ret) >> + return ret; >> + break; >> + case CPU_OR_OP: >> + ret = do_cpu_op_fn(op_or_fn, >> + (void __user *)op->u.bitwise_op.p, >> + op->u.bitwise_op.mask, op->len); >> + /* Stop execution on error. */ >> + if (ret) >> + return ret; >> + break; >> + case CPU_AND_OP: >> + ret = do_cpu_op_fn(op_and_fn, >> + (void __user *)op->u.bitwise_op.p, >> + op->u.bitwise_op.mask, op->len); >> + /* Stop execution on error. */ >> + if (ret) >> + return ret; >> + break; >> + case CPU_XOR_OP: >> + ret = do_cpu_op_fn(op_xor_fn, >> + (void __user *)op->u.bitwise_op.p, >> + op->u.bitwise_op.mask, op->len); >> + /* Stop execution on error. */ >> + if (ret) >> + return ret; >> + break; >> + case CPU_LSHIFT_OP: >> + ret = do_cpu_op_fn(op_lshift_fn, >> + (void __user *)op->u.shift_op.p, >> + op->u.shift_op.bits, op->len); >> + /* Stop execution on error. */ >> + if (ret) >> + return ret; >> + break; >> + case CPU_RSHIFT_OP: >> + ret = do_cpu_op_fn(op_rshift_fn, >> + (void __user *)op->u.shift_op.p, >> + op->u.shift_op.bits, op->len); >> + /* Stop execution on error. */ >> + if (ret) >> + return ret; >> + break; >> + case CPU_MB_OP: >> + smp_mb(); >> + break; >> + default: >> + return -EINVAL; >> + } >> + } >> + return 0; >> +} >> + >> +static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt, int cpu) >> +{ >> + int ret; >> + >> + if (cpu != raw_smp_processor_id()) { >> + ret = push_task_to_cpu(current, cpu); >> + if (ret) >> + goto check_online; >> + } >> + preempt_disable(); >> + if (cpu != smp_processor_id()) { >> + ret = -EAGAIN; >> + goto end; >> + } >> + ret = __do_cpu_opv(cpuop, cpuopcnt); >> +end: >> + preempt_enable(); >> + return ret; >> + >> +check_online: >> + if (!cpu_possible(cpu)) >> + return -EINVAL; >> + get_online_cpus(); >> + if (cpu_online(cpu)) { >> + ret = -EAGAIN; >> + goto put_online_cpus; >> + } >> + /* >> + * CPU is offline. Perform operation from the current CPU with >> + * cpu_online read lock held, preventing that CPU from coming online, >> + * and with mutex held, providing mutual exclusion against other >> + * CPUs also finding out about an offline CPU. >> + */ >> + mutex_lock(&cpu_opv_offline_lock); >> + ret = __do_cpu_opv(cpuop, cpuopcnt); >> + mutex_unlock(&cpu_opv_offline_lock); >> +put_online_cpus: >> + put_online_cpus(); >> + return ret; >> +} >> + >> +/* >> + * cpu_opv - execute operation vector on a given CPU with preempt off. >> + * >> + * Userspace should pass current CPU number as parameter. May fail with >> + * -EAGAIN if currently executing on the wrong CPU. 
>> + */ >> +SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt, >> + int, cpu, int, flags) >> +{ >> + struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX]; >> + struct page *pinned_pages_on_stack[NR_PINNED_PAGES_ON_STACK]; >> + struct cpu_opv_pinned_pages pin_pages = { >> + .pages = pinned_pages_on_stack, >> + .nr = 0, >> + .is_kmalloc = false, >> + }; >> + int ret, i; >> + >> + if (unlikely(flags)) >> + return -EINVAL; >> + if (unlikely(cpu < 0)) >> + return -EINVAL; >> + if (cpuopcnt < 0 || cpuopcnt > CPU_OP_VEC_LEN_MAX) >> + return -EINVAL; >> + if (copy_from_user(cpuopv, ucpuopv, cpuopcnt * sizeof(struct cpu_op))) >> + return -EFAULT; >> + ret = cpu_opv_check(cpuopv, cpuopcnt); >> + if (ret) >> + return ret; >> + ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &pin_pages); >> + if (ret) >> + goto end; >> + ret = do_cpu_opv(cpuopv, cpuopcnt, cpu); >> + for (i = 0; i < pin_pages.nr; i++) >> + put_page(pin_pages.pages[i]); >> +end: >> + if (pin_pages.is_kmalloc) >> + kfree(pin_pages.pages); >> + return ret; >> +} >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index 6bba05f47e51..e547f93a46c2 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -1052,6 +1052,43 @@ void do_set_cpus_allowed(struct task_struct *p, const >> struct cpumask *new_mask) >> set_curr_task(rq, p); >> } >> >> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu) >> +{ >> + struct rq_flags rf; >> + struct rq *rq; >> + int ret = 0; >> + >> + rq = task_rq_lock(p, &rf); >> + update_rq_clock(rq); >> + >> + if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) { >> + ret = -EINVAL; >> + goto out; >> + } >> + >> + if (task_cpu(p) == dest_cpu) >> + goto out; >> + >> + if (task_running(rq, p) || p->state == TASK_WAKING) { >> + struct migration_arg arg = { p, dest_cpu }; >> + /* Need help from migration thread: drop lock and wait. */ >> + task_rq_unlock(rq, p, &rf); >> + stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg); >> + tlb_migrate_finish(p->mm); >> + return 0; >> + } else if (task_on_rq_queued(p)) { >> + /* >> + * OK, since we're going to drop the lock immediately >> + * afterwards anyway. >> + */ >> + rq = move_queued_task(rq, &rf, p, dest_cpu); >> + } >> +out: >> + task_rq_unlock(rq, p, &rf); >> + >> + return ret; >> +} >> + >> /* >> * Change a given task's CPU affinity. Migrate the thread to a >> * proper CPU and schedule it away if the CPU it's executing on >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >> index 3b448ba82225..cab256c1720a 100644 >> --- a/kernel/sched/sched.h >> +++ b/kernel/sched/sched.h >> @@ -1209,6 +1209,8 @@ static inline void __set_task_cpu(struct task_struct *p, >> unsigned int cpu) >> #endif >> } >> >> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu); >> + >> /* >> * Tunables that become constants when CONFIG_SCHED_DEBUG is off: >> */ >> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c >> index bfa1ee1bf669..59e622296dc3 100644 >> --- a/kernel/sys_ni.c >> +++ b/kernel/sys_ni.c >> @@ -262,3 +262,4 @@ cond_syscall(sys_pkey_free); >> >> /* restartable sequence */ >> cond_syscall(sys_rseq); >> +cond_syscall(sys_cpu_opv); >> -- >> 2.11.0 >> >> >> > > > > -- > Michael Kerrisk > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ > Linux/UNIX System Programming Training: http://man7.org/training/ -- Mathieu Desnoyers EfficiOS Inc. 
http://www.efficios.com
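For completeness, a minimal user-space caller of cpu_opv() used as a per-cpu counter fallback (a sketch only, not taken from the patch series: it assumes the uapi header above, an allocated __NR_cpu_opv syscall number, glibc's sched_getcpu(), and made-up names):

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/cpu_opv.h>	/* uapi header added by this patch */

/*
 * Add "count" to the 64-bit counter of the CPU the caller currently
 * runs on. "counters" is indexed by CPU number. Returns 0 on success,
 * -1 on unexpected error.
 */
static int percpu_counter_add(int64_t *counters, int64_t count)
{
	for (;;) {
		int cpu = sched_getcpu();
		struct cpu_op op;

		if (cpu < 0)
			return -1;
		memset(&op, 0, sizeof(op));	/* zero expect_fault_p and padding */
		op.op = CPU_ADD_OP;
		op.len = sizeof(int64_t);
		CPU_OP_FIELD_u32_u64_INIT_ONSTACK(op.u.arithmetic_op.p, &counters[cpu]);
		op.u.arithmetic_op.count = count;
		if (!syscall(__NR_cpu_opv, &op, 1, cpu, 0))
			return 0;
		if (errno == EAGAIN)
			continue;	/* migration or hotplug race: re-read the CPU and retry */
		return -1;		/* e.g. EINVAL or unexpected EFAULT */
	}
}

In practice this would be the slow path behind an rseq fast path; the retry loop mirrors the EAGAIN semantics described in the patch.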