----- On Nov 15, 2017, at 2:44 AM, Michael Kerrisk mtk.manpages@xxxxxxxxx wrote: > Hi Mathieu > > On 14 November 2017 at 21:03, Mathieu Desnoyers > <mathieu.desnoyers@xxxxxxxxxxxx> wrote: >> This new cpu_opv system call executes a vector of operations on behalf >> of user-space on a specific CPU with preemption disabled. It is inspired >> by the readv() and writev() system calls, which take a "struct iovec" array >> as argument. > > Do you have a man page for this syscall already? Hi Michael, It's the next thing on my roadmap when the syscall reaches mainline. That, and updates to the membarrier commands man pages. Thanks, Mathieu > > Thanks, > > Michael > > >> The operations available are: comparison, memcpy, add, or, and, xor, >> left shift, right shift, and mb. The system call receives a CPU number >> from user-space as argument, which is the CPU on which those operations >> need to be performed. All preparation steps such as loading pointers, >> and applying offsets to arrays, need to be performed by user-space >> before invoking the system call. The "comparison" operation can be used >> to check that the data used in the preparation step did not change >> between preparation of system call inputs and operation execution within >> the preempt-off critical section. >> >> The reason why we require all pointer offsets to be calculated by >> user-space beforehand is because we need to use get_user_pages_fast() to >> first pin all pages touched by each operation. This takes care of >> faulting-in the pages. Then, preemption is disabled, and the operations >> are performed atomically with respect to other thread execution on that >> CPU, without generating any page fault. >> >> A maximum limit of 16 operations per cpu_opv syscall invocation is >> enforced, so that user-space cannot generate an overly long preempt-off critical >> section. Each operation is also limited to a length of PAGE_SIZE bytes, >> meaning that an operation can touch a maximum of 4 pages (memcpy: 2 >> pages for source, 2 pages for destination if addresses are not aligned >> on page boundaries). Moreover, a total limit of 4216 bytes is applied >> to the sum of all operation lengths. >> >> If the thread is not running on the requested CPU, a new >> push_task_to_cpu() is invoked to migrate the task to the requested CPU. >> If the requested CPU is not part of the cpus allowed mask of the thread, >> the system call fails with EINVAL. After the migration has been >> performed, preemption is disabled, and the current CPU number is checked >> again and compared to the requested CPU number. If it still differs, it >> means the scheduler migrated us away from that CPU. Return EAGAIN to >> user-space in that case, and let user-space retry (either requesting the >> same CPU number, or a different one, depending on the user-space >> algorithm constraints). >> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> >> CC: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx> >> CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx> >> CC: Paul Turner <pjt@xxxxxxxxxx> >> CC: Thomas Gleixner <tglx@xxxxxxxxxxxxx> >> CC: Andrew Hunter <ahh@xxxxxxxxxx> >> CC: Andy Lutomirski <luto@xxxxxxxxxxxxxx> >> CC: Andi Kleen <andi@xxxxxxxxxxxxxx> >> CC: Dave Watson <davejwatson@xxxxxx> >> CC: Chris Lameter <cl@xxxxxxxxx> >> CC: Ingo Molnar <mingo@xxxxxxxxxx> >> CC: "H.
Peter Anvin" <hpa@xxxxxxxxx> >> CC: Ben Maurer <bmaurer@xxxxxx> >> CC: Steven Rostedt <rostedt@xxxxxxxxxxx> >> CC: Josh Triplett <josh@xxxxxxxxxxxxxxxx> >> CC: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> >> CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> >> CC: Russell King <linux@xxxxxxxxxxxxxxxx> >> CC: Catalin Marinas <catalin.marinas@xxxxxxx> >> CC: Will Deacon <will.deacon@xxxxxxx> >> CC: Michael Kerrisk <mtk.manpages@xxxxxxxxx> >> CC: Boqun Feng <boqun.feng@xxxxxxxxx> >> CC: linux-api@xxxxxxxxxxxxxxx >> --- >> >> Changes since v1: >> - handle CPU hotplug, >> - cleanup implementation using function pointers: We can use function >> pointers to implement the operations rather than duplicating all the >> user-access code. >> - refuse device pages: Performing cpu_opv operations on io map'd pages >> with preemption disabled could generate long preempt-off critical >> sections, which leads to unwanted scheduler latency. Return EFAULT if >> a device page is received as parameter >> - restrict op vector to 4216 bytes length sum: Restrict the operation >> vector to length sum of: >> - 4096 bytes (typical page size on most architectures, should be >> enough for a string, or structures) >> - 15 * 8 bytes (typical operations on integers or pointers). >> The goal here is to keep the duration of preempt off critical section >> short, so we don't add significant scheduler latency. >> - Add INIT_ONSTACK macro: Introduce the >> CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users >> correctly initialize the upper bits of CPU_OP_FIELD_u32_u64() on their >> stack to 0 on 32-bit architectures. >> - Add CPU_MB_OP operation: >> Use-cases with: >> - two consecutive stores, >> - a mempcy followed by a store, >> require a memory barrier before the final store operation. A typical >> use-case is a store-release on the final store. Given that this is a >> slow path, just providing an explicit full barrier instruction should >> be sufficient. >> - Add expect fault field: >> The use-case of list_pop brings interesting challenges. With rseq, we >> can use rseq_cmpnev_storeoffp_load(), and therefore load a pointer, >> compare it against NULL, add an offset, and load the target "next" >> pointer from the object, all within a single req critical section. >> >> Life is not so easy for cpu_opv in this use-case, mainly because we >> need to pin all pages we are going to touch in the preempt-off >> critical section beforehand. So we need to know the target object (in >> which we apply an offset to fetch the next pointer) when we pin pages >> before disabling preemption. >> >> So the approach is to load the head pointer and compare it against >> NULL in user-space, before doing the cpu_opv syscall. User-space can >> then compute the address of the head->next field, *without loading it*. >> >> The cpu_opv system call will first need to pin all pages associated >> with input data. This includes the page backing the head->next object, >> which may have been concurrently deallocated and unmapped. Therefore, >> in this case, getting -EFAULT when trying to pin those pages may >> happen: it just means they have been concurrently unmapped. This is >> an expected situation, and should just return -EAGAIN to user-space, >> to user-space can distinguish between "should retry" type of >> situations and actual errors that should be handled with extreme >> prejudice to the program (e.g. abort()). 
>> >> Therefore, add "expect_fault" fields along with op input address >> pointers, so user-space can identify whether a fault when getting a >> field should return EAGAIN rather than EFAULT. >> - Add compiler barrier between operations: Adding a compiler barrier >> between store operations in a cpu_opv sequence can be useful when >> paired with membarrier system call. >> >> An algorithm with a paired slow path and fast path can use >> sys_membarrier on the slow path to replace fast-path memory barriers >> by compiler barrier. >> >> Adding an explicit compiler barrier between operations allows >> cpu_opv to be used as fallback for operations meant to match >> the membarrier system call. >> >> Changes since v2: >> >> - Fix memory leak by introducing struct cpu_opv_pinned_pages. >> Suggested by Boqun Feng. >> - Cast argument 1 passed to access_ok from integer to void __user *, >> fixing sparse warning. >> --- >> MAINTAINERS | 7 + >> include/uapi/linux/cpu_opv.h | 117 ++++++ >> init/Kconfig | 14 + >> kernel/Makefile | 1 + >> kernel/cpu_opv.c | 968 +++++++++++++++++++++++++++++++++++++++++++ >> kernel/sched/core.c | 37 ++ >> kernel/sched/sched.h | 2 + >> kernel/sys_ni.c | 1 + >> 8 files changed, 1147 insertions(+) >> create mode 100644 include/uapi/linux/cpu_opv.h >> create mode 100644 kernel/cpu_opv.c >> >> diff --git a/MAINTAINERS b/MAINTAINERS >> index c9f95f8b07ed..45a1bbdaa287 100644 >> --- a/MAINTAINERS >> +++ b/MAINTAINERS >> @@ -3675,6 +3675,13 @@ B: https://bugzilla.kernel.org >> F: drivers/cpuidle/* >> F: include/linux/cpuidle.h >> >> +CPU NON-PREEMPTIBLE OPERATION VECTOR SUPPORT >> +M: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> >> +L: linux-kernel@xxxxxxxxxxxxxxx >> +S: Supported >> +F: kernel/cpu_opv.c >> +F: include/uapi/linux/cpu_opv.h >> + >> CRAMFS FILESYSTEM >> W: http://sourceforge.net/projects/cramfs/ >> S: Orphan / Obsolete >> diff --git a/include/uapi/linux/cpu_opv.h b/include/uapi/linux/cpu_opv.h >> new file mode 100644 >> index 000000000000..17f7d46e053b >> --- /dev/null >> +++ b/include/uapi/linux/cpu_opv.h >> @@ -0,0 +1,117 @@ >> +#ifndef _UAPI_LINUX_CPU_OPV_H >> +#define _UAPI_LINUX_CPU_OPV_H >> + >> +/* >> + * linux/cpu_opv.h >> + * >> + * CPU preempt-off operation vector system call API >> + * >> + * Copyright (c) 2017 Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> >> + * >> + * Permission is hereby granted, free of charge, to any person obtaining a copy >> + * of this software and associated documentation files (the "Software"), to >> deal >> + * in the Software without restriction, including without limitation the rights >> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell >> + * copies of the Software, and to permit persons to whom the Software is >> + * furnished to do so, subject to the following conditions: >> + * >> + * The above copyright notice and this permission notice shall be included in >> + * all copies or substantial portions of the Software. >> + * >> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR >> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, >> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE >> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER >> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING >> FROM, >> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN >> THE >> + * SOFTWARE. 
>> + */ >> + >> +#ifdef __KERNEL__ >> +# include <linux/types.h> >> +#else /* #ifdef __KERNEL__ */ >> +# include <stdint.h> >> +#endif /* #else #ifdef __KERNEL__ */ >> + >> +#include <asm/byteorder.h> >> + >> +#ifdef __LP64__ >> +# define CPU_OP_FIELD_u32_u64(field) uint64_t field >> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) field = (intptr_t)v >> +#elif defined(__BYTE_ORDER) ? \ >> + __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN) >> +# define CPU_OP_FIELD_u32_u64(field) uint32_t field ## _padding, field >> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) \ >> + field ## _padding = 0, field = (intptr_t)v >> +#else >> +# define CPU_OP_FIELD_u32_u64(field) uint32_t field, field ## _padding >> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v) \ >> + field = (intptr_t)v, field ## _padding = 0 >> +#endif >> + >> +#define CPU_OP_VEC_LEN_MAX 16 >> +#define CPU_OP_ARG_LEN_MAX 24 >> +/* Max. data len per operation. */ >> +#define CPU_OP_DATA_LEN_MAX PAGE_SIZE >> +/* >> + * Max. data len for overall vector. We to restrict the amount of >> + * user-space data touched by the kernel in non-preemptible context so >> + * we do not introduce long scheduler latencies. >> + * This allows one copy of up to 4096 bytes, and 15 operations touching >> + * 8 bytes each. >> + * This limit is applied to the sum of length specified for all >> + * operations in a vector. >> + */ >> +#define CPU_OP_VEC_DATA_LEN_MAX (4096 + 15*8) >> +#define CPU_OP_MAX_PAGES 4 /* Max. pages per op. */ >> + >> +enum cpu_op_type { >> + CPU_COMPARE_EQ_OP, /* compare */ >> + CPU_COMPARE_NE_OP, /* compare */ >> + CPU_MEMCPY_OP, /* memcpy */ >> + CPU_ADD_OP, /* arithmetic */ >> + CPU_OR_OP, /* bitwise */ >> + CPU_AND_OP, /* bitwise */ >> + CPU_XOR_OP, /* bitwise */ >> + CPU_LSHIFT_OP, /* shift */ >> + CPU_RSHIFT_OP, /* shift */ >> + CPU_MB_OP, /* memory barrier */ >> +}; >> + >> +/* Vector of operations to perform. Limited to 16. */ >> +struct cpu_op { >> + int32_t op; /* enum cpu_op_type. */ >> + uint32_t len; /* data length, in bytes. */ >> + union { >> + struct { >> + CPU_OP_FIELD_u32_u64(a); >> + CPU_OP_FIELD_u32_u64(b); >> + uint8_t expect_fault_a; >> + uint8_t expect_fault_b; >> + } compare_op; >> + struct { >> + CPU_OP_FIELD_u32_u64(dst); >> + CPU_OP_FIELD_u32_u64(src); >> + uint8_t expect_fault_dst; >> + uint8_t expect_fault_src; >> + } memcpy_op; >> + struct { >> + CPU_OP_FIELD_u32_u64(p); >> + int64_t count; >> + uint8_t expect_fault_p; >> + } arithmetic_op; >> + struct { >> + CPU_OP_FIELD_u32_u64(p); >> + uint64_t mask; >> + uint8_t expect_fault_p; >> + } bitwise_op; >> + struct { >> + CPU_OP_FIELD_u32_u64(p); >> + uint32_t bits; >> + uint8_t expect_fault_p; >> + } shift_op; >> + char __padding[CPU_OP_ARG_LEN_MAX]; >> + } u; >> +}; >> + >> +#endif /* _UAPI_LINUX_CPU_OPV_H */ >> diff --git a/init/Kconfig b/init/Kconfig >> index cbedfb91b40a..e4fbb5dd6a24 100644 >> --- a/init/Kconfig >> +++ b/init/Kconfig >> @@ -1404,6 +1404,7 @@ config RSEQ >> bool "Enable rseq() system call" if EXPERT >> default y >> depends on HAVE_RSEQ >> + select CPU_OPV >> select MEMBARRIER >> help >> Enable the restartable sequences system call. It provides a >> @@ -1414,6 +1415,19 @@ config RSEQ >> >> If unsure, say Y. >> >> +config CPU_OPV >> + bool "Enable cpu_opv() system call" if EXPERT >> + default y >> + help >> + Enable the CPU preempt-off operation vector system call. >> + It allows user-space to perform a sequence of operations on >> + per-cpu data with preemption disabled. 
Useful as >> + single-stepping fall-back for restartable sequences, and for >> + performing more complex operations on per-cpu data that would >> + not be otherwise possible to do with restartable sequences. >> + >> + If unsure, say Y. >> + >> config EMBEDDED >> bool "Embedded system" >> option allnoconfig_y >> diff --git a/kernel/Makefile b/kernel/Makefile >> index 3574669dafd9..cac8855196ff 100644 >> --- a/kernel/Makefile >> +++ b/kernel/Makefile >> @@ -113,6 +113,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o >> >> obj-$(CONFIG_HAS_IOMEM) += memremap.o >> obj-$(CONFIG_RSEQ) += rseq.o >> +obj-$(CONFIG_CPU_OPV) += cpu_opv.o >> >> $(obj)/configs.o: $(obj)/config_data.h >> >> diff --git a/kernel/cpu_opv.c b/kernel/cpu_opv.c >> new file mode 100644 >> index 000000000000..a81837a14b17 >> --- /dev/null >> +++ b/kernel/cpu_opv.c >> @@ -0,0 +1,968 @@ >> +/* >> + * CPU preempt-off operation vector system call >> + * >> + * It allows user-space to perform a sequence of operations on per-cpu >> + * data with preemption disabled. Useful as single-stepping fall-back >> + * for restartable sequences, and for performing more complex operations >> + * on per-cpu data that would not be otherwise possible to do with >> + * restartable sequences. >> + * >> + * This program is free software; you can redistribute it and/or modify >> + * it under the terms of the GNU General Public License as published by >> + * the Free Software Foundation; either version 2 of the License, or >> + * (at your option) any later version. >> + * >> + * This program is distributed in the hope that it will be useful, >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the >> + * GNU General Public License for more details. >> + * >> + * Copyright (C) 2017, EfficiOS Inc., >> + * Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> >> + */ >> + >> +#include <linux/sched.h> >> +#include <linux/uaccess.h> >> +#include <linux/syscalls.h> >> +#include <linux/cpu_opv.h> >> +#include <linux/types.h> >> +#include <linux/mutex.h> >> +#include <linux/pagemap.h> >> +#include <asm/ptrace.h> >> +#include <asm/byteorder.h> >> + >> +#include "sched/sched.h" >> + >> +#define TMP_BUFLEN 64 >> +#define NR_PINNED_PAGES_ON_STACK 8 >> + >> +union op_fn_data { >> + uint8_t _u8; >> + uint16_t _u16; >> + uint32_t _u32; >> + uint64_t _u64; >> +#if (BITS_PER_LONG < 64) >> + uint32_t _u64_split[2]; >> +#endif >> +}; >> + >> +struct cpu_opv_pinned_pages { >> + struct page **pages; >> + size_t nr; >> + bool is_kmalloc; >> +}; >> + >> +typedef int (*op_fn_t)(union op_fn_data *data, uint64_t v, uint32_t len); >> + >> +static DEFINE_MUTEX(cpu_opv_offline_lock); >> + >> +/* >> + * The cpu_opv system call executes a vector of operations on behalf of >> + * user-space on a specific CPU with preemption disabled. It is inspired >> + * from readv() and writev() system calls which take a "struct iovec" >> + * array as argument. >> + * >> + * The operations available are: comparison, memcpy, add, or, and, xor, >> + * left shift, and right shift. The system call receives a CPU number >> + * from user-space as argument, which is the CPU on which those >> + * operations need to be performed. All preparation steps such as >> + * loading pointers, and applying offsets to arrays, need to be >> + * performed by user-space before invoking the system call. 
The >> + * "comparison" operation can be used to check that the data used in the >> + * preparation step did not change between preparation of system call >> + * inputs and operation execution within the preempt-off critical >> + * section. >> + * >> + * The reason why we require all pointer offsets to be calculated by >> + * user-space beforehand is because we need to use get_user_pages_fast() >> + * to first pin all pages touched by each operation. This takes care of >> + * faulting-in the pages. Then, preemption is disabled, and the >> + * operations are performed atomically with respect to other thread >> + * execution on that CPU, without generating any page fault. >> + * >> + * A maximum limit of 16 operations per cpu_opv syscall invocation is >> + * enforced, and a overall maximum length sum, so user-space cannot >> + * generate a too long preempt-off critical section. Each operation is >> + * also limited a length of PAGE_SIZE bytes, meaning that an operation >> + * can touch a maximum of 4 pages (memcpy: 2 pages for source, 2 pages >> + * for destination if addresses are not aligned on page boundaries). >> + * >> + * If the thread is not running on the requested CPU, a new >> + * push_task_to_cpu() is invoked to migrate the task to the requested >> + * CPU. If the requested CPU is not part of the cpus allowed mask of >> + * the thread, the system call fails with EINVAL. After the migration >> + * has been performed, preemption is disabled, and the current CPU >> + * number is checked again and compared to the requested CPU number. If >> + * it still differs, it means the scheduler migrated us away from that >> + * CPU. Return EAGAIN to user-space in that case, and let user-space >> + * retry (either requesting the same CPU number, or a different one, >> + * depending on the user-space algorithm constraints). >> + */ >> + >> +/* >> + * Check operation types and length parameters. 
>> + */ >> +static int cpu_opv_check(struct cpu_op *cpuop, int cpuopcnt) >> +{ >> + int i; >> + uint32_t sum = 0; >> + >> + for (i = 0; i < cpuopcnt; i++) { >> + struct cpu_op *op = &cpuop[i]; >> + >> + switch (op->op) { >> + case CPU_MB_OP: >> + break; >> + default: >> + sum += op->len; >> + } >> + switch (op->op) { >> + case CPU_COMPARE_EQ_OP: >> + case CPU_COMPARE_NE_OP: >> + case CPU_MEMCPY_OP: >> + if (op->len > CPU_OP_DATA_LEN_MAX) >> + return -EINVAL; >> + break; >> + case CPU_ADD_OP: >> + case CPU_OR_OP: >> + case CPU_AND_OP: >> + case CPU_XOR_OP: >> + switch (op->len) { >> + case 1: >> + case 2: >> + case 4: >> + case 8: >> + break; >> + default: >> + return -EINVAL; >> + } >> + break; >> + case CPU_LSHIFT_OP: >> + case CPU_RSHIFT_OP: >> + switch (op->len) { >> + case 1: >> + if (op->u.shift_op.bits > 7) >> + return -EINVAL; >> + break; >> + case 2: >> + if (op->u.shift_op.bits > 15) >> + return -EINVAL; >> + break; >> + case 4: >> + if (op->u.shift_op.bits > 31) >> + return -EINVAL; >> + break; >> + case 8: >> + if (op->u.shift_op.bits > 63) >> + return -EINVAL; >> + break; >> + default: >> + return -EINVAL; >> + } >> + break; >> + case CPU_MB_OP: >> + break; >> + default: >> + return -EINVAL; >> + } >> + } >> + if (sum > CPU_OP_VEC_DATA_LEN_MAX) >> + return -EINVAL; >> + return 0; >> +} >> + >> +static unsigned long cpu_op_range_nr_pages(unsigned long addr, >> + unsigned long len) >> +{ >> + return ((addr + len - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1; >> +} >> + >> +static int cpu_op_check_page(struct page *page) >> +{ >> + struct address_space *mapping; >> + >> + if (is_zone_device_page(page)) >> + return -EFAULT; >> + page = compound_head(page); >> + mapping = READ_ONCE(page->mapping); >> + if (!mapping) { >> + int shmem_swizzled; >> + >> + /* >> + * Check again with page lock held to guard against >> + * memory pressure making shmem_writepage move the page >> + * from filecache to swapcache. >> + */ >> + lock_page(page); >> + shmem_swizzled = PageSwapCache(page) || page->mapping; >> + unlock_page(page); >> + if (shmem_swizzled) >> + return -EAGAIN; >> + return -EFAULT; >> + } >> + return 0; >> +} >> + >> +/* >> + * Refusing device pages, the zero page, pages in the gate area, and >> + * special mappings. Inspired from futex.c checks. 
>> + */ >> +static int cpu_op_check_pages(struct page **pages, >> + unsigned long nr_pages) >> +{ >> + unsigned long i; >> + >> + for (i = 0; i < nr_pages; i++) { >> + int ret; >> + >> + ret = cpu_op_check_page(pages[i]); >> + if (ret) >> + return ret; >> + } >> + return 0; >> +} >> + >> +static int cpu_op_pin_pages(unsigned long addr, unsigned long len, >> + struct cpu_opv_pinned_pages *pin_pages, int write) >> +{ >> + struct page *pages[2]; >> + int ret, nr_pages; >> + >> + if (!len) >> + return 0; >> + nr_pages = cpu_op_range_nr_pages(addr, len); >> + BUG_ON(nr_pages > 2); >> + if (!pin_pages->is_kmalloc && pin_pages->nr + nr_pages >> + > NR_PINNED_PAGES_ON_STACK) { >> + struct page **pinned_pages = >> + kzalloc(CPU_OP_VEC_LEN_MAX * CPU_OP_MAX_PAGES >> + * sizeof(struct page *), GFP_KERNEL); >> + if (!pinned_pages) >> + return -ENOMEM; >> + memcpy(pinned_pages, pin_pages->pages, >> + pin_pages->nr * sizeof(struct page *)); >> + pin_pages->pages = pinned_pages; >> + pin_pages->is_kmalloc = true; >> + } >> +again: >> + ret = get_user_pages_fast(addr, nr_pages, write, pages); >> + if (ret < nr_pages) { >> + if (ret > 0) >> + put_page(pages[0]); >> + return -EFAULT; >> + } >> + /* >> + * Refuse device pages, the zero page, pages in the gate area, >> + * and special mappings. >> + */ >> + ret = cpu_op_check_pages(pages, nr_pages); >> + if (ret == -EAGAIN) { >> + put_page(pages[0]); >> + if (nr_pages > 1) >> + put_page(pages[1]); >> + goto again; >> + } >> + if (ret) >> + goto error; >> + pin_pages->pages[pin_pages->nr++] = pages[0]; >> + if (nr_pages > 1) >> + pin_pages->pages[pin_pages->nr++] = pages[1]; >> + return 0; >> + >> +error: >> + put_page(pages[0]); >> + if (nr_pages > 1) >> + put_page(pages[1]); >> + return -EFAULT; >> +} >> + >> +static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt, >> + struct cpu_opv_pinned_pages *pin_pages) >> +{ >> + int ret, i; >> + bool expect_fault = false; >> + >> + /* Check access, pin pages. 
*/ >> + for (i = 0; i < cpuopcnt; i++) { >> + struct cpu_op *op = &cpuop[i]; >> + >> + switch (op->op) { >> + case CPU_COMPARE_EQ_OP: >> + case CPU_COMPARE_NE_OP: >> + ret = -EFAULT; >> + expect_fault = op->u.compare_op.expect_fault_a; >> + if (!access_ok(VERIFY_READ, >> + (void __user *)op->u.compare_op.a, >> + op->len)) >> + goto error; >> + ret = cpu_op_pin_pages( >> + (unsigned long)op->u.compare_op.a, >> + op->len, pin_pages, 0); >> + if (ret) >> + goto error; >> + ret = -EFAULT; >> + expect_fault = op->u.compare_op.expect_fault_b; >> + if (!access_ok(VERIFY_READ, >> + (void __user *)op->u.compare_op.b, >> + op->len)) >> + goto error; >> + ret = cpu_op_pin_pages( >> + (unsigned long)op->u.compare_op.b, >> + op->len, pin_pages, 0); >> + if (ret) >> + goto error; >> + break; >> + case CPU_MEMCPY_OP: >> + ret = -EFAULT; >> + expect_fault = op->u.memcpy_op.expect_fault_dst; >> + if (!access_ok(VERIFY_WRITE, >> + (void __user *)op->u.memcpy_op.dst, >> + op->len)) >> + goto error; >> + ret = cpu_op_pin_pages( >> + (unsigned long)op->u.memcpy_op.dst, >> + op->len, pin_pages, 1); >> + if (ret) >> + goto error; >> + ret = -EFAULT; >> + expect_fault = op->u.memcpy_op.expect_fault_src; >> + if (!access_ok(VERIFY_READ, >> + (void __user *)op->u.memcpy_op.src, >> + op->len)) >> + goto error; >> + ret = cpu_op_pin_pages( >> + (unsigned long)op->u.memcpy_op.src, >> + op->len, pin_pages, 0); >> + if (ret) >> + goto error; >> + break; >> + case CPU_ADD_OP: >> + ret = -EFAULT; >> + expect_fault = op->u.arithmetic_op.expect_fault_p; >> + if (!access_ok(VERIFY_WRITE, >> + (void __user *)op->u.arithmetic_op.p, >> + op->len)) >> + goto error; >> + ret = cpu_op_pin_pages( >> + (unsigned long)op->u.arithmetic_op.p, >> + op->len, pin_pages, 1); >> + if (ret) >> + goto error; >> + break; >> + case CPU_OR_OP: >> + case CPU_AND_OP: >> + case CPU_XOR_OP: >> + ret = -EFAULT; >> + expect_fault = op->u.bitwise_op.expect_fault_p; >> + if (!access_ok(VERIFY_WRITE, >> + (void __user *)op->u.bitwise_op.p, >> + op->len)) >> + goto error; >> + ret = cpu_op_pin_pages( >> + (unsigned long)op->u.bitwise_op.p, >> + op->len, pin_pages, 1); >> + if (ret) >> + goto error; >> + break; >> + case CPU_LSHIFT_OP: >> + case CPU_RSHIFT_OP: >> + ret = -EFAULT; >> + expect_fault = op->u.shift_op.expect_fault_p; >> + if (!access_ok(VERIFY_WRITE, >> + (void __user *)op->u.shift_op.p, >> + op->len)) >> + goto error; >> + ret = cpu_op_pin_pages( >> + (unsigned long)op->u.shift_op.p, >> + op->len, pin_pages, 1); >> + if (ret) >> + goto error; >> + break; >> + case CPU_MB_OP: >> + break; >> + default: >> + return -EINVAL; >> + } >> + } >> + return 0; >> + >> +error: >> + for (i = 0; i < pin_pages->nr; i++) >> + put_page(pin_pages->pages[i]); >> + pin_pages->nr = 0; >> + /* >> + * If faulting access is expected, return EAGAIN to user-space. >> + * It allows user-space to distinguish between a fault caused by >> + * an access which is expect to fault (e.g. due to concurrent >> + * unmapping of underlying memory) from an unexpected fault from >> + * which a retry would not recover. >> + */ >> + if (ret == -EFAULT && expect_fault) >> + return -EAGAIN; >> + return ret; >> +} >> + >> +/* Return 0 if same, > 0 if different, < 0 on error. 
*/ >> +static int do_cpu_op_compare_iter(void __user *a, void __user *b, uint32_t len) >> +{ >> + char bufa[TMP_BUFLEN], bufb[TMP_BUFLEN]; >> + uint32_t compared = 0; >> + >> + while (compared != len) { >> + unsigned long to_compare; >> + >> + to_compare = min_t(uint32_t, TMP_BUFLEN, len - compared); >> + if (__copy_from_user_inatomic(bufa, a + compared, to_compare)) >> + return -EFAULT; >> + if (__copy_from_user_inatomic(bufb, b + compared, to_compare)) >> + return -EFAULT; >> + if (memcmp(bufa, bufb, to_compare)) >> + return 1; /* different */ >> + compared += to_compare; >> + } >> + return 0; /* same */ >> +} >> + >> +/* Return 0 if same, > 0 if different, < 0 on error. */ >> +static int do_cpu_op_compare(void __user *a, void __user *b, uint32_t len) >> +{ >> + int ret = -EFAULT; >> + union { >> + uint8_t _u8; >> + uint16_t _u16; >> + uint32_t _u32; >> + uint64_t _u64; >> +#if (BITS_PER_LONG < 64) >> + uint32_t _u64_split[2]; >> +#endif >> + } tmp[2]; >> + >> + pagefault_disable(); >> + switch (len) { >> + case 1: >> + if (__get_user(tmp[0]._u8, (uint8_t __user *)a)) >> + goto end; >> + if (__get_user(tmp[1]._u8, (uint8_t __user *)b)) >> + goto end; >> + ret = !!(tmp[0]._u8 != tmp[1]._u8); >> + break; >> + case 2: >> + if (__get_user(tmp[0]._u16, (uint16_t __user *)a)) >> + goto end; >> + if (__get_user(tmp[1]._u16, (uint16_t __user *)b)) >> + goto end; >> + ret = !!(tmp[0]._u16 != tmp[1]._u16); >> + break; >> + case 4: >> + if (__get_user(tmp[0]._u32, (uint32_t __user *)a)) >> + goto end; >> + if (__get_user(tmp[1]._u32, (uint32_t __user *)b)) >> + goto end; >> + ret = !!(tmp[0]._u32 != tmp[1]._u32); >> + break; >> + case 8: >> +#if (BITS_PER_LONG >= 64) >> + if (__get_user(tmp[0]._u64, (uint64_t __user *)a)) >> + goto end; >> + if (__get_user(tmp[1]._u64, (uint64_t __user *)b)) >> + goto end; >> +#else >> + if (__get_user(tmp[0]._u64_split[0], (uint32_t __user *)a)) >> + goto end; >> + if (__get_user(tmp[0]._u64_split[1], (uint32_t __user *)a + 1)) >> + goto end; >> + if (__get_user(tmp[1]._u64_split[0], (uint32_t __user *)b)) >> + goto end; >> + if (__get_user(tmp[1]._u64_split[1], (uint32_t __user *)b + 1)) >> + goto end; >> +#endif >> + ret = !!(tmp[0]._u64 != tmp[1]._u64); >> + break; >> + default: >> + pagefault_enable(); >> + return do_cpu_op_compare_iter(a, b, len); >> + } >> +end: >> + pagefault_enable(); >> + return ret; >> +} >> + >> +/* Return 0 on success, < 0 on error. */ >> +static int do_cpu_op_memcpy_iter(void __user *dst, void __user *src, >> + uint32_t len) >> +{ >> + char buf[TMP_BUFLEN]; >> + uint32_t copied = 0; >> + >> + while (copied != len) { >> + unsigned long to_copy; >> + >> + to_copy = min_t(uint32_t, TMP_BUFLEN, len - copied); >> + if (__copy_from_user_inatomic(buf, src + copied, to_copy)) >> + return -EFAULT; >> + if (__copy_to_user_inatomic(dst + copied, buf, to_copy)) >> + return -EFAULT; >> + copied += to_copy; >> + } >> + return 0; >> +} >> + >> +/* Return 0 on success, < 0 on error. 
*/ >> +static int do_cpu_op_memcpy(void __user *dst, void __user *src, uint32_t len) >> +{ >> + int ret = -EFAULT; >> + union { >> + uint8_t _u8; >> + uint16_t _u16; >> + uint32_t _u32; >> + uint64_t _u64; >> +#if (BITS_PER_LONG < 64) >> + uint32_t _u64_split[2]; >> +#endif >> + } tmp; >> + >> + pagefault_disable(); >> + switch (len) { >> + case 1: >> + if (__get_user(tmp._u8, (uint8_t __user *)src)) >> + goto end; >> + if (__put_user(tmp._u8, (uint8_t __user *)dst)) >> + goto end; >> + break; >> + case 2: >> + if (__get_user(tmp._u16, (uint16_t __user *)src)) >> + goto end; >> + if (__put_user(tmp._u16, (uint16_t __user *)dst)) >> + goto end; >> + break; >> + case 4: >> + if (__get_user(tmp._u32, (uint32_t __user *)src)) >> + goto end; >> + if (__put_user(tmp._u32, (uint32_t __user *)dst)) >> + goto end; >> + break; >> + case 8: >> +#if (BITS_PER_LONG >= 64) >> + if (__get_user(tmp._u64, (uint64_t __user *)src)) >> + goto end; >> + if (__put_user(tmp._u64, (uint64_t __user *)dst)) >> + goto end; >> +#else >> + if (__get_user(tmp._u64_split[0], (uint32_t __user *)src)) >> + goto end; >> + if (__get_user(tmp._u64_split[1], (uint32_t __user *)src + 1)) >> + goto end; >> + if (__put_user(tmp._u64_split[0], (uint32_t __user *)dst)) >> + goto end; >> + if (__put_user(tmp._u64_split[1], (uint32_t __user *)dst + 1)) >> + goto end; >> +#endif >> + break; >> + default: >> + pagefault_enable(); >> + return do_cpu_op_memcpy_iter(dst, src, len); >> + } >> + ret = 0; >> +end: >> + pagefault_enable(); >> + return ret; >> +} >> + >> +static int op_add_fn(union op_fn_data *data, uint64_t count, uint32_t len) >> +{ >> + int ret = 0; >> + >> + switch (len) { >> + case 1: >> + data->_u8 += (uint8_t)count; >> + break; >> + case 2: >> + data->_u16 += (uint16_t)count; >> + break; >> + case 4: >> + data->_u32 += (uint32_t)count; >> + break; >> + case 8: >> + data->_u64 += (uint64_t)count; >> + break; >> + default: >> + ret = -EINVAL; >> + break; >> + } >> + return ret; >> +} >> + >> +static int op_or_fn(union op_fn_data *data, uint64_t mask, uint32_t len) >> +{ >> + int ret = 0; >> + >> + switch (len) { >> + case 1: >> + data->_u8 |= (uint8_t)mask; >> + break; >> + case 2: >> + data->_u16 |= (uint16_t)mask; >> + break; >> + case 4: >> + data->_u32 |= (uint32_t)mask; >> + break; >> + case 8: >> + data->_u64 |= (uint64_t)mask; >> + break; >> + default: >> + ret = -EINVAL; >> + break; >> + } >> + return ret; >> +} >> + >> +static int op_and_fn(union op_fn_data *data, uint64_t mask, uint32_t len) >> +{ >> + int ret = 0; >> + >> + switch (len) { >> + case 1: >> + data->_u8 &= (uint8_t)mask; >> + break; >> + case 2: >> + data->_u16 &= (uint16_t)mask; >> + break; >> + case 4: >> + data->_u32 &= (uint32_t)mask; >> + break; >> + case 8: >> + data->_u64 &= (uint64_t)mask; >> + break; >> + default: >> + ret = -EINVAL; >> + break; >> + } >> + return ret; >> +} >> + >> +static int op_xor_fn(union op_fn_data *data, uint64_t mask, uint32_t len) >> +{ >> + int ret = 0; >> + >> + switch (len) { >> + case 1: >> + data->_u8 ^= (uint8_t)mask; >> + break; >> + case 2: >> + data->_u16 ^= (uint16_t)mask; >> + break; >> + case 4: >> + data->_u32 ^= (uint32_t)mask; >> + break; >> + case 8: >> + data->_u64 ^= (uint64_t)mask; >> + break; >> + default: >> + ret = -EINVAL; >> + break; >> + } >> + return ret; >> +} >> + >> +static int op_lshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len) >> +{ >> + int ret = 0; >> + >> + switch (len) { >> + case 1: >> + data->_u8 <<= (uint8_t)bits; >> + break; >> + case 2: >> + data->_u16 <<= 
(uint16_t)bits; >> + break; >> + case 4: >> + data->_u32 <<= (uint32_t)bits; >> + break; >> + case 8: >> + data->_u64 <<= (uint64_t)bits; >> + break; >> + default: >> + ret = -EINVAL; >> + break; >> + } >> + return ret; >> +} >> + >> +static int op_rshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len) >> +{ >> + int ret = 0; >> + >> + switch (len) { >> + case 1: >> + data->_u8 >>= (uint8_t)bits; >> + break; >> + case 2: >> + data->_u16 >>= (uint16_t)bits; >> + break; >> + case 4: >> + data->_u32 >>= (uint32_t)bits; >> + break; >> + case 8: >> + data->_u64 >>= (uint64_t)bits; >> + break; >> + default: >> + ret = -EINVAL; >> + break; >> + } >> + return ret; >> +} >> + >> +/* Return 0 on success, < 0 on error. */ >> +static int do_cpu_op_fn(op_fn_t op_fn, void __user *p, uint64_t v, >> + uint32_t len) >> +{ >> + int ret = -EFAULT; >> + union op_fn_data tmp; >> + >> + pagefault_disable(); >> + switch (len) { >> + case 1: >> + if (__get_user(tmp._u8, (uint8_t __user *)p)) >> + goto end; >> + if (op_fn(&tmp, v, len)) >> + goto end; >> + if (__put_user(tmp._u8, (uint8_t __user *)p)) >> + goto end; >> + break; >> + case 2: >> + if (__get_user(tmp._u16, (uint16_t __user *)p)) >> + goto end; >> + if (op_fn(&tmp, v, len)) >> + goto end; >> + if (__put_user(tmp._u16, (uint16_t __user *)p)) >> + goto end; >> + break; >> + case 4: >> + if (__get_user(tmp._u32, (uint32_t __user *)p)) >> + goto end; >> + if (op_fn(&tmp, v, len)) >> + goto end; >> + if (__put_user(tmp._u32, (uint32_t __user *)p)) >> + goto end; >> + break; >> + case 8: >> +#if (BITS_PER_LONG >= 64) >> + if (__get_user(tmp._u64, (uint64_t __user *)p)) >> + goto end; >> +#else >> + if (__get_user(tmp._u64_split[0], (uint32_t __user *)p)) >> + goto end; >> + if (__get_user(tmp._u64_split[1], (uint32_t __user *)p + 1)) >> + goto end; >> +#endif >> + if (op_fn(&tmp, v, len)) >> + goto end; >> +#if (BITS_PER_LONG >= 64) >> + if (__put_user(tmp._u64, (uint64_t __user *)p)) >> + goto end; >> +#else >> + if (__put_user(tmp._u64_split[0], (uint32_t __user *)p)) >> + goto end; >> + if (__put_user(tmp._u64_split[1], (uint32_t __user *)p + 1)) >> + goto end; >> +#endif >> + break; >> + default: >> + ret = -EINVAL; >> + goto end; >> + } >> + ret = 0; >> +end: >> + pagefault_enable(); >> + return ret; >> +} >> + >> +static int __do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt) >> +{ >> + int i, ret; >> + >> + for (i = 0; i < cpuopcnt; i++) { >> + struct cpu_op *op = &cpuop[i]; >> + >> + /* Guarantee a compiler barrier between each operation. */ >> + barrier(); >> + >> + switch (op->op) { >> + case CPU_COMPARE_EQ_OP: >> + ret = do_cpu_op_compare( >> + (void __user *)op->u.compare_op.a, >> + (void __user *)op->u.compare_op.b, >> + op->len); >> + /* Stop execution on error. */ >> + if (ret < 0) >> + return ret; >> + /* >> + * Stop execution, return op index + 1 if comparison >> + * differs. >> + */ >> + if (ret > 0) >> + return i + 1; >> + break; >> + case CPU_COMPARE_NE_OP: >> + ret = do_cpu_op_compare( >> + (void __user *)op->u.compare_op.a, >> + (void __user *)op->u.compare_op.b, >> + op->len); >> + /* Stop execution on error. */ >> + if (ret < 0) >> + return ret; >> + /* >> + * Stop execution, return op index + 1 if comparison >> + * is identical. >> + */ >> + if (ret == 0) >> + return i + 1; >> + break; >> + case CPU_MEMCPY_OP: >> + ret = do_cpu_op_memcpy( >> + (void __user *)op->u.memcpy_op.dst, >> + (void __user *)op->u.memcpy_op.src, >> + op->len); >> + /* Stop execution on error. 
*/ >> + if (ret) >> + return ret; >> + break; >> + case CPU_ADD_OP: >> + ret = do_cpu_op_fn(op_add_fn, >> + (void __user *)op->u.arithmetic_op.p, >> + op->u.arithmetic_op.count, op->len); >> + /* Stop execution on error. */ >> + if (ret) >> + return ret; >> + break; >> + case CPU_OR_OP: >> + ret = do_cpu_op_fn(op_or_fn, >> + (void __user *)op->u.bitwise_op.p, >> + op->u.bitwise_op.mask, op->len); >> + /* Stop execution on error. */ >> + if (ret) >> + return ret; >> + break; >> + case CPU_AND_OP: >> + ret = do_cpu_op_fn(op_and_fn, >> + (void __user *)op->u.bitwise_op.p, >> + op->u.bitwise_op.mask, op->len); >> + /* Stop execution on error. */ >> + if (ret) >> + return ret; >> + break; >> + case CPU_XOR_OP: >> + ret = do_cpu_op_fn(op_xor_fn, >> + (void __user *)op->u.bitwise_op.p, >> + op->u.bitwise_op.mask, op->len); >> + /* Stop execution on error. */ >> + if (ret) >> + return ret; >> + break; >> + case CPU_LSHIFT_OP: >> + ret = do_cpu_op_fn(op_lshift_fn, >> + (void __user *)op->u.shift_op.p, >> + op->u.shift_op.bits, op->len); >> + /* Stop execution on error. */ >> + if (ret) >> + return ret; >> + break; >> + case CPU_RSHIFT_OP: >> + ret = do_cpu_op_fn(op_rshift_fn, >> + (void __user *)op->u.shift_op.p, >> + op->u.shift_op.bits, op->len); >> + /* Stop execution on error. */ >> + if (ret) >> + return ret; >> + break; >> + case CPU_MB_OP: >> + smp_mb(); >> + break; >> + default: >> + return -EINVAL; >> + } >> + } >> + return 0; >> +} >> + >> +static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt, int cpu) >> +{ >> + int ret; >> + >> + if (cpu != raw_smp_processor_id()) { >> + ret = push_task_to_cpu(current, cpu); >> + if (ret) >> + goto check_online; >> + } >> + preempt_disable(); >> + if (cpu != smp_processor_id()) { >> + ret = -EAGAIN; >> + goto end; >> + } >> + ret = __do_cpu_opv(cpuop, cpuopcnt); >> +end: >> + preempt_enable(); >> + return ret; >> + >> +check_online: >> + if (!cpu_possible(cpu)) >> + return -EINVAL; >> + get_online_cpus(); >> + if (cpu_online(cpu)) { >> + ret = -EAGAIN; >> + goto put_online_cpus; >> + } >> + /* >> + * CPU is offline. Perform operation from the current CPU with >> + * cpu_online read lock held, preventing that CPU from coming online, >> + * and with mutex held, providing mutual exclusion against other >> + * CPUs also finding out about an offline CPU. >> + */ >> + mutex_lock(&cpu_opv_offline_lock); >> + ret = __do_cpu_opv(cpuop, cpuopcnt); >> + mutex_unlock(&cpu_opv_offline_lock); >> +put_online_cpus: >> + put_online_cpus(); >> + return ret; >> +} >> + >> +/* >> + * cpu_opv - execute operation vector on a given CPU with preempt off. >> + * >> + * Userspace should pass current CPU number as parameter. May fail with >> + * -EAGAIN if currently executing on the wrong CPU. 
>> + */ >> +SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt, >> + int, cpu, int, flags) >> +{ >> + struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX]; >> + struct page *pinned_pages_on_stack[NR_PINNED_PAGES_ON_STACK]; >> + struct cpu_opv_pinned_pages pin_pages = { >> + .pages = pinned_pages_on_stack, >> + .nr = 0, >> + .is_kmalloc = false, >> + }; >> + int ret, i; >> + >> + if (unlikely(flags)) >> + return -EINVAL; >> + if (unlikely(cpu < 0)) >> + return -EINVAL; >> + if (cpuopcnt < 0 || cpuopcnt > CPU_OP_VEC_LEN_MAX) >> + return -EINVAL; >> + if (copy_from_user(cpuopv, ucpuopv, cpuopcnt * sizeof(struct cpu_op))) >> + return -EFAULT; >> + ret = cpu_opv_check(cpuopv, cpuopcnt); >> + if (ret) >> + return ret; >> + ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &pin_pages); >> + if (ret) >> + goto end; >> + ret = do_cpu_opv(cpuopv, cpuopcnt, cpu); >> + for (i = 0; i < pin_pages.nr; i++) >> + put_page(pin_pages.pages[i]); >> +end: >> + if (pin_pages.is_kmalloc) >> + kfree(pin_pages.pages); >> + return ret; >> +} >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index 6bba05f47e51..e547f93a46c2 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -1052,6 +1052,43 @@ void do_set_cpus_allowed(struct task_struct *p, const >> struct cpumask *new_mask) >> set_curr_task(rq, p); >> } >> >> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu) >> +{ >> + struct rq_flags rf; >> + struct rq *rq; >> + int ret = 0; >> + >> + rq = task_rq_lock(p, &rf); >> + update_rq_clock(rq); >> + >> + if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) { >> + ret = -EINVAL; >> + goto out; >> + } >> + >> + if (task_cpu(p) == dest_cpu) >> + goto out; >> + >> + if (task_running(rq, p) || p->state == TASK_WAKING) { >> + struct migration_arg arg = { p, dest_cpu }; >> + /* Need help from migration thread: drop lock and wait. */ >> + task_rq_unlock(rq, p, &rf); >> + stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg); >> + tlb_migrate_finish(p->mm); >> + return 0; >> + } else if (task_on_rq_queued(p)) { >> + /* >> + * OK, since we're going to drop the lock immediately >> + * afterwards anyway. >> + */ >> + rq = move_queued_task(rq, &rf, p, dest_cpu); >> + } >> +out: >> + task_rq_unlock(rq, p, &rf); >> + >> + return ret; >> +} >> + >> /* >> * Change a given task's CPU affinity. Migrate the thread to a >> * proper CPU and schedule it away if the CPU it's executing on >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >> index 3b448ba82225..cab256c1720a 100644 >> --- a/kernel/sched/sched.h >> +++ b/kernel/sched/sched.h >> @@ -1209,6 +1209,8 @@ static inline void __set_task_cpu(struct task_struct *p, >> unsigned int cpu) >> #endif >> } >> >> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu); >> + >> /* >> * Tunables that become constants when CONFIG_SCHED_DEBUG is off: >> */ >> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c >> index bfa1ee1bf669..59e622296dc3 100644 >> --- a/kernel/sys_ni.c >> +++ b/kernel/sys_ni.c >> @@ -262,3 +262,4 @@ cond_syscall(sys_pkey_free); >> >> /* restartable sequence */ >> cond_syscall(sys_rseq); >> +cond_syscall(sys_cpu_opv); >> -- >> 2.11.0 >> >> >> > > > > -- > Michael Kerrisk > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ > Linux/UNIX System Programming Training: http://man7.org/training/ -- Mathieu Desnoyers EfficiOS Inc. 
http://www.efficios.com
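For completeness, a minimal user-space caller of cpu_opv() used as a per-cpu counter fallback (a sketch only, not taken from the patch series: it assumes the uapi header above, an allocated __NR_cpu_opv syscall number, glibc's sched_getcpu(), and made-up names):

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/cpu_opv.h>	/* uapi header added by this patch */

/*
 * Add "count" to the 64-bit counter of the CPU the caller currently
 * runs on. "counters" is indexed by CPU number. Returns 0 on success,
 * -1 on unexpected error.
 */
static int percpu_counter_add(int64_t *counters, int64_t count)
{
	for (;;) {
		int cpu = sched_getcpu();
		struct cpu_op op;

		if (cpu < 0)
			return -1;
		memset(&op, 0, sizeof(op));	/* zero expect_fault_p and padding */
		op.op = CPU_ADD_OP;
		op.len = sizeof(int64_t);
		CPU_OP_FIELD_u32_u64_INIT_ONSTACK(op.u.arithmetic_op.p, &counters[cpu]);
		op.u.arithmetic_op.count = count;
		if (!syscall(__NR_cpu_opv, &op, 1, cpu, 0))
			return 0;
		if (errno == EAGAIN)
			continue;	/* migration or hotplug race: re-read the CPU and retry */
		return -1;		/* e.g. EINVAL or unexpected EFAULT */
	}
}

In practice this would be the slow path behind an rseq fast path; the retry loop mirrors the EAGAIN semantics described in the patch.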