[RFC] io_uring: add restrictions to support untrusted applications and guests

Hi Jens,
Stefan and I have a proposal to share with the io_uring community.
Before implementing it, we would like to discuss it to gather feedback and
to see whether it could be accepted:

Adding restrictions to io_uring
===============================
The io_uring API provides submission and completion queues for performing
asynchronous I/O operations. The queues are located in memory that is
accessible to both the host userspace application and the kernel, making it
possible to monitor for activity through polling instead of system calls. This
design offers good performance, which makes exposing io_uring to guests an
attractive way to improve I/O performance in virtualization.

PoC and preliminary benchmarks
------------------------------
We implemented a PoC, using QEMU and a virtio-blk device, to share the
io_uring CQ and SQ rings with the guest.
QEMU initializes io_uring, registers the device (NVMe) fd through
io_uring_register(2), and maps the rings in the guest memory.
The virtio-blk driver uses these rings to send requests instead of using
the standard virtqueues.

The PoC implements a pure polling solution where the application is polling
(IOPOLL enabled) in the guest and the sqpoll_kthread is polling in the host
(SQPOLL and IOPOLL enabled).

These are the encouraging results we obtained from this preliminary work;
we used fio (rw=randread bs=4k) to measure the kIOPS on an NVMe device:

- bare-metal
                                                       iodepth
  | fio ioengine                              |  1  |  8  |  16 |  32 |
  |-------------------------------------------|----:|----:|----:|----:|
  | io_uring (SQPOLL + IOPOLL)                | 119 | 550 | 581 | 585 |
  | io_uring (IOPOLL)                         | 122 | 502 | 519 | 538 |

- QEMU/KVM guest (aio=io_uring)
                                                       iodepth
  | virtio-blk            | fio ioengine      |  1  |  8  |  16 |  32 |
  |-----------------------|-------------------|----:|----:|----:|----:|
  | virtqueues            | io_uring (IOPOLL) |  27 | 144 | 209 | 266 |
  | virtqueues + iothread | io_uring (IOPOLL) |  73 | 264 | 306 | 312 |
  | io_uring passthrough  | io_uring (IOPOLL) | 104 | 532 | 577 | 585 |

  All guest experiments use the QEMU io_uring backend with SQPOLL and
  IOPOLL enabled. The virtio-blk driver is modified to support block io_poll
  on both virtqueues and io_uring passthrough.

Before developing this proof-of-concept further we would like to discuss
io_uring changes required to restrict rings since this mechanism is a
prerequisite for real-world use cases where guests are untrusted.

Restrictions
------------
This document proposes io_uring API changes that safely allow untrusted
applications or guests to use io_uring. io_uring's existing security model is
that of kernel system call handler code. It is designed to reject invalid
inputs from host userspace applications. Supporting guests as io_uring API
clients adds a new trust domain with access to even fewer resources than host
userspace applications.

Guests do not have direct access to host userspace application file descriptors
or memory. The host userspace application, a Virtual Machine Monitor (VMM) such
as QEMU, grants access to a subset of its file descriptors and memory. The
allowed file descriptors are typically the disk image files belonging to the
guest. The memory is typically the virtual machine's RAM that the VMM has
allocated on behalf of the guest.

The following extensions to the io_uring API allow the host application to
grant access to some of its file descriptors.

These extensions are designed to be applicable to other use cases besides
untrusted guests and are not virtualization-specific. For example, the
restrictions can be used to make only a subset of sqe operations available to
an application, similar to seccomp syscall whitelisting.

An address translation and memory restriction mechanism would also be
necessary, but we can discuss this later.

The IORING_REGISTER_RESTRICTIONS opcode
---------------------------------------
The new io_uring_register(2) IORING_REGISTER_RESTRICTIONS opcode permanently
installs a feature whitelist on an io_ring_ctx. The io_ring_ctx can then be
passed to untrusted code with the knowledge that only operations present in the
whitelist can be executed.

The whitelist approach ensures that new features added to io_uring do not
accidentally become available when an existing application is launched on a
newer kernel version.

The IORING_REGISTER_RESTRICTIONS opcode takes an array of struct
io_uring_restriction elements that describe whitelisted features:

  #define IORING_REGISTER_RESTRICTIONS 11

  /* struct io_uring_restriction::opcode values */
  enum {
      /* Allow an io_uring_register(2) opcode */
      IORING_RESTRICTION_REGISTER_OP,

      /* Allow an sqe opcode */
      IORING_RESTRICTION_SQE_OP,

      /* Only allow fixed files */
      IORING_RESTRICTION_FIXED_FILES_ONLY,

      /* Only allow registered addresses and translate them */
      IORING_RESTRICTION_BUFFER_CHECK
  };

  struct io_uring_restriction {
      __u16 opcode;
      union {
          __u8 register_op; /* IORING_RESTRICTION_REGISTER_OP */
          __u8 sqe_op;      /* IORING_RESTRICTION_SQE_OP */
      };
      __u8 resv;
      __u32 resv2[3];
  };

This call can be made only once per io_ring_ctx. Afterwards the restrictions
cannot be changed, which prevents untrusted code from removing them.
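
To make the one-shot semantics concrete, here is a rough userspace model of how
the registration could behave. It is only a sketch under assumptions: every
model_* / MODEL_* name is hypothetical, the 64-bit bitmaps assume fewer than 64
opcodes, and the -EBUSY / -EINVAL return values are our guess, not something
this proposal fixes.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the proposed restriction opcodes. */
enum {
    MODEL_RESTRICTION_REGISTER_OP,
    MODEL_RESTRICTION_SQE_OP,
    MODEL_RESTRICTION_FIXED_FILES_ONLY,
    MODEL_RESTRICTION_BUFFER_CHECK
};

/* Simplified stand-in for struct io_uring_restriction. */
struct model_restriction {
    uint16_t opcode;
    uint8_t  op;        /* register_op or sqe_op, depending on opcode */
};

/* Simplified stand-in for the restriction state kept per ring context. */
struct model_ctx {
    bool     restricted;        /* true once a whitelist is installed */
    uint64_t register_ops;      /* bitmap of allowed register opcodes */
    uint64_t sqe_ops;           /* bitmap of allowed sqe opcodes */
    bool     fixed_files_only;
    bool     buffer_check;
};

/*
 * Install the whitelist. The context is updated only if the whole array is
 * valid, and a second call is refused (error codes are assumptions).
 */
int model_register_restrictions(struct model_ctx *ctx,
                                const struct model_restriction *res,
                                size_t nr)
{
    struct model_ctx tmp = *ctx;
    size_t i;

    if (ctx->restricted)
        return -EBUSY;

    for (i = 0; i < nr; i++) {
        switch (res[i].opcode) {
        case MODEL_RESTRICTION_REGISTER_OP:
            if (res[i].op >= 64)
                return -EINVAL;
            tmp.register_ops |= UINT64_C(1) << res[i].op;
            break;
        case MODEL_RESTRICTION_SQE_OP:
            if (res[i].op >= 64)
                return -EINVAL;
            tmp.sqe_ops |= UINT64_C(1) << res[i].op;
            break;
        case MODEL_RESTRICTION_FIXED_FILES_ONLY:
            tmp.fixed_files_only = true;
            break;
        case MODEL_RESTRICTION_BUFFER_CHECK:
            tmp.buffer_check = true;
            break;
        default:
            return -EINVAL;
        }
    }

    tmp.restricted = true;
    *ctx = tmp;
    return 0;
}
```

In this model a failed registration leaves the context untouched, so the caller
can fix the array and retry; only a successful call locks the whitelist.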

Limiting access to io_uring operations
--------------------------------------
The following example shows how to whitelist IORING_OP_READV, IORING_OP_WRITEV,
and IORING_OP_FSYNC:

  struct io_uring_restriction restrictions[] = {
      {
          .opcode = IORING_RESTRICTION_SQE_OP,
          .sqe_op = IORING_OP_READV,
      },
      {
          .opcode = IORING_RESTRICTION_SQE_OP,
          .sqe_op = IORING_OP_WRITEV,
      },
      {
          .opcode = IORING_RESTRICTION_SQE_OP,
          .sqe_op = IORING_OP_FSYNC,
      },
      ...
  };

  io_uring_register(ringfd, IORING_REGISTER_RESTRICTIONS,
                    restrictions, ARRAY_SIZE(restrictions));
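
As an illustration of what the whitelist means at submission time, the check
could look roughly like the following userspace model. This is a sketch, not
kernel code: the model_* functions, the MODEL_OP_* values, and the 64-bit
bitmap are all hypothetical and only stand in for the real sqe opcodes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical opcode numbers used only by this model. */
enum {
    MODEL_OP_NOP,
    MODEL_OP_READV,
    MODEL_OP_WRITEV,
    MODEL_OP_FSYNC,
    MODEL_OP_NEWER      /* stands for an opcode added by a later kernel */
};

/* One bit per allowed sqe opcode; assumes fewer than 64 opcodes. */
struct model_whitelist {
    uint64_t sqe_ops;
};

/* Add one sqe opcode to the whitelist; out-of-range values are ignored. */
void model_allow_sqe_op(struct model_whitelist *wl, uint8_t op)
{
    if (op < 64)
        wl->sqe_ops |= UINT64_C(1) << op;
}

/* An opcode is allowed only if its bit was explicitly set. */
bool model_sqe_op_allowed(const struct model_whitelist *wl, uint8_t op)
{
    return op < 64 && (wl->sqe_ops & (UINT64_C(1) << op)) != 0;
}
```

With READV, WRITEV and FSYNC allowed, MODEL_OP_NOP and MODEL_OP_NEWER are both
rejected, which is the property the whitelist approach is meant to give:
anything not listed stays unavailable by default.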

Limiting access to file descriptors
-----------------------------------
The fixed files mechanism can be used to limit access to a set of file
descriptors:

  struct io_uring_restriction restrictions[] = {
      {
          .opcode = IORING_RESTRICTION_FIXED_FILES_ONLY,
      },
      ...
  };

  io_uring_register(ringfd, IORING_REGISTER_RESTRICTIONS,
                    restrictions, ARRAY_SIZE(restrictions));

Only requests with the sqe->flags IOSQE_FIXED_FILE bit set will be allowed.
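
The effect on file lookup can be sketched with a small userspace model. All
names here (model_sqe, model_ring, model_resolve_fd, MODEL_IOSQE_FIXED_FILE)
are hypothetical, and the -EACCES / -EBADF error codes are assumptions made
for the example rather than part of the proposal.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the IOSQE_FIXED_FILE flag bit. */
#define MODEL_IOSQE_FIXED_FILE (1U << 0)

/* Only the sqe fields that matter for this check. */
struct model_sqe {
    uint8_t flags;
    int32_t fd;         /* table index if the fixed-file flag is set */
};

/* Ring state: the registered file table and the restriction flag. */
struct model_ring {
    bool        fixed_files_only;
    const int  *files;      /* fds registered by the host application */
    unsigned    nr_files;
};

/*
 * Return the host fd the request may use, or a negative errno.
 * With fixed_files_only set, a raw fd is never accepted.
 */
int model_resolve_fd(const struct model_ring *ring,
                     const struct model_sqe *sqe)
{
    if (sqe->flags & MODEL_IOSQE_FIXED_FILE) {
        if (sqe->fd < 0 || (unsigned)sqe->fd >= ring->nr_files)
            return -EBADF;      /* index outside the registered table */
        return ring->files[sqe->fd];
    }

    if (ring->fixed_files_only)
        return -EACCES;         /* raw fds are refused when restricted */

    return sqe->fd;
}
```

The point of the model is that untrusted code can only name an index into the
table the host registered, so it cannot reach any other file descriptor of the
host application.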


Thanks for your feedback,
Stefano



