Re: [RFC][PATCH 00/10] Mount, FS, Block and Keyrings notifications [ver #3]

On 6/6/19 5:41 AM, David Howells wrote:

Hi Al,

Here's a set of patches to add a general variable-length notification queue
concept and to add sources of events for:

  (1) Mount topology events, such as mounting, unmounting, mount expiry,
      mount reconfiguration.

  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
      errors (not complete yet).

  (3) Key/keyring events, such as creating, linking and removal of keys.

  (4) General device events (single common queue) including:

      - Block layer events, such as device errors

      - USB subsystem events, such as device/bus attach/remove, device
        reset, device errors.

One of the reasons for this is so that processes no longer have to
repeatedly and regularly scan /proc/mounts, which has proven to be a
system performance problem.  To further aid this, the fsinfo() syscall, on
which this patch series depends, provides a way to access superblock and
mount information in binary form without the need to parse /proc/mounts.


LSM support is included, but controversial:

  (1) The creds of the process that did the fput() that reduced the refcount
      to zero are cached in the file struct.

  (2) __fput() overrides the current creds with the creds from (1) whilst
      doing the cleanup, thereby making sure that the creds seen by the
      destruction notification generated by mntput() appear to come from
      the last fputter.

  (3) security_post_notification() is called for each queue that we might
      want to post a notification into, thereby allowing the LSM to prevent
      covert communications.

  (?) Do I need to add security_set_watch(), say, to rule on whether a watch
      may be set in the first place?  I might need to add a variant per
      watch-type.

  (?) Do I really need to keep track of the process creds in which an
      implicit object destruction happened?  For example, imagine you create
      an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
      refers to on close unless move_mount() clears that flag.  Now, imagine
      someone looking at that fd through procfs at the same time as you exit
      due to an error.  The LSM sees the destruction notification come from
      the looker if they happen to do their fput() after yours.


I'm not in favor of this approach. Can we check permission to the object being watched when a watch is set (read-like access), make sure every access that can trigger a notification requires a (write-like) permission to the accessed object, and make sure there is some sane way to control the relationship between the accessed object and the watched object (write-like)? For cases where we have no object per se or at least no security structure/label associated with it, we may have to fall back to a coarse-grained "Can the watcher get this kind of notification in general?".



Design decisions:

  (1) A misc chardev is used to create and open a ring buffer:

	fd = open("/dev/watch_queue", O_RDWR);

      which is then configured and mmap'd into userspace:

	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
	buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

      The fd cannot be read or written (though there is a facility to use
      write to inject records for debugging) and userspace just pulls data
      directly out of the buffer.

  (2) The ring index pointers are stored inside the ring and are thus
      accessible to userspace.  Userspace should only ever update the tail
      pointer, never the head pointer, or it risks breaking the buffer.  The
      kernel checks that the pointers appear valid before trying to use
      them.  A 'skip' record is maintained around the pointers.

  (3) poll() can be used to wait for data to appear in the buffer.

  (4) Records in the buffer are binary, typed and have a length so that they
      can be of varying size.

      This means that multiple heterogeneous sources can share a common
      buffer.  Tags may be specified when a watchpoint is created to help
      distinguish the sources.

  (5) The queue is reusable as there are 16 million types available, of
      which I've used 4, so there is scope for others to be used.

  (6) Records are filterable as types have up to 256 subtypes that can be
      individually filtered.  Other filtration is also available.

  (7) Each time the buffer is opened, a new buffer is created - this means
      that there's no interference between watchers.

  (8) When recording a notification, the kernel will not sleep, but will
      rather mark a queue as overrun if there's insufficient space, thereby
      avoiding userspace causing the kernel to hang.

  (9) The 'watchpoint' should be specific where possible, meaning that you
      specify the object that you want to watch.

 (10) The buffer is created and then watchpoints are attached to it, using
      one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
	mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
	sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);

      where in all three cases, fd indicates the queue and the number after
      is a tag between 0 and 255.

 (11) The watch must be removed if either the watch buffer is destroyed or
      the watched object is destroyed.
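
As a rough illustration of points (2), (4), (6) and (8) above, here is a
userspace-only sketch of a ring that carries variable-length typed records,
applies a per-subtype filter and marks itself as overrun instead of waiting
when space runs out.  Everything in it is an assumption made for the example:
the struct layouts, the names (rec_hdr, subtype_filter, ring_post, ring_pop)
and the sizes are hypothetical and are not taken from the patches.  It is
single-threaded, so it omits the memory barriers that a ring shared between
the kernel and userspace would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE  8u			/* bytes per ring slot (assumed) */
#define RING_SLOTS 16u			/* must be a power of two (assumed) */
#define RING_MASK  (RING_SLOTS - 1u)

/* Hypothetical record header occupying exactly one slot. */
struct rec_hdr {
	uint32_t type_subtype;		/* (24-bit type << 8) | 8-bit subtype */
	uint8_t  tag;			/* watch tag, 0..255 */
	uint8_t  nslots;		/* record length in slots, header included */
	uint16_t flags;			/* unused in this sketch */
};
_Static_assert(sizeof(struct rec_hdr) == SLOT_SIZE, "header must fill one slot");

/* Hypothetical filter: one bit for each of the 256 subtypes of one type. */
struct subtype_filter {
	uint32_t type;
	uint32_t bits[8];		/* 8 * 32 = 256 subtype bits */
};

struct ring {
	uint32_t head;			/* producer index, free-running */
	uint32_t tail;			/* consumer index, free-running */
	bool     overrun;		/* set when a record had to be dropped */
	struct rec_hdr slot[RING_SLOTS];
};

bool filter_allows(const struct subtype_filter *f, uint32_t type, uint8_t subtype)
{
	return f->type == type &&
	       ((f->bits[subtype >> 5] >> (subtype & 31u)) & 1u) != 0;
}

/* Producer side: never waits; drops the record and flags overrun instead. */
bool ring_post(struct ring *r, const struct subtype_filter *f,
	       uint32_t type, uint8_t subtype, uint8_t tag,
	       const void *payload, size_t len)
{
	uint32_t used = r->head - r->tail;
	const uint8_t *p = payload;
	uint32_t need;

	if (!filter_allows(f, type, subtype))
		return false;		/* filtered out; not an overrun */
	if (used > RING_SLOTS)
		return false;		/* indices look corrupt; leave the ring alone */
	need = len > (RING_SLOTS - 1u) * SLOT_SIZE ? RING_SLOTS + 1u
		: 1u + (uint32_t)((len + SLOT_SIZE - 1u) / SLOT_SIZE);
	if (need > RING_SLOTS - used) {
		r->overrun = true;	/* no room: drop rather than wait */
		return false;
	}
	r->slot[r->head & RING_MASK] = (struct rec_hdr){
		.type_subtype = ((type & 0xffffffu) << 8) | subtype,
		.tag = tag, .nslots = (uint8_t)need };
	for (uint32_t i = 1; i < need; i++) {	/* payload may wrap around the ring */
		size_t chunk = len < SLOT_SIZE ? len : SLOT_SIZE;
		void *dst = &r->slot[(r->head + i) & RING_MASK];

		memset(dst, 0, SLOT_SIZE);
		memcpy(dst, p, chunk);
		p += chunk;
		len -= chunk;
	}
	r->head += need;
	return true;
}

/* Consumer side: copy out the next header, skip the payload, advance the tail. */
bool ring_pop(struct ring *r, struct rec_hdr *out)
{
	if (r->head == r->tail)
		return false;		/* ring is empty */
	*out = r->slot[r->tail & RING_MASK];
	r->tail += out->nslots;		/* the consumer only ever moves the tail */
	return true;
}
```

A consumer would loop on ring_pop() and dispatch on the tag and on the
type/subtype pair; the length stored in each header is what lets records
from different sources sit next to each other in the same buffer.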


Things I want to avoid:

  (1) Introducing features that make the core VFS dependent on the network
      stack or networking namespaces (ie. usage of netlink).

  (2) Dumping all this stuff into dmesg and having a daemon that sits there
      parsing the output and distributing it, as this then puts the
      responsibility for security into userspace and makes handling
      namespaces tricky.  Further, dmesg might not exist or might be
      inaccessible inside a container.

  (3) Letting users see events they shouldn't be able to see.


Further things that could be considered:

  (1) Adding a keyctl call to allow a watch on a keyring to be extended to
      "children" of that keyring, such that the watch is removed from the
      child if it is unlinked from the keyring.

  (2) Adding a global superblock event queue.

  (3) Propagating watches to child superblocks over automounts.


The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications

Changes:

  v3: I've added a USB notification source and reformulated the block
      notification source so that there's now a common watch list, for which
      the system call is now device_notify().

      I've assigned a pair of unused ioctl numbers in the 'W' series to the
      ioctls added by this series.

      I've also added a description of the kernel API to the documentation.

  v2: I've fixed various issues raised by Jann Horn and GregKH and moved to
      krefs for refcounting.  I've added some security features to try and
      give Casey Schaufler the LSM control he wants.

David
---
David Howells (10):
       security: Override creds in __fput() with last fputter's creds
       General notification queue with user mmap()'able ring buffer
       keys: Add a notification facility
       vfs: Add a mount-notification facility
       vfs: Add superblock notifications
       fsinfo: Export superblock notification counter
       Add a general, global device notification watch list
       block: Add block layer notifications
       usb: Add USB subsystem notifications
       Add sample notification program


  Documentation/ioctl/ioctl-number.txt   |    1
  Documentation/security/keys/core.rst   |   58 ++
  Documentation/watch_queue.rst          |  492 ++++++++++++++++++
  arch/x86/entry/syscalls/syscall_32.tbl |    3
  arch/x86/entry/syscalls/syscall_64.tbl |    3
  block/Kconfig                          |    9
  block/blk-core.c                       |   29 +
  drivers/base/Kconfig                   |    9
  drivers/base/Makefile                  |    1
  drivers/base/notify.c                  |   82 +++
  drivers/misc/Kconfig                   |   13
  drivers/misc/Makefile                  |    1
  drivers/misc/watch_queue.c             |  889 ++++++++++++++++++++++++++++++++
  drivers/usb/core/Kconfig               |   10
  drivers/usb/core/devio.c               |   55 ++
  drivers/usb/core/hub.c                 |    3
  fs/Kconfig                             |   21 +
  fs/Makefile                            |    1
  fs/file_table.c                        |   12
  fs/fsinfo.c                            |   12
  fs/mount.h                             |   33 +
  fs/mount_notify.c                      |  180 ++++++
  fs/namespace.c                         |    9
  fs/super.c                             |  116 ++++
  include/linux/blkdev.h                 |   15 +
  include/linux/dcache.h                 |    1
  include/linux/device.h                 |    7
  include/linux/fs.h                     |   79 +++
  include/linux/key.h                    |    4
  include/linux/lsm_hooks.h              |   15 +
  include/linux/security.h               |   14 +
  include/linux/syscalls.h               |    5
  include/linux/usb.h                    |   19 +
  include/linux/watch_queue.h            |   87 +++
  include/uapi/linux/fsinfo.h            |   10
  include/uapi/linux/keyctl.h            |    1
  include/uapi/linux/watch_queue.h       |  213 ++++++++
  kernel/sys_ni.c                        |    7
  mm/interval_tree.c                     |    2
  mm/memory.c                            |    1
  samples/Kconfig                        |    6
  samples/Makefile                       |    1
  samples/vfs/test-fsinfo.c              |   13
  samples/watch_queue/Makefile           |    9
  samples/watch_queue/watch_test.c       |  310 +++++++++++
  security/keys/Kconfig                  |   10
  security/keys/compat.c                 |    2
  security/keys/gc.c                     |    5
  security/keys/internal.h               |   30 +
  security/keys/key.c                    |   37 +
  security/keys/keyctl.c                 |   88 +++
  security/keys/keyring.c                |   17 -
  security/keys/request_key.c            |    4
  security/security.c                    |    9
  54 files changed, 3025 insertions(+), 38 deletions(-)
  create mode 100644 Documentation/watch_queue.rst
  create mode 100644 drivers/base/notify.c
  create mode 100644 drivers/misc/watch_queue.c
  create mode 100644 fs/mount_notify.c
  create mode 100644 include/linux/watch_queue.h
  create mode 100644 include/uapi/linux/watch_queue.h
  create mode 100644 samples/watch_queue/Makefile
  create mode 100644 samples/watch_queue/watch_test.c




