On 6/10/2019 8:21 AM, Stephen Smalley wrote: > On 6/7/19 10:17 AM, David Howells wrote: >> >> Hi Al, >> >> Here's a set of patches to add a general variable-length notification queue >> concept and to add sources of events for: >> >> (1) Mount topology events, such as mounting, unmounting, mount expiry, >> mount reconfiguration. >> >> (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O >> errors (not complete yet). >> >> (3) Key/keyring events, such as creating, linking and removal of keys. >> >> (4) General device events (single common queue) including: >> >> - Block layer events, such as device errors >> >> - USB subsystem events, such as device/bus attach/remove, device >> reset, device errors. >> >> One of the reasons for this is so that we can remove the issue of processes >> having to repeatedly and regularly scan /proc/mounts, which has proven to >> be a system performance problem. To further aid this, the fsinfo() syscall >> on which this patch series depends, provides a way to access superblock and >> mount information in binary form without the need to parse /proc/mounts. >> >> >> LSM support is included, but controversial: >> >> (1) The creds of the process that did the fput() that reduced the refcount >> to zero are cached in the file struct. >> >> (2) __fput() overrides the current creds with the creds from (1) whilst >> doing the cleanup, thereby making sure that the creds seen by the >> destruction notification generated by mntput() appears to come from >> the last fputter. >> >> (3) security_post_notification() is called for each queue that we might >> want to post a notification into, thereby allowing the LSM to prevent >> covert communications. >> >> (?) Do I need to add security_set_watch(), say, to rule on whether a watch >> may be set in the first place? I might need to add a variant per >> watch-type. >> >> (?) Do I really need to keep track of the process creds in which an >> implicit object destruction happened? For example, imagine you create >> an fd with fsopen()/fsmount(). It is marked to dissolve the mount it >> refers to on close unless move_mount() clears that flag. Now, imagine >> someone looking at that fd through procfs at the same time as you exit >> due to an error. The LSM sees the destruction notification come from >> the looker if they happen to do their fput() after yours. > > I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea. > > For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type. > > I wouldn't expect SELinux to implement security_post_notification() at all. I can't see how one can construct a meaningful, stable policy for it. I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it. I cannot agree. There is an explicit action by a subject that results in information being delivered to an object. Just like a signal or a UDP packet delivery. Smack handles this kind of thing just fine. The internal mechanism that results in the access is irrelevant from this viewpoint. I can understand how a mechanism like SELinux that works on finer granularity might view it differently. > >> >> >> Design decisions: >> >> (1) A misc chardev is used to create and open a ring buffer: >> >> fd = open("/dev/watch_queue", O_RDWR); >> >> which is then configured and mmap'd into userspace: >> >> ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE); >> ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter); >> buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE, >> MAP_SHARED, fd, 0); >> >> The fd cannot be read or written (though there is a facility to use >> write to inject records for debugging) and userspace just pulls data >> directly out of the buffer. >> >> (2) The ring index pointers are stored inside the ring and are thus >> accessible to userspace. Userspace should only update the tail >> pointer and never the head pointer or risk breaking the buffer. The >> kernel checks that the pointers appear valid before trying to use >> them. A 'skip' record is maintained around the pointers. >> >> (3) poll() can be used to wait for data to appear in the buffer. >> >> (4) Records in the buffer are binary, typed and have a length so that they >> can be of varying size. >> >> This means that multiple heterogeneous sources can share a common >> buffer. Tags may be specified when a watchpoint is created to help >> distinguish the sources. >> >> (5) The queue is reusable as there are 16 million types available, of >> which I've used 4, so there is scope for others to be used. >> >> (6) Records are filterable as types have up to 256 subtypes that can be >> individually filtered. Other filtration is also available. >> >> (7) Each time the buffer is opened, a new buffer is created - this means >> that there's no interference between watchers. >> >> (8) When recording a notification, the kernel will not sleep, but will >> rather mark a queue as overrun if there's insufficient space, thereby >> avoiding userspace causing the kernel to hang. >> >> (9) The 'watchpoint' should be specific where possible, meaning that you >> specify the object that you want to watch. >> >> (10) The buffer is created and then watchpoints are attached to it, using >> one of: >> >> keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01); >> mount_notify(AT_FDCWD, "/", 0, fd, 0x02); >> sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03); >> >> where in all three cases, fd indicates the queue and the number after >> is a tag between 0 and 255. >> >> (11) The watch must be removed if either the watch buffer is destroyed or >> the watched object is destroyed. >> >> >> Things I want to avoid: >> >> (1) Introducing features that make the core VFS dependent on the network >> stack or networking namespaces (ie. usage of netlink). >> >> (2) Dumping all this stuff into dmesg and having a daemon that sits there >> parsing the output and distributing it as this then puts the >> responsibility for security into userspace and makes handling >> namespaces tricky. Further, dmesg might not exist or might be >> inaccessible inside a container. >> >> (3) Letting users see events they shouldn't be able to see. >> >> >> Further things that could be considered: >> >> (1) Adding a keyctl call to allow a watch on a keyring to be extended to >> "children" of that keyring, such that the watch is removed from the >> child if it is unlinked from the keyring. >> >> (2) Adding global superblock event queue. >> >> (3) Propagating watches to child superblock over automounts. >> >> >> The patches can be found here also: >> >> http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications >> >> Changes: >> >> v4: Split the basic UAPI bits out into their own patch and then split the >> LSM hooks out into an intermediate patch. Add LSM hooks for setting >> watches. >> >> Rename the *_notify() system calls to watch_*() for consistency. >> >> v3: I've added a USB notification source and reformulated the block >> notification source so that there's now a common watch list, for which >> the system call is now device_notify(). >> >> I've assigned a pair of unused ioctl numbers in the 'W' series to the >> ioctls added by this series. >> >> I've also added a description of the kernel API to the documentation. >> >> v2: I've fixed various issues raised by Jann Horn and GregKH and moved to >> krefs for refcounting. I've added some security features to try and >> give Casey Schaufler the LSM control he wants. >> >> David >> --- >> David Howells (13): >> security: Override creds in __fput() with last fputter's creds >> uapi: General notification ring definitions >> security: Add hooks to rule on setting a watch >> security: Add a hook for the point of notification insertion >> General notification queue with user mmap()'able ring buffer >> keys: Add a notification facility >> vfs: Add a mount-notification facility >> vfs: Add superblock notifications >> fsinfo: Export superblock notification counter >> Add a general, global device notification watch list >> block: Add block layer notifications >> usb: Add USB subsystem notifications >> Add sample notification program >> >> >> Documentation/ioctl/ioctl-number.txt | 1 >> Documentation/security/keys/core.rst | 58 ++ >> Documentation/watch_queue.rst | 492 ++++++++++++++++++ >> arch/x86/entry/syscalls/syscall_32.tbl | 3 >> arch/x86/entry/syscalls/syscall_64.tbl | 3 >> block/Kconfig | 9 >> block/blk-core.c | 29 + >> drivers/base/Kconfig | 9 >> drivers/base/Makefile | 1 >> drivers/base/watch.c | 89 +++ >> drivers/misc/Kconfig | 13 >> drivers/misc/Makefile | 1 >> drivers/misc/watch_queue.c | 889 ++++++++++++++++++++++++++++++++ >> drivers/usb/core/Kconfig | 10 >> drivers/usb/core/devio.c | 55 ++ >> drivers/usb/core/hub.c | 3 >> fs/Kconfig | 21 + >> fs/Makefile | 1 >> fs/file_table.c | 12 >> fs/fsinfo.c | 12 >> fs/mount.h | 33 + >> fs/mount_notify.c | 187 +++++++ >> fs/namespace.c | 9 >> fs/super.c | 122 ++++ >> include/linux/blkdev.h | 15 + >> include/linux/dcache.h | 1 >> include/linux/device.h | 7 >> include/linux/fs.h | 79 +++ >> include/linux/key.h | 4 >> include/linux/lsm_hooks.h | 48 ++ >> include/linux/security.h | 35 + >> include/linux/syscalls.h | 5 >> include/linux/usb.h | 19 + >> include/linux/watch_queue.h | 87 +++ >> include/uapi/linux/fsinfo.h | 10 >> include/uapi/linux/keyctl.h | 1 >> include/uapi/linux/watch_queue.h | 213 ++++++++ >> kernel/sys_ni.c | 7 >> samples/Kconfig | 6 >> samples/Makefile | 1 >> samples/vfs/test-fsinfo.c | 13 >> samples/watch_queue/Makefile | 9 >> samples/watch_queue/watch_test.c | 308 +++++++++++ >> security/keys/Kconfig | 10 >> security/keys/compat.c | 2 >> security/keys/gc.c | 5 >> security/keys/internal.h | 30 + >> security/keys/key.c | 37 + >> security/keys/keyctl.c | 95 +++ >> security/keys/keyring.c | 17 - >> security/keys/request_key.c | 4 >> security/security.c | 29 + >> 52 files changed, 3121 insertions(+), 38 deletions(-) >> create mode 100644 Documentation/watch_queue.rst >> create mode 100644 drivers/base/watch.c >> create mode 100644 drivers/misc/watch_queue.c >> create mode 100644 fs/mount_notify.c >> create mode 100644 include/linux/watch_queue.h >> create mode 100644 include/uapi/linux/watch_queue.h >> create mode 100644 samples/watch_queue/Makefile >> create mode 100644 samples/watch_queue/watch_test.c >> >