(On Tue, Oct 27, 2020 at 9:04 PM Mickaël Salaün <mic@xxxxxxxxxxx> wrote: > Thanks to the Landlock objects and ruleset, it is possible to identify > inodes according to a process's domain. To enable an unprivileged > process to express a file hierarchy, it first needs to open a directory > (or a file) and pass this file descriptor to the kernel through > landlock_add_rule(2). When checking if a file access request is > allowed, we walk from the requested dentry to the real root, following > the different mount layers. The access to each "tagged" inodes are > collected according to their rule layer level, and ANDed to create > access to the requested file hierarchy. This makes possible to identify > a lot of files without tagging every inodes nor modifying the > filesystem, while still following the view and understanding the user > has from the filesystem. > > Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not > keep the same struct inodes for the same inodes whereas these inodes are > in use. > > This commit adds a minimal set of supported filesystem access-control > which doesn't enable to restrict all file-related actions. This is the > result of multiple discussions to minimize the code of Landlock to ease > review. Thanks to the Landlock design, extending this access-control > without breaking user space will not be a problem. Moreover, seccomp > filters can be used to restrict the use of syscall families which may > not be currently handled by Landlock. [...] > diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h [...] > +/** > + * DOC: fs_access > + * > + * A set of actions on kernel objects may be defined by an attribute (e.g. > + * &struct landlock_path_beneath_attr) including a bitmask of access. > + * > + * Filesystem flags > + * ~~~~~~~~~~~~~~~~ > + * > + * These flags enable to restrict a sandbox process to a set of actions on s/sandbox/sandboxed/ [...] > diff --git a/security/landlock/fs.c b/security/landlock/fs.c [...] > +static const struct landlock_object_underops landlock_fs_underops = { > + .release = release_inode > +}; [...] > +/* Access-control management */ > + > +static bool check_access_path_continue( > + const struct landlock_ruleset *const domain, > + const struct path *const path, const u32 access_request, > + bool *const allow, u64 *const layer_mask) > +{ > + const struct landlock_rule *rule; > + const struct inode *inode; > + bool next = true; > + > + prefetch(path->dentry->d_parent); IIRC software prefetch() turned out to only rarely actually have a performance benefit, and they often actually make things worse; see e.g. <https://lwn.net/Articles/444336/>. Unless you have strong evidence that this actually brings a performance benefit, I'd probably get rid of this. > + if (d_is_negative(path->dentry)) > + /* Continues to walk while there is no mapped inode. */ > + return true; > + inode = d_backing_inode(path->dentry); > + rcu_read_lock(); > + rule = landlock_find_rule(domain, > + rcu_dereference(landlock_inode(inode)->object)); > + rcu_read_unlock(); > + > + /* Checks for matching layers. */ > + if (rule && (rule->layers | *layer_mask)) { > + *allow = (rule->access & access_request) == access_request; > + if (*allow) { > + *layer_mask &= ~rule->layers; > + /* Stops when a rule from each layer granted access. */ > + next = !!*layer_mask; > + } else { > + next = false; > + } > + } > + return next; > +} > + > +static int check_access_path(const struct landlock_ruleset *const domain, > + const struct path *const path, u32 access_request) > +{ > + bool allow = false; > + struct path walker_path; > + u64 layer_mask; > + > + if (WARN_ON_ONCE(!domain || !path)) > + return 0; > + /* > + * Allows access to pseudo filesystems that will never be mountable > + * (e.g. sockfs, pipefs), but can still be reachable through > + * /proc/self/fd . > + */ > + if ((path->dentry->d_sb->s_flags & SB_NOUSER) || > + (d_is_positive(path->dentry) && > + unlikely(IS_PRIVATE(d_backing_inode(path->dentry))))) > + return 0; > + if (WARN_ON_ONCE(domain->nb_layers < 1)) > + return -EACCES; > + > + layer_mask = GENMASK_ULL(domain->nb_layers - 1, 0); > + /* > + * An access request which is not handled by the domain should be > + * allowed. > + */ > + access_request &= domain->fs_access_mask; > + if (access_request == 0) > + return 0; > + walker_path = *path; > + path_get(&walker_path); > + /* > + * We need to walk through all the hierarchy to not miss any relevant > + * restriction. > + */ > + while (check_access_path_continue(domain, &walker_path, access_request, > + &allow, &layer_mask)) { The logic in this code might be clearer if check_access_path_continue() just returns whether the rule permitted the access. Then it'd look like: bool allow = false; [...] while (check_access_path_continue(domain, &walker_path, access_request, &layer_mask)) { if (layer_mask == 0) { allow = true; break; } [...] } I think that would make it clearer under which conditions we can end up returning "true" from check_access_path(). (The current code also looks correct to me, I just think it'd be clearer this way. If you disagree, you can keep it as-is.) > + struct dentry *parent_dentry; > + > +jump_up: > + /* > + * Does not work with orphaned/private mounts like overlayfs > + * layers for now (cf. ovl_path_real() and ovl_path_open()). > + */ > + if (walker_path.dentry == walker_path.mnt->mnt_root) { > + if (follow_up(&walker_path)) { > + /* Ignores hidden mount points. */ > + goto jump_up; > + } else { > + /* > + * Stops at the real root. Denies access > + * because not all layers have granted access. > + */ > + allow = false; > + break; > + } > + } > + if (unlikely(IS_ROOT(walker_path.dentry))) { > + /* > + * Stops at disconnected root directories. Only allows > + * access to internal filesystems (e.g. nsfs which is > + * reachable through /proc/self/ns). > + */ > + allow = !!(walker_path.mnt->mnt_flags & MNT_INTERNAL); > + break; > + } > + parent_dentry = dget_parent(walker_path.dentry); > + dput(walker_path.dentry); > + walker_path.dentry = parent_dentry; > + } > + path_put(&walker_path); > + return allow ? 0 : -EACCES; > +} [...] > +static inline u32 get_file_access(const struct file *const file) > +{ > + u32 access = 0; > + > + if (file->f_mode & FMODE_READ) { > + /* A directory can only be opened in read mode. */ > + if (S_ISDIR(file_inode(file)->i_mode)) > + return LANDLOCK_ACCESS_FS_READ_DIR; > + access = LANDLOCK_ACCESS_FS_READ_FILE; > + } > + /* > + * A LANDLOCK_ACCESS_FS_APPEND could be added but we also need to check > + * fcntl(2). > + */ Once https://lore.kernel.org/linux-api/20200831153207.GO3265@xxxxxxxxxxxxxxxxxxxxx/ lands, pwritev2() with RWF_NOAPPEND will also be problematic for classifying "write" vs "append"; you may want to include that in the comment. (Or delete the comment.) > + if (file->f_mode & FMODE_WRITE) > + access |= LANDLOCK_ACCESS_FS_WRITE_FILE; > + /* __FMODE_EXEC is indeed part of f_flags, not f_mode. */ > + if (file->f_flags & __FMODE_EXEC) > + access |= LANDLOCK_ACCESS_FS_EXECUTE; > + return access; > +} [...]