Hi Hugh On Wed, Jul 16, 2014 at 12:06 PM, Hugh Dickins <hughd@xxxxxxxxxx> wrote: > On Fri, 13 Jun 2014, David Herrmann wrote: > >> If two processes share a common memory region, they usually want some >> guarantees to allow safe access. This often includes: >> - one side cannot overwrite data while the other reads it >> - one side cannot shrink the buffer while the other accesses it >> - one side cannot grow the buffer beyond previously set boundaries >> >> If there is a trust-relationship between both parties, there is no need >> for policy enforcement. However, if there's no trust relationship (eg., >> for general-purpose IPC) sharing memory-regions is highly fragile and >> often not possible without local copies. Look at the following two >> use-cases: >> 1) A graphics client wants to share its rendering-buffer with a >> graphics-server. The memory-region is allocated by the client for >> read/write access and a second FD is passed to the server. While >> scanning out from the memory region, the server has no guarantee that >> the client doesn't shrink the buffer at any time, requiring rather >> cumbersome SIGBUS handling. >> 2) A process wants to perform an RPC on another process. To avoid huge >> bandwidth consumption, zero-copy is preferred. After a message is >> assembled in-memory and a FD is passed to the remote side, both sides >> want to be sure that neither modifies this shared copy, anymore. The >> source may have put sensible data into the message without a separate >> copy and the target may want to parse the message inline, to avoid a >> local copy. >> >> While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide >> ways to achieve most of this, the first one is unproportionally ugly to >> use in libraries and the latter two are broken/racy or even disabled due >> to denial of service attacks. >> >> This patch introduces the concept of SEALING. If you seal a file, a >> specific set of operations is blocked on that file forever. >> Unlike locks, seals can only be set, never removed. Hence, once you >> verified a specific set of seals is set, you're guaranteed that no-one can >> perform the blocked operations on this file, anymore. >> >> An initial set of SEALS is introduced by this patch: >> - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced >> in size. This affects ftruncate() and open(O_TRUNC). >> - GROW: If SEAL_GROW is set, the file in question cannot be increased >> in size. This affects ftruncate(), fallocate() and write(). >> - WRITE: If SEAL_WRITE is set, no write operations (besides resizing) >> are possible. This affects fallocate(PUNCH_HOLE), mmap() and >> write(). >> - SEAL: If SEAL_SEAL is set, no further seals can be added to a file. >> This basically prevents the F_ADD_SEAL operation on a file and >> can be set to prevent others from adding further seals that you >> don't want. >> >> The described use-cases can easily use these seals to provide safe use >> without any trust-relationship: >> 1) The graphics server can verify that a passed file-descriptor has >> SEAL_SHRINK set. This allows safe scanout, while the client is >> allowed to increase buffer size for window-resizing on-the-fly. >> Concurrent writes are explicitly allowed. >> 2) For general-purpose IPC, both processes can verify that SEAL_SHRINK, >> SEAL_GROW and SEAL_WRITE are set. This guarantees that neither >> process can modify the data while the other side parses it. >> Furthermore, it guarantees that even with writable FDs passed to the >> peer, it cannot increase the size to hit memory-limits of the source >> process (in case the file-storage is accounted to the source). >> >> The new API is an extension to fcntl(), adding two new commands: >> F_GET_SEALS: Return a bitset describing the seals on the file. This >> can be called on any FD if the underlying file supports >> sealing. >> F_ADD_SEALS: Change the seals of a given file. This requires WRITE >> access to the file and F_SEAL_SEAL may not already be set. >> Furthermore, the underlying file must support sealing and >> there may not be any existing shared mapping of that file. >> Otherwise, EBADF/EPERM is returned. >> The given seals are _added_ to the existing set of seals >> on the file. You cannot remove seals again. >> >> The fcntl() handler is currently specific to shmem and disabled on all >> files. A file needs to explicitly support sealing for this interface to >> work. A separate syscall is added in a follow-up, which creates files that >> support sealing. There is no intention to support this on other >> file-systems. Semantics are unclear for non-volatile files and we lack any >> use-case right now. Therefore, the implementation is specific to shmem. >> >> Signed-off-by: David Herrmann <dh.herrmann@xxxxxxxxx> > > Looks pretty good to me, minor comments below. > >> --- >> fs/fcntl.c | 5 ++ >> include/linux/shmem_fs.h | 17 ++++ >> include/uapi/linux/fcntl.h | 15 ++++ >> mm/shmem.c | 189 ++++++++++++++++++++++++++++++++++++++++++++- >> 4 files changed, 223 insertions(+), 3 deletions(-) >> >> diff --git a/fs/fcntl.c b/fs/fcntl.c >> index 72c82f6..22d1c3d 100644 >> --- a/fs/fcntl.c >> +++ b/fs/fcntl.c >> @@ -21,6 +21,7 @@ >> #include <linux/rcupdate.h> >> #include <linux/pid_namespace.h> >> #include <linux/user_namespace.h> >> +#include <linux/shmem_fs.h> >> >> #include <asm/poll.h> >> #include <asm/siginfo.h> >> @@ -336,6 +337,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, >> case F_GETPIPE_SZ: >> err = pipe_fcntl(filp, cmd, arg); >> break; >> + case F_ADD_SEALS: >> + case F_GET_SEALS: >> + err = shmem_fcntl(filp, cmd, arg); >> + break; >> default: >> break; >> } >> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h >> index 4d1771c..50777b5 100644 >> --- a/include/linux/shmem_fs.h >> +++ b/include/linux/shmem_fs.h >> @@ -1,6 +1,7 @@ >> #ifndef __SHMEM_FS_H >> #define __SHMEM_FS_H >> >> +#include <linux/file.h> >> #include <linux/swap.h> >> #include <linux/mempolicy.h> >> #include <linux/pagemap.h> >> @@ -11,6 +12,7 @@ >> >> struct shmem_inode_info { >> spinlock_t lock; >> + unsigned int seals; /* shmem seals */ >> unsigned long flags; >> unsigned long alloced; /* data pages alloced to file */ >> union { >> @@ -65,4 +67,19 @@ static inline struct page *shmem_read_mapping_page( >> mapping_gfp_mask(mapping)); >> } >> >> +#ifdef CONFIG_TMPFS >> + >> +extern int shmem_add_seals(struct file *file, unsigned int seals); >> +extern int shmem_get_seals(struct file *file); >> +extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg); >> + >> +#else >> + >> +static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a) >> +{ >> + return -EINVAL; >> +} >> + >> +#endif >> + >> #endif >> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h >> index 074b886..beed138 100644 >> --- a/include/uapi/linux/fcntl.h >> +++ b/include/uapi/linux/fcntl.h >> @@ -28,6 +28,21 @@ >> #define F_GETPIPE_SZ (F_LINUX_SPECIFIC_BASE + 8) >> >> /* >> + * Set/Get seals >> + */ >> +#define F_ADD_SEALS (F_LINUX_SPECIFIC_BASE + 9) >> +#define F_GET_SEALS (F_LINUX_SPECIFIC_BASE + 10) >> + >> +/* >> + * Types of seals >> + */ >> +#define F_SEAL_SEAL 0x0001 /* prevent further seals from being set */ >> +#define F_SEAL_SHRINK 0x0002 /* prevent file from shrinking */ >> +#define F_SEAL_GROW 0x0004 /* prevent file from growing */ >> +#define F_SEAL_WRITE 0x0008 /* prevent writes */ >> +/* (1U << 31) is reserved for signed error codes */ >> + >> +/* >> * Types of directory notifications that may be requested. >> */ >> #define DN_ACCESS 0x00000001 /* File accessed */ >> diff --git a/mm/shmem.c b/mm/shmem.c >> index f484c27..1438b3e 100644 >> --- a/mm/shmem.c >> +++ b/mm/shmem.c >> @@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt; >> #include <linux/highmem.h> >> #include <linux/seq_file.h> >> #include <linux/magic.h> >> +#include <linux/fcntl.h> >> >> #include <asm/uaccess.h> >> #include <asm/pgtable.h> >> @@ -531,16 +532,23 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range); >> static int shmem_setattr(struct dentry *dentry, struct iattr *attr) >> { >> struct inode *inode = dentry->d_inode; >> + struct shmem_inode_info *info = SHMEM_I(inode); >> + loff_t oldsize = inode->i_size; >> + loff_t newsize = attr->ia_size; >> int error; >> >> error = inode_change_ok(inode, attr); >> if (error) >> return error; >> >> - if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) { >> - loff_t oldsize = inode->i_size; >> - loff_t newsize = attr->ia_size; >> + /* protected by i_mutex */ >> + if (attr->ia_valid & ATTR_SIZE) { >> + if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) || >> + (newsize > oldsize && (info->seals & F_SEAL_GROW))) >> + return -EPERM; >> + } > > Not important but... > I'd have thought all that was better inside the S_ISREG(inode->i_mode) && > (attr->ia_valid & ATTR_SIZE) block. Less unnecessary change above, and > more efficient for non-size attrs. You cannot seal anything but a regular > file anyway, right? You're right. Fixed. >> >> + if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) { >> if (newsize != oldsize) { >> i_size_write(inode, newsize); >> inode->i_ctime = inode->i_mtime = CURRENT_TIME; >> @@ -1315,6 +1323,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode >> info = SHMEM_I(inode); >> memset(info, 0, (char *)inode - (char *)info); >> spin_lock_init(&info->lock); >> + info->seals = F_SEAL_SEAL; >> info->flags = flags & VM_NORESERVE; >> INIT_LIST_HEAD(&info->swaplist); >> simple_xattrs_init(&info->xattrs); >> @@ -1374,7 +1383,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping, >> { >> int ret; >> struct inode *inode = mapping->host; >> + struct shmem_inode_info *info = SHMEM_I(inode); >> pgoff_t index = pos >> PAGE_CACHE_SHIFT; >> + >> + /* i_mutex is held by caller */ >> + if (info->seals & F_SEAL_WRITE) >> + return -EPERM; >> + if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size) >> + return -EPERM; > > I think this is your only addition which comes in a hot path. > Mel has been shaving nanoseconds off this path recently: you're not > introducing any atomic ops here, good, but I wonder if it would make any > measurable difference to include this pair of tests inside a single > "if (unlikely(info->seals)) {". Maybe not, but it wouldn't hurt. It doesn't hurt adding the unlikely() protection, either. Fixed. >> + >> ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL); >> if (ret == 0 && *pagep) >> init_page_accessed(*pagep); >> @@ -1715,11 +1732,166 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence) >> return offset; >> } >> >> +/* >> + * Setting SEAL_WRITE requires us to verify there's no pending writer. However, >> + * via get_user_pages(), drivers might have some pending I/O without any active >> + * user-space mappings (eg., direct-IO, AIO). Therefore, we look at all pages >> + * and see whether it has an elevated ref-count. If so, we abort. >> + * The caller must guarantee that no new user will acquire writable references >> + * to those pages to avoid races. >> + */ >> +static int shmem_test_for_pins(struct address_space *mapping) >> +{ >> + struct radix_tree_iter iter; >> + void **slot; >> + pgoff_t start; >> + struct page *page; >> + int error; >> + >> + /* flush additional refs in lru_add early */ >> + lru_add_drain_all(); >> + >> + error = 0; >> + start = 0; >> + rcu_read_lock(); >> + >> +restart: >> + radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) { >> + page = radix_tree_deref_slot(slot); >> + if (!page || radix_tree_exception(page)) { >> + if (radix_tree_deref_retry(page)) >> + goto restart; >> + } else if (page_count(page) - page_mapcount(page) > 1) { >> + error = -EBUSY; >> + break; >> + } >> + >> + if (need_resched()) { >> + cond_resched_rcu(); >> + start = iter.index + 1; >> + goto restart; >> + } >> + } >> + rcu_read_unlock(); >> + >> + return error; >> +} > > Please leave shmem_test_for_pins() (and the comment above it) > out of this particular patch. > > The implementation here satisfies none of us, make it harder to > figure out the patch improving it later, and distracts from the basic > sealing interface and functionality that you introduce in this patch. > > A brief comment on the issue instead - "But what if a page of the > object is pinned for pending I/O? See later patch" - maybe, but on the > whole I think it's better to raise and settle the issue in later patch. Removed and replaced by a dummy: int shmem_wait_for_pins(struct address_space *mapping) { return 0; } >> + >> +#define F_ALL_SEALS (F_SEAL_SEAL | \ >> + F_SEAL_SHRINK | \ >> + F_SEAL_GROW | \ >> + F_SEAL_WRITE) >> + >> +int shmem_add_seals(struct file *file, unsigned int seals) >> +{ >> + struct dentry *dentry = file->f_path.dentry; >> + struct inode *inode = dentry->d_inode; > > struct inode *inode = file_inode(file), and forget about dentry? Fixed. >> + struct shmem_inode_info *info = SHMEM_I(inode); >> + int error; >> + >> + /* >> + * SEALING >> + * Sealing allows multiple parties to share a shmem-file but restrict >> + * access to a specific subset of file operations. Seals can only be >> + * added, but never removed. This way, mutually untrusted parties can >> + * share common memory regions with a well-defined policy. A malicious >> + * peer can thus never perform unwanted operations on a shared object. >> + * >> + * Seals are only supported on special shmem-files and always affect >> + * the whole underlying inode. Once a seal is set, it may prevent some >> + * kinds of access to the file. Currently, the following seals are >> + * defined: >> + * SEAL_SEAL: Prevent further seals from being set on this file >> + * SEAL_SHRINK: Prevent the file from shrinking >> + * SEAL_GROW: Prevent the file from growing >> + * SEAL_WRITE: Prevent write access to the file >> + * >> + * As we don't require any trust relationship between two parties, we >> + * must prevent seals from being removed. Therefore, sealing a file >> + * only adds a given set of seals to the file, it never touches >> + * existing seals. Furthermore, the "setting seals"-operation can be >> + * sealed itself, which basically prevents any further seal from being >> + * added. >> + * >> + * Semantics of sealing are only defined on volatile files. Only >> + * anonymous shmem files support sealing. More importantly, seals are >> + * never written to disk. Therefore, there's no plan to support it on >> + * other file types. >> + */ >> + >> + if (file->f_op != &shmem_file_operations) >> + return -EINVAL; >> + if (!(file->f_mode & FMODE_WRITE)) >> + return -EINVAL; > > I would expect -EBADF there (like when you write to read-only fd). > Though I was okay with the -EPERM you had the previous version. Nice catch. Replaced by EPERM again. >> + if (seals & ~(unsigned int)F_ALL_SEALS) >> + return -EINVAL; >> + >> + mutex_lock(&inode->i_mutex); >> + >> + if (info->seals & F_SEAL_SEAL) { > > I notice this is inconsistent with F_SEAL_WRITE just below: > we're allowed to SEAL_WRITE what's already SEAL_WRITEd, > but not to SEAL_SEAL what's already SEAL_SEALed. > Oh, never mind, I can see that makes some sense. > >> + error = -EPERM; >> + goto unlock; >> + } >> + >> + if ((seals & F_SEAL_WRITE) && !(info->seals & F_SEAL_WRITE)) { >> + error = mapping_deny_writable(file->f_mapping); >> + if (error) > > Which would be -EBUSY, yes, that seems okay. > > And with your atomic i_mmap_writable changes in 1/7, and the i_mutex > here, the locking is now solid, and accomplished simply: nice. > >> + goto unlock; >> + >> + error = shmem_test_for_pins(file->f_mapping); >> + if (error) { >> + mapping_allow_writable(file->f_mapping); >> + goto unlock; >> + } > > Right, although I ask you to remove shmem_test_for_pins() from this > patch, I can see that you might want to include a "return 0" stub for > shmem_wait_for_pins() in this patch, just so that this can appear here > now, and we consider the non-atomicity of it. Yes, I agree this is > how it should proceed: first deny, then re-allow if waiting fails. I replaced shmem_test_for_pins() with shmem_wait_for_pins() and kept the code as is. >> + } >> + >> + info->seals |= seals; >> + error = 0; >> + >> +unlock: >> + mutex_unlock(&inode->i_mutex); >> + return error; >> +} >> +EXPORT_SYMBOL_GPL(shmem_add_seals); >> + >> +int shmem_get_seals(struct file *file) >> +{ >> + if (file->f_op != &shmem_file_operations) >> + return -EINVAL; > > That's fine, though it is worth considering whether return 0 > might be preferable. No, I suppose this is easier, fits with > shmem_fcntl() just returning -EINVAL when !TMPFS or !SHMEM. Agreed. >> + >> + return SHMEM_I(file_inode(file))->seals & F_ALL_SEALS; > > & F_ALL_SEALS? Okay, that may be some kind of future proofing that you > have in mind; but it may just be a leftover from when you were using bit > 31 for internal use. Nice catch. It's indeed a left-over. I removed it. >> +} >> +EXPORT_SYMBOL_GPL(shmem_get_seals); >> + >> +long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg) >> +{ >> + long error; >> + >> + switch (cmd) { >> + case F_ADD_SEALS: >> + /* disallow upper 32bit */ >> + if (arg >> 32) >> + return -EINVAL; > > That is worth checking, but gives > mm/shmem.c:1948:3: warning: right shift count >= width of type > on a 32-bit build. I expect there's an accepted way to do it; > I've used "arg > UINT_MAX" myself in some places. The zero-test bot already reported this and I fixed it with a u64 cast. Your UINT_MAX test is definitely nicer so I changed it again. Thanks! David >> + >> + error = shmem_add_seals(file, arg); >> + break; >> + case F_GET_SEALS: >> + error = shmem_get_seals(file); >> + break; >> + default: >> + error = -EINVAL; >> + break; >> + } >> + >> + return error; >> +} >> + >> static long shmem_fallocate(struct file *file, int mode, loff_t offset, >> loff_t len) >> { >> struct inode *inode = file_inode(file); >> struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); >> + struct shmem_inode_info *info = SHMEM_I(inode); >> struct shmem_falloc shmem_falloc; >> pgoff_t start, index, end; >> int error; >> @@ -1731,6 +1903,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, >> loff_t unmap_start = round_up(offset, PAGE_SIZE); >> loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1; >> >> + /* protected by i_mutex */ >> + if (info->seals & F_SEAL_WRITE) { >> + error = -EPERM; >> + goto out; >> + } >> + >> if ((u64)unmap_end > (u64)unmap_start) >> unmap_mapping_range(mapping, unmap_start, >> 1 + unmap_end - unmap_start, 0); >> @@ -1745,6 +1923,11 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, >> if (error) >> goto out; >> >> + if ((info->seals & F_SEAL_GROW) && offset + len > inode->i_size) { >> + error = -EPERM; >> + goto out; >> + } >> + >> start = offset >> PAGE_CACHE_SHIFT; >> end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; >> /* Try to avoid a swapstorm if len is impossible to satisfy */ >> -- >> 2.0.0 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html