On Mon, Jun 12, 2023 at 03:52:21PM +0200, Sergei Shtepa wrote: > The header file contains a set of declarations, structures and control > requests (ioctl) that allows to manage the module from the user space. > > Co-developed-by: Christoph Hellwig <hch@xxxxxxxxxxxxx> > Signed-off-by: Christoph Hellwig <hch@xxxxxxxxxxxxx> > Tested-by: Donald Buczek <buczek@xxxxxxxxxxxxx> > Signed-off-by: Sergei Shtepa <sergei.shtepa@xxxxxxxxx> > --- > MAINTAINERS | 1 + > include/uapi/linux/blksnap.h | 421 +++++++++++++++++++++++++++++++++++ > 2 files changed, 422 insertions(+) > create mode 100644 include/uapi/linux/blksnap.h ..... > +/** > + * struct blksnap_snapshot_append_storage - Argument for the > + * &IOCTL_BLKSNAP_SNAPSHOT_APPEND_STORAGE control. > + * > + * @id: > + * Snapshot ID. > + * @bdev_path: > + * Device path string buffer. > + * @bdev_path_size: > + * Device path string buffer size. > + * @count: > + * Size of @ranges in the number of &struct blksnap_sectors. > + * @ranges: > + * Pointer to the array of &struct blksnap_sectors. > + */ > +struct blksnap_snapshot_append_storage { > + struct blksnap_uuid id; > + __u64 bdev_path; > + __u32 bdev_path_size; > + __u32 count; > + __u64 ranges; > +}; > + > +/** > + * define IOCTL_BLKSNAP_SNAPSHOT_APPEND_STORAGE - Append storage to the > + * difference storage of the snapshot. > + * > + * The snapshot difference storage can be set either before or after creating > + * the snapshot images. This allows to dynamically expand the difference > + * storage while holding the snapshot. > + * > + * Return: 0 if succeeded, negative errno otherwise. > + */ > +#define IOCTL_BLKSNAP_SNAPSHOT_APPEND_STORAGE \ > + _IOW(BLKSNAP, blksnap_ioctl_snapshot_append_storage, \ > + struct blksnap_snapshot_append_storage) That's an API I'm extremely uncomfortable with. We've learnt the lesson *many times* that userspace physical mappings of underlying file storage are unreliable. i.e. This is reliant on userspace telling the kernel the physical mapping of the filesystem file to block device LBA space and then providing a guarantee (somehow) that the mapping will always remain unchanged. i.e. It's reliant on passing FIEMAP data from the filesystem to userspace and then back into the kernel without it becoming stale and somehow providing a guarantee that nothing (not even the filesystem doing internal garbage collection) will change it. It is reliant on userspace detecting shared blocks in files and avoiding them; it's reliant on userspace never being able to read, write or modify that file; it's reliant on the -filesystem- never modifying the layout of that file; it's even reliant on a internal filesystem state that has to be locked down before the block mapping can be delegated to a third party for IO control. Further, we can't allow userspace to have any read access to the snapshot file even after it is no longer in use by the blksnap driver. The contents of the file will span multiple security contexts, contain sensitive data, etc and so it's contents must never be exposed to userspace. We cannot rely on userspace to delete it safely after use and hence we have to protect it's contents from exposure to userspace, too. We already have a mechanism that provides all these guarantees to a third party kernel subsystem: swap files. We already have a trusted path in the kernel to allow internal block mapping of a swap file to be retreived by the mm subsystem. We also have an inode flag that protects it such files against access and modification from anything other than internal kernel IO paths. We also allow them to be allocated as unwritten extents using fallocate() and we are never converted to written whilist in use as a swapfile. Hence the contents of them cannot be exposed to userspace even if the swapfile flag is removed and owner/permission changes are made to the file after it is released by the kernel. Swap files are an intrinsically safe mechanism for delegating fixed file mappings to kernel subsystems that have requirements for secure, trusted storage that userspace cannot tamper with. I note that the code behind the IOCTL_BLKSNAP_SNAPSHOT_APPEND_STORAGE ends up in diff_storage_add_range(), which allocates an extent structure for each range and links it into a linked list for later use. This is effectively the same structure that the mm swapfile code uses. It provides a swap_info_struct and a struct file to the filesystem via aops->swap_activate. The filesystem then iterates the extent list for the file and calls add_swap_extent() for each physical range in the file. The mm code then allocates a new extent structure for the range and links it into the extent rbtree in the swap_info_struct. This is the mapping it uses later on in the IO path. Adding a similar, more generic mapping operation that allows a private structure and a callback to the provided would allow the filesystem to provide this callback directly to subsystems like blksnap. Essentially diff_storage_add_range() becomes the iterator callback for blksnap. This makes the whole "userspace provides the mapping" problem goes away and we can use the swapfile mechanisms to provide all the other guarantees the kernel needs to ensure it can trust the contents and mappings of the blksnap snapshot files.... Thoughts? -Dave. -- Dave Chinner david@xxxxxxxxxxxxx