Background
==========

The fd of '/dev/fuse' serves as a message transmission channel between
the FUSE filesystem (kernel space) and the fuse server (user space).
Once the fd gets closed (intentionally or unintentionally), the FUSE
filesystem gets aborted, and any attempt to access the filesystem fails
with -ECONNABORTED until the FUSE filesystem is finally unmounted.

Providing uninterrupted filesystem service is a requirement in
production environments. The most straightforward way, and maybe the
most widely used one, is to have another dedicated user daemon (similar
to systemd fdstore) keep the device fd open. When the fuse daemon
recovers from a crash, it can retrieve the device fd from the fdstore
daemon through the socket takeover (Unix domain socket) method [1] or
the pidfd_getfd() syscall [2]. In this way, as long as the fdstore
daemon doesn't exit, the FUSE filesystem won't get aborted when the
fuse daemon crashes, though the filesystem service may hang for a while
after the fuse daemon is restarted and before it has completely
recovered.

This scheme does work and has been deployed in our internal production
environment, but we have run into the following issues:

1. The fdstore daemon may be killed by mistake, in which case the FUSE
   filesystem gets aborted and cannot be recovered.

2. In containerized deployments, the fuse daemon is deployed in a
   container POD, and a dedicated fdstore daemon needs to be deployed
   for each fuse daemon. The fdstore daemon consumes a certain amount
   of resources (e.g. memory footprint), which is not conducive to
   dense container deployment.

3. Each fuse daemon implementation needs to implement its own fdstore
   daemon. If we implement the fuse recovery mechanism on the kernel
   side, all fuse daemon implementations could reuse this mechanism.


What we do
==========

Basic Recovery Mechanism
------------------------

We introduce a recovery mechanism for the fuse server on the kernel
side. To do this:

1. Introduce a new "tag=" mount option, with which users can identify a
   fuse connection by a unique name.
2. Introduce a new FUSE_DEV_IOC_ATTACH ioctl, with which the fuse
   server can reconnect to the fuse connection corresponding to the
   given tag.
3. Introduce a new FUSE_HAS_RECOVERY init flag. The fuse server should
   advertise this feature if it supports server recovery.

With the above recovery mechanism, the whole time sequence is as
follows:

- At the initial mount, the fuse filesystem is mounted with the "tag="
  option.
- The fuse server advertises the FUSE_HAS_RECOVERY flag when replying
  to FUSE_INIT.
- When the fuse server crashes and the (/dev/fuse) device fd is closed,
  the fuse connection won't be aborted.
- The requests submitted after the server crash stay in the iqueue; the
  processes submitting the requests hang there.
- The fuse server gets restarted and recovers the state it had before
  the crash (including the negotiation results of the last FUSE_INIT).
- The fuse server opens /dev/fuse and gets a new device fd, and then
  runs the FUSE_DEV_IOC_ATTACH ioctl on the new device fd to retrieve
  the fuse connection with the tag previously used to mount the fuse
  filesystem.
- The fuse server issues a FUSE_NOTIFY_RESEND notification to request
  the kernel to resend those inflight requests that were sent to the
  fuse server before the crash but have not been replied to yet.
- The fuse server starts to process requests normally (those queued in
  the iqueue and those resent by FUSE_NOTIFY_RESEND).

In summary, the requests submitted after the server crash stay in the
iqueue and get serviced once the fuse server recovers from the crash
and retrieves the previous fuse connection. As for the inflight
requests that were sent to the fuse server before the crash but have
not been replied to yet, the fuse server can request the kernel to
resend them through the FUSE_NOTIFY_RESEND notification type.

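The resend step in the sequence above can be sketched from the server
side as follows. This is only an illustration under stated assumptions,
not code from the patches: struct out_header is a local mirror of
struct fuse_out_header, NOTIFY_RESEND_CODE is assumed to match the
value of FUSE_NOTIFY_RESEND in the uapi header, and
send_notify_resend() is a hypothetical helper name. It relies on the
usual convention that a notification is written to the device fd as an
out header with unique == 0 and the notify code in the error field.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Local mirror of struct fuse_out_header (include/uapi/linux/fuse.h). */
struct out_header {
	uint32_t len;
	int32_t  error;
	uint64_t unique;
};

/*
 * Assumption: equals FUSE_NOTIFY_RESEND in enum fuse_notify_code.
 * Check it against the installed <linux/fuse.h> before relying on it.
 */
#define NOTIFY_RESEND_CODE 7

/*
 * Ask the kernel to resend requests that were handed to the server but
 * never answered. The notification has no payload, so the message is
 * just the header: unique == 0 marks it as a notification and the
 * error field carries the notify code.
 *
 * Returns 0 on success, -1 on a failed or short write.
 */
static int send_notify_resend(int dev_fd)
{
	struct out_header oh;

	memset(&oh, 0, sizeof(oh));
	oh.len = sizeof(oh);
	oh.error = NOTIFY_RESEND_CODE;
	oh.unique = 0;

	return write(dev_fd, &oh, sizeof(oh)) == (ssize_t)sizeof(oh) ? 0 : -1;
}
```

A recovered server would call send_notify_resend() on the new
/dev/fuse fd only after FUSE_DEV_IOC_ATTACH has succeeded, and before
entering its normal request loop.
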
Security Enhancement
--------------------

Besides, we offer a uid-based security enhancement for the fuse server
recovery mechanism. Without it, any malicious attacker could kill the
fuse server and take over the filesystem service through the recovery
mechanism.

To implement this, we introduce a new "rescue_uid=" mount option
specifying the expected uid of the legitimate process running the fuse
server. Only a process with the matching uid is then permitted to
retrieve the fuse connection through the server recovery mechanism.


Limitation
==========

1. The current mechanism won't send a new FUSE_INIT request to the fuse
   server and start a new negotiation when the fuse server re-attaches
   to the fuse connection through the FUSE_DEV_IOC_ATTACH ioctl. Thus
   the fuse server needs to recover the state it had before the crash
   (including the negotiation results of the last FUSE_INIT) by itself.

   PS. Because of this I had to resort to some hacks in the libfuse
   passthrough_ll daemon when testing the recovery feature.

2. With the current recovery mechanism, the fuse filesystem won't get
   aborted when the fuse server crashes. A subsequent umount will then
   hang. The call stack shows the hung task is waiting for FUSE_GETATTR
   on the mountpoint:

    [<0>] request_wait_answer+0xe1/0x200
    [<0>] fuse_simple_request+0x18e/0x2a0
    [<0>] fuse_do_getattr+0xc9/0x180
    [<0>] vfs_statx+0x92/0x170
    [<0>] vfs_fstatat+0x7c/0xb0
    [<0>] __do_sys_newstat+0x1d/0x40
    [<0>] do_syscall_64+0x60/0x170
    [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

   It's not fixed yet in this RFC version.

3. I don't know whether a kernel-based recovery mechanism is welcome on
   the community side. Any comment is welcome. Thanks!

[1] https://copyconstruct.medium.com/file-descriptor-transfer-over-unix-domain-sockets-dcbbf5b3b6ec
[2] https://copyconstruct.medium.com/seamless-file-descriptor-transfer-between-processes-with-pidfd-and-pidfd-getfd-816afcd19ed4


Jingbo Xu (2):
  fuse: introduce recovery mechanism for fuse server
  fuse: uid-based security enhancement for the recovery mechanism

 fs/fuse/dev.c             | 55 ++++++++++++++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h          | 15 +++++++++++
 fs/fuse/inode.c           | 46 +++++++++++++++++++++++++++++++-
 include/uapi/linux/fuse.h |  7 +++++
 4 files changed, 121 insertions(+), 2 deletions(-)

-- 
2.19.1.6.gb485710b