Hi, Christian,

Thanks for the review.

On 5/28/24 4:38 PM, Christian Brauner wrote:
> On Fri, May 24, 2024 at 02:40:28PM +0800, Jingbo Xu wrote:
>> Background
>> ==========
>> The fd of '/dev/fuse' serves as a message transmission channel between
>> FUSE filesystem (kernel space) and fuse server (user space). Once the
>> fd gets closed (intentionally or unintentionally), the FUSE filesystem
>> gets aborted, and any attempt of filesystem access gets -ECONNABORTED
>> error until the FUSE filesystem finally umounted.
>>
>> It is one of the requisites in production environment to provide
>> uninterruptible filesystem service. The most straightforward way, and
>> maybe the most widely used way, is that make another dedicated user
>> daemon (similar to systemd fdstore) keep the device fd open. When the
>> fuse daemon recovers from a crash, it can retrieve the device fd from the
>> fdstore daemon through socket takeover (Unix domain socket) method [1]
>> or pidfd_getfd() syscall [2]. In this way, as long as the fdstore
>> daemon doesn't exit, the FUSE filesystem won't get aborted once the fuse
>> daemon crashes, though the filesystem service may hang there for a while
>> when the fuse daemon gets restarted and has not been completely
>> recovered yet.
>>
>> This picture indeed works and has been deployed in our internal
>> production environment until the following issues are encountered:
>>
>> 1. The fdstore daemon may be killed by mistake, in which case the FUSE
>> filesystem gets aborted and irrecoverable.
>
> That's only a problem if you use the fdstore of the per-user instance.
> The main fdstore is part of PID 1 and you can't kill that. So really,
> systemd needs to hand the fds from the per-user instance to the main
> fdstore.

systemd has indeed implemented its own fdstore mechanism in user space.

However, more and more fuse daemons now run inside containers, and a
container generally has no systemd inside it.

>
>> 2. In scenarios of containerized deployment, the fuse daemon is deployed
>> in a container POD, and a dedicated fdstore daemon needs to be deployed
>> for each fuse daemon. The fdstore daemon could consume a amount of
>> resources (e.g. memory footprint), which is not conducive to the dense
>> container deployment.
>>
>> 3. Each fuse daemon implementation needs to implement its own fdstore
>> daemon. If we implement the fuse recovery mechanism on the kernel side,
>> all fuse daemon implementations could reuse this mechanism.
>
> You can just the global fdstore. That is a design limitation not an
> inherent limitation.

What I initially meant is that each fuse daemon implementation (e.g.
s3fs, ossfs, and those of other vendors) needs to build its own, yet
similar, mechanism for daemon failover. For container scenarios there is
no common fdstore component comparable to the systemd fdstore.

I admit that implementing a kernel-side fdstore is controversial, so this
RFC patch only implements a failover mechanism for the fuse server. But I
also understand Miklos's concern, since all we really need to support
daemon failover is something like an fdstore that keeps the device fd
alive.

--
Thanks,
Jingbo
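
For reference, the user-space fdstore handoff described in the quoted background (the "socket takeover" method over a Unix domain socket) can be sketched roughly as below. This is only a minimal illustration, not code from the patch: a pipe stands in for the '/dev/fuse' fd so that it runs without FUSE, both roles live in one process for brevity, and the names send_fd, recv_fd and LABEL are hypothetical.

```python
import os
import socket

LABEL = b"fuse-dev"  # hypothetical name attached to the stored fd


def send_fd(sock: socket.socket, fd: int) -> None:
    """Pass fd to the peer as SCM_RIGHTS ancillary data."""
    socket.send_fds(sock, [LABEL], [fd])


def recv_fd(sock: socket.socket) -> tuple[bytes, int]:
    """Receive one fd and its label from the peer."""
    label, fds, _flags, _addr = socket.recv_fds(sock, 64, 1)
    return label, fds[0]


# Assumption: a pipe stands in for the /dev/fuse fd.
dev_fd, peer_fd = os.pipe()

# One end for the fuse daemon, one for the fdstore daemon. In practice these
# would be separate processes connected through a named Unix domain socket.
daemon_sock, store_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# 1. The fuse daemon hands a reference to the device fd to the fdstore.
send_fd(daemon_sock, dev_fd)
label, stored_fd = recv_fd(store_sock)

# 2. The fuse daemon "crashes": its own copy of the fd goes away, but the
#    open file description survives because the fdstore still references it.
os.close(dev_fd)

# 3. The restarted fuse daemon asks the fdstore for the fd back.
send_fd(store_sock, stored_fd)
_, recovered_fd = recv_fd(daemon_sock)

# The recovered fd refers to the same channel as the original one.
os.write(peer_fd, b"request")
data = os.read(recovered_fd, 16)
print(label, data)
```

The point of the sketch is step 2: the channel stays alive only as long as some process keeps a reference to the fd, which is why killing the fdstore daemon is fatal in issue 1 above.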