On 9/3/24 19:25, Joanne Koong wrote: > On Mon, Sep 2, 2024 at 3:38 AM Miklos Szeredi <miklos@xxxxxxxxxx> wrote: >> >> On Fri, 30 Aug 2024 at 18:27, Joanne Koong <joannelkoong@xxxxxxxxx> wrote: >>> >>> There are situations where fuse servers can become unresponsive or >>> stuck, for example if the server is in a deadlock. Currently, there's >>> no good way to detect if a server is stuck and needs to be killed >>> manually. >>> >>> This commit adds an option for enforcing a timeout (in seconds) on >>> requests where if the timeout elapses without a reply from the server, >>> the connection will be automatically aborted. >> >> Okay. >> >> I'm not sure what the overhead (scheduling and memory) of timers, but >> starting one for each request seems excessive. >> >> Can we make the timeout per-connection instead of per request? >> >> I.e. When the first request is sent, the timer is started. When a >> reply is received but there are still outstanding requests, the timer >> is reset. When the last reply is received, the timer is stopped. >> >> This should handle the frozen server case just as well. It may not >> perfectly handle the case when the server is still alive but for some >> reason one or more requests get stuck, while others are still being >> processed. The latter case is unlikely to be an issue in practice, >> IMO. > > In that case, if the timeout is per-connection instead of per-request > and we're not stringent about some requests getting stuck, maybe it > makes more sense to just do this in userspace (libfuse) then? That > seems pretty simple with having a watchdog thread that periodically > (according to whatever specified timeout) checks if the number of > requests serviced is increasing when > /sys/fs/fuse/connections/*/waiting is non-zero. > > If there are multiple server threads (eg libfuse's fuse_loop_mt > interface) and say, all of them are deadlocked except for 1 that is > actively servicing requests, then this wouldn't catch that case, but > even if this per-connection timeout was enforced in the kernel > instead, it wouldn't catch that case either. > > So maybe this logic should just be moved to libfuse then? For this > we'd need to pass the connection's device id (fc->dev) to userspace > which i don't think we currently do, but that seems pretty simple. The > one downside I see is that this doesn't let sysadmins enforce an > automatic system-wide "max timeout" against malicious fuse servers but > if we are having the timeout be per-connection instead of per-request, > then a malicious server could still be malicious anyways (eg > deadlocking all threads except for 1). > > Curious to hear what your and Bernrd's thoughts on this are. I have question here, does it need to be an exact timeout or could it be an interval/epoch? Let's say you timeout based on epoch lists? Example 1) epoch-a starts, requests are added to epoch-a list. 2) epoch-b starts, epoch-a list should get empty 3) epoch-c starts, epoch-b list should get empty, kill the connection if epoch-a list is not empty (epoch-c list should not be needed, as epoch-a list can be used, once confirmed it is empty.) Here timeout would be epoch-a + epoch-b, i.e. max-timeout <= 2 * epoch-time. We could have more epochs/list-heads to make it more fine grained.