On Mon, Sep 2, 2024 at 3:38 AM Miklos Szeredi <miklos@xxxxxxxxxx> wrote: > > On Fri, 30 Aug 2024 at 18:27, Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > > > There are situations where fuse servers can become unresponsive or > > stuck, for example if the server is in a deadlock. Currently, there's > > no good way to detect if a server is stuck and needs to be killed > > manually. > > > > This commit adds an option for enforcing a timeout (in seconds) on > > requests where if the timeout elapses without a reply from the server, > > the connection will be automatically aborted. > > Okay. > > I'm not sure what the overhead (scheduling and memory) of timers, but > starting one for each request seems excessive. I ran some benchmarks on this using the passthrough_ll server and saw roughly a 1.5% drop in throughput (from ~775 MiB/s to ~765 MiB/s): fio --name randwrite --ioengine=sync --thread --invalidate=1 --runtime=300 --ramp_time=10 --rw=randwrite --size=1G --numjobs=4 --bs=4k --alloc-size 98304 --allrandrepeat=1 --randseed=12345 --group_reporting=1 --directory=/root/fuse_mount Instead of attaching a timer to each request, I think we can instead do the following: * add a "start_time" field to each request tracking (in jiffies) when the request was started * add a new list to the connection that all requests get enqueued onto. When the request is completed, it gets dequeued from this list * have a timer for the connection that fires off every 10 seconds or so. When this timer is fired, it checks if "jiffies > req->start_time + fc->req_timeout" against the head of the list to check if the timeout has expired and we need to abort the request. We only need to check against the head of the list because we know every other request after this was started later in time. I think we could even just use the fc->lock for this instead of needing a separate lock. In the worst case, this grants a 10 second upper bound on the timeout a user requests (eg if the user requests 2 minutes, in the worst case the timeout would trigger at 2 minutes and 10 seconds). Also, now that we're aborting the connection entirely on a timeout instead of just aborting the request, maybe it makes sense to change the timeout granularity to minutes instead of seconds. I'm envisioning that this timeout mechanism will mostly be used as a safeguard against malicious or buggy servers with a high timeout configured (eg 10 minutes), and minutes seems like a nicer interface for users than them having to convert that to seconds. Let me know if I've missed anything with this approach but if not, then I'll submit v7 with this change. Thanks, Joanne > > Can we make the timeout per-connection instead of per request? > > I.e. When the first request is sent, the timer is started. When a > reply is received but there are still outstanding requests, the timer > is reset. When the last reply is received, the timer is stopped. > > This should handle the frozen server case just as well. It may not > perfectly handle the case when the server is still alive but for some > reason one or more requests get stuck, while others are still being > processed. The latter case is unlikely to be an issue in practice, > IMO. > > Thanks, > Miklos