Re: [PATCH v6 1/2] fuse: add optional kernel-enforced timeout for requests

Joanne Koong <joannelkoong@xxxxxxxxx> · Fri, 27 Sep 2024 12:36:23 -0700

On Mon, Sep 2, 2024 at 3:38 AM Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
>
> On Fri, 30 Aug 2024 at 18:27, Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> >
> > There are situations where fuse servers can become unresponsive or
> > stuck, for example if the server is in a deadlock. Currently, there's
> > no good way to detect if a server is stuck and needs to be killed
> > manually.
> >
> > This commit adds an option for enforcing a timeout (in seconds) on
> > requests where if the timeout elapses without a reply from the server,
> > the connection will be automatically aborted.
>
> Okay.
>
> I'm not sure what the overhead (scheduling and memory) of timers, but
> starting one for each request seems excessive.

I ran some benchmarks on this using the passthrough_ll server and saw
roughly a 1.5% drop in throughput (from ~775 MiB/s to ~765 MiB/s):
fio --name randwrite --ioengine=sync --thread --invalidate=1
--runtime=300 --ramp_time=10 --rw=randwrite --size=1G --numjobs=4
--bs=4k --alloc-size 98304 --allrandrepeat=1 --randseed=12345
--group_reporting=1 --directory=/root/fuse_mount

Instead of attaching a timer to each request, I think we can instead
do the following:
* add a "start_time" field to each request tracking (in jiffies) when
the request was started
* add a new list to the connection that all requests get enqueued
onto. When the request is completed, it gets dequeued from this list
* have a timer for the connection that fires off every 10 seconds or
so. When this timer is fired, it checks if "jiffies > req->start_time
+ fc->req_timeout" against the head of the list to check if the
timeout has expired and we need to abort the request. We only need to
check against the head of the list because we know every other request
after this was started later in time. I think we could even just use
the fc->lock for this instead of needing a separate lock. In the worst
case, this grants a 10 second upper bound on the timeout a user
requests (eg if the user requests 2 minutes, in the worst case the
timeout would trigger at 2 minutes and 10 seconds).

Also, now that we're aborting the connection entirely on a timeout
instead of just aborting the request, maybe it makes sense to change
the timeout granularity to minutes instead of seconds. I'm envisioning
that this timeout mechanism will mostly be used as a safeguard against
malicious or buggy servers with a high timeout configured (eg 10
minutes), and minutes seems like a nicer interface for users than them
having to convert that to seconds.

Let me know if I've missed anything with this approach but if not,
then I'll submit v7 with this change.

Thanks,
Joanne

>
> Can we make the timeout per-connection instead of per request?
>
> I.e. When the first request is sent, the timer is started. When a
> reply is received but there are still outstanding requests, the timer
> is reset.  When the last reply is received, the timer is stopped.
>
> This should handle the frozen server case just as well.  It may not
> perfectly handle the case when the server is still alive but for some
> reason one or more requests get stuck, while others are still being
> processed.   The latter case is unlikely to be an issue in practice,
> IMO.
>
> Thanks,
> Miklos