Re: [PATCH v6 1/2] fuse: add optional kernel-enforced timeout for requests

Joanne Koong <joannelkoong@xxxxxxxxx> · Tue, 1 Oct 2024 10:03:48 -0700

On Sat, Sep 28, 2024 at 1:43 AM Bernd Schubert
<bernd.schubert@xxxxxxxxxxx> wrote:
>
> Hi Joanne,
>
> On 9/27/24 21:36, Joanne Koong wrote:
> > On Mon, Sep 2, 2024 at 3:38 AM Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> >>
> >> On Fri, 30 Aug 2024 at 18:27, Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> >>>
> >>> There are situations where fuse servers can become unresponsive or
> >>> stuck, for example if the server is in a deadlock. Currently, there's
> >>> no good way to detect if a server is stuck and needs to be killed
> >>> manually.
> >>>
> >>> This commit adds an option for enforcing a timeout (in seconds) on
> >>> requests where if the timeout elapses without a reply from the server,
> >>> the connection will be automatically aborted.
> >>
> >> Okay.
> >>
> >> I'm not sure what the overhead (scheduling and memory) of timers, but
> >> starting one for each request seems excessive.
> >
> > I ran some benchmarks on this using the passthrough_ll server and saw
> > roughly a 1.5% drop in throughput (from ~775 MiB/s to ~765 MiB/s):
> > fio --name randwrite --ioengine=sync --thread --invalidate=1
> > --runtime=300 --ramp_time=10 --rw=randwrite --size=1G --numjobs=4
> > --bs=4k --alloc-size 98304 --allrandrepeat=1 --randseed=12345
> > --group_reporting=1 --directory=/root/fuse_mount
> >
> > Instead of attaching a timer to each request, I think we can instead
> > do the following:
> > * add a "start_time" field to each request tracking (in jiffies) when
> > the request was started
> > * add a new list to the connection that all requests get enqueued
> > onto. When the request is completed, it gets dequeued from this list
> > * have a timer for the connection that fires off every 10 seconds or
> > so. When this timer is fired, it checks if "jiffies > req->start_time
> > + fc->req_timeout" against the head of the list to check if the
> > timeout has expired and we need to abort the request. We only need to
> > check against the head of the list because we know every other request
> > after this was started later in time. I think we could even just use
> > the fc->lock for this instead of needing a separate lock. In the worst
> > case, this grants a 10 second upper bound on the timeout a user
> > requests (eg if the user requests 2 minutes, in the worst case the
> > timeout would trigger at 2 minutes and 10 seconds).
> >
> > Also, now that we're aborting the connection entirely on a timeout
> > instead of just aborting the request, maybe it makes sense to change
> > the timeout granularity to minutes instead of seconds. I'm envisioning
> > that this timeout mechanism will mostly be used as a safeguard against
> > malicious or buggy servers with a high timeout configured (eg 10
> > minutes), and minutes seems like a nicer interface for users than them
> > having to convert that to seconds.
> >
> > Let me know if I've missed anything with this approach but if not,
> > then I'll submit v7 with this change.
>
>
> sounds great to me. Just, could we do this per fuse_dev to avoid a
> single lock for all cores?
>

Will do! thanks for the suggestion - in that case, I'll add its own
spinlock for it too then.

Thanks,
Joanne

>
> Thanks,
> Bernd