Re: [PATCH v6 1/2] fuse: add optional kernel-enforced timeout for requests

Joanne Koong <joannelkoong@xxxxxxxxx> · Tue, 3 Sep 2024 10:25:55 -0700

On Mon, Sep 2, 2024 at 3:38 AM Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
>
> On Fri, 30 Aug 2024 at 18:27, Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> >
> > There are situations where fuse servers can become unresponsive or
> > stuck, for example if the server is in a deadlock. Currently, there's
> > no good way to detect if a server is stuck and needs to be killed
> > manually.
> >
> > This commit adds an option for enforcing a timeout (in seconds) on
> > requests where if the timeout elapses without a reply from the server,
> > the connection will be automatically aborted.
>
> Okay.
>
> I'm not sure what the overhead (scheduling and memory) of timers, but
> starting one for each request seems excessive.
>
> Can we make the timeout per-connection instead of per request?
>
> I.e. When the first request is sent, the timer is started. When a
> reply is received but there are still outstanding requests, the timer
> is reset.  When the last reply is received, the timer is stopped.
>
> This should handle the frozen server case just as well.  It may not
> perfectly handle the case when the server is still alive but for some
> reason one or more requests get stuck, while others are still being
> processed.   The latter case is unlikely to be an issue in practice,
> IMO.

In that case, if the timeout is per-connection instead of per-request
and we're not stringent about some requests getting stuck, maybe it
makes more sense to just do this in userspace (libfuse) then? That
seems pretty simple with having a watchdog thread that periodically
(according to whatever specified timeout) checks if the number of
requests serviced is increasing when
/sys/fs/fuse/connections/*/waiting is non-zero.

If there are multiple server threads (eg libfuse's fuse_loop_mt
interface) and say, all of them are deadlocked except for 1 that is
actively servicing requests, then this wouldn't catch that case, but
even if this per-connection timeout was enforced in the kernel
instead, it wouldn't catch that case either.

So maybe this logic should just be moved to libfuse then? For this
we'd need to pass the connection's device id (fc->dev) to userspace
which i don't think we currently do, but that seems pretty simple. The
one downside I see is that this doesn't let sysadmins enforce an
automatic system-wide "max timeout" against malicious fuse servers but
if we are having the timeout be per-connection instead of per-request,
then a malicious server could still be malicious anyways (eg
deadlocking all threads except for 1).

Curious to hear what your and Bernrd's thoughts on this are.

Thanks,
Joanne
>
> Thanks,
> Miklos