Re: [PATCH v2] fuse: Abort connection if FUSE server get stuck

Etienne Martineau <etmartin4313@xxxxxxxxx> · Thu, 12 Dec 2024 17:45:56 -0500

On Thu, Dec 12, 2024 at 4:48 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>
> On Wed, Dec 11, 2024 at 3:04 PM Etienne Martineau
> <etmartin4313@xxxxxxxxx> wrote:
> >
> > On Wed, Dec 11, 2024 at 5:04 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> > >
> > > On Wed, Dec 11, 2024 at 11:45 AM <etmartin4313@xxxxxxxxx> wrote:
> > > >
> > > > From: Etienne Martineau <etmartin4313@xxxxxxxxx>
> > > >
> > > > This patch abort connection if HUNG_TASK_PANIC is set and a FUSE server
> > > > is getting stuck for too long. A slow FUSE server may tripped the
> > > > hang check timer for legitimate reasons hence consider disabling
> > > > HUNG_TASK_PANIC in that scenario.
> > > >
> > > > Without this patch, an unresponsive / buggy / malicious FUSE server can
> > > > leave the clients in D state for a long period of time and on system where
> > > > HUNG_TASK_PANIC is set, trigger a catastrophic reload.
> > > >
> > > > So, if HUNG_TASK_PANIC checking is enabled, we should wake up periodically
> > > > to abort connections that exceed the timeout value which is define to be
> > > > half the HUNG_TASK_TIMEOUT period, which keeps overhead low. The timer
> > > > is per connection and runs only if there are active FUSE request pending.
> > >
> > > Hi Etienne,
> > >
> > > For your use case, does the generic request timeouts logic and
> > > max_request_timeout systemctl implemented in [1] and [2] not suffice?
> > > IMO I don't think we should have logic specifically checking for hung
> > > task timeouts in fuse, if the generic solution can be used.
> > >
> > > Thanks,
> > > Joanne
> >
> > We need a way to avoid catastrophic reloads on systems where HUNG_TASK_PANIC
> > is set while a buggy / malicious FUSE server stops responding.
> > I would argue that this is much needed in stable branches as well...
> >
> > For that reason, I believe we need to keep things simple for step #1
> > e.g. there is no
> > need to introduce another knob as we already have HUNG_TASK_TIMEOUT which
> > holds the source of truth.
> >
> > IMO introducing those new knobs will put an unnecessary burden on sysadmins into
> > something that is error prone because unlike
> >   CONFIG_DETECT_HUNG_TASK=y
> >   CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
> > which is built-in, the "default_request_timeout" /
> > "max_request_timeout" needs to be
> > set appropriately after every reboot and failure to do so may have
> > nasty consequences.
>
> imo, it is not important to directly defend against the hung task case
> inside the fuse code itself. imo, the generic timeout should be used
> instead. As I understand it, hung task panic is mostly enabled for
> debug purposes and is enabled through a sysctl. imo, if the system
> admin enables the hung task panic sysctl value, then it is not too
> much of a burden for them to also set the fuse max request timeout
> sysctl.
>
>
> Thanks,
> Joanne
>
Yes, based on the comments received so far I agree that generic timeout is the
prefered approach. Looks like we are amongst the few that run production
systems with hung task panic set. So yeah, I will match fuse max request
timeout with hung task timeout to get the equivalent behavior.

On a slightly different matter, I realized that in some scenarios
there is no benefit
in stopping the timer when reaching the last request because another
request can come
right after and then we have to start the timer once again which keeps bouncing
between cancel_delayed_work_sync() and queue_delayed_work().

So I think it's best to stick with your approach of starting the timer
when the connection
is initially established. I can re-work this patch if needed?

I've been doing some testing and so far I hit timeout in bg_queue and
fpq->processing
but I cannot trigger timeouts in fiq->pending somehow?

thanks
Etienne