Re: [PATCH RESEND v9 2/3] fuse: add optional kernel-enforced timeout for requests

On Mon, Dec 2, 2024 at 6:43 AM Bernd Schubert
<bernd.schubert@xxxxxxxxxxx> wrote:
>
> On 12/2/24 10:45, Tomasz Figa wrote:
> > Hi everyone,
> >
> > On Thu, Nov 28, 2024 at 8:55 PM Sergey Senozhatsky
> > <senozhatsky@xxxxxxxxxxxx> wrote:
> >>
> >> Cc-ing Tomasz
> >>
> >> On (24/11/28 11:23), Bernd Schubert wrote:
> >>>> Thanks for the pointers again, Bernd.
> >>>>
> >>>> Miklos had asked to abort the connection in v4
> >>>>> https://lore.kernel.org/all/CAJfpegsiRNnJx7OAoH58XRq3zujrcXx94S2JACFdgJJ_b8FdHw@xxxxxxxxxxxxxx/raw
> >>>>
> >>>> OK, sounds reasonable. I'll try to give the series some testing in the
> >>>> coming days.
> >>>>
> >>>> // I still would probably prefer "seconds" timeout granularity.
> >>>> // Unless this also has been discussed already and Bernd has a link ;)
> >>>
> >>>
> >>> The issue is that it is currently iterating through 256 hash lists +
> >>> pending + bg.
> >>>
> >>> https://lore.kernel.org/all/CAJnrk1b7bfAWWq_pFP=4XH3ddc_9GtAM2mE7EgWnx2Od+UUUjQ@xxxxxxxxxxxxxx/raw
> >>
> >> Oh, I see.
> >>
> >>> Personally I would prefer a second list to avoid the check spike and latency
> >>> https://lore.kernel.org/linux-fsdevel/9ba4eaf4-b9f0-483f-90e5-9512aded419e@xxxxxxxxxxx/raw
> >>
> >> That's good to know.  I like the idea of less CPU usage in general,
> >> our devices are battery powered, so everything counts, to some extent.
> >>
> >>> What is your opinion about that? I guess android and chromium have an
> >>> interest in low latencies and avoiding cpu spikes?
> >>
> >> Good question.
> >>
> >> Can't speak for android, in chromeos we probably will keep it at 1 minute,
> >> but this is because our DEFAULT_HUNG_TASK_TIMEOUT is larger than that (we
> >> use default value of 120 sec). There are setups that might use lower
> >> values, or even re-define default value, e.g.:
> >>
> >> arch/arc/configs/axs101_defconfig:CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=10
> >> arch/arc/configs/axs103_defconfig:CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=10
> >> arch/arc/configs/axs103_smp_defconfig:CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=10
> >> arch/arc/configs/hsdk_defconfig:CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=10
> >> arch/arc/configs/vdk_hs38_defconfig:CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=10
> >> arch/arc/configs/vdk_hs38_smp_defconfig:CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=10
> >> arch/powerpc/configs/mvme5100_defconfig:CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=20
> >>
> >> In those cases a 1 minute fuse timeout will overshoot HUNG_TASK_TIMEOUT
> >> and then the question is whether HUNG_TASK_PANIC is set.
> >>
> >> On the other hand, setups that set much lower timeout than
> >> DEFAULT_HUNG_TASK_TIMEOUT=120 will have extra CPU activities regardless,
> >> just because watchdogs will run more often.
> >>
> >> Tomasz, any opinions?
> >
> > First of all, thanks everyone for looking into this.

Hi Sergey and Tomasz,

Sorry for the late reply - I was out the last couple of days. Thanks
Bernd for weighing in and answering the questions!

> >
> > How about keeping a list of requests in the FIFO order (in other
> > words: first entry is the first to timeout) and whenever the first
> > entry is being removed from the list (aka the request actually
> > completes), re-arming the timer to the timeout of the next request in
> > the list? This way we don't really have any timer firing unless there
> > is really a request that timed out.

I think the issue with this is that we likely would end up wasting
more cpu cycles. For a busy FUSE server, there could be hundreds
(or thousands) of requests completing within the span of
FUSE_TIMEOUT_TIMER_FREQ seconds, and each completion of the head
request would re-arm the timer.

While working on the patch, one thing I considered was disarming the
timer in the timeout handler fuse_check_timeout() if no requests are
on the list, in order to accommodate "quiet periods" (eg if the
FUSE server is inactive for a few minutes or hours), but I ultimately
decided against it because of the overhead it'd incur per request (eg
checking whether the timer is disarmed, which would most likely need
to grab fc->lock as well, since timer rearming would need to be
synchronized between background and non-background requests, etc.).

All in all, I don't think having the timer trigger every 60
seconds (which is what FUSE_TIMEOUT_TIMER_FREQ is set to) is too costly.
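To make the trade-off concrete, here is a small user-space sketch of
the periodic scan approach (the names here - req, head_timed_out,
TIMER_FREQ - are illustrative, not the actual patch internals). The
key point is that because requests sit on each list in FIFO order,
the scan only needs to look at each list head: if the oldest request
has not expired, none of the newer ones have either.

```c
#include <stdbool.h>

#define TIMER_FREQ 60  /* seconds between scans, like FUSE_TIMEOUT_TIMER_FREQ */

struct req {
	long start_time;   /* when the request was queued, in seconds */
	struct req *next;  /* FIFO order: the head is always the oldest */
};

/* Return true if the oldest request on the list has exceeded the timeout. */
static bool head_timed_out(const struct req *head, long now, long timeout)
{
	if (!head)
		return false;  /* empty list: nothing can time out */
	return now - head->start_time > timeout;
}
```

With 256 hash lists plus the pending and bg lists, one scan is ~258
head checks every TIMER_FREQ seconds, regardless of how many requests
completed in between.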

>
> Requests are in FIFO order on the list and only head is checked.
> There are 256 hash lists per fuse device for requests currently
> in user space, though.
>
> >
> > (In fact, we could optimize it even further by opportunistically
> > scheduling a timer slightly later and opportunistically handling timed
> > out requests when other requests are being completed, but this would
> > be optimizing for the slow path, so probably an overkill.)
> >
> > As for the length of the request timeout vs the hung task watchdog
> > timeout, my opinion is that we should make sure that the hung task
> > watchdog doesn't hit in any case, simply because a misbehaving
> > userspace process must not be able to panic the kernel. In the
> > blk-core, the blk_io_schedule() function [1] uses
> > sysctl_hung_task_timeout_secs to determine the maximum length of a
> > single uninterruptible sleep. I suppose we could use the same
> > calculation to obtain our timeout number. What does everyone think?
> >
> > [1] https://elixir.bootlin.com/linux/v6.12.1/source/block/blk-core.c#L1232
>
> I think that is a good idea.
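
For reference, blk_io_schedule() in block/blk-core.c caps a single
uninterruptible sleep at half of sysctl_hung_task_timeout_secs so the
watchdog can never fire. A fuse timeout derived the same way might
look like the sketch below (the helper name and plain-argument form
are illustrative; in the kernel the sysctl would be read via
READ_ONCE(sysctl_hung_task_timeout_secs)):

```c
/*
 * Clamp a requested fuse timeout so a stuck request is always aborted
 * before the hung task watchdog can trigger.
 */
static long fuse_effective_timeout(long requested_secs, long hung_task_secs)
{
	/* A sysctl value of 0 disables the watchdog; no cap needed. */
	if (hung_task_secs == 0)
		return requested_secs;

	/* Cap at half the watchdog interval, as blk_io_schedule() does. */
	long cap = hung_task_secs / 2;
	return requested_secs < cap ? requested_secs : cap;
}
```

With the default DEFAULT_HUNG_TASK_TIMEOUT=120, this would cap the
fuse timeout at 60 seconds; on the axs101-style configs with a 10
second watchdog it would cap at 5 seconds.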

Btw, just something to note, the fuse request timeout has an upper
margin of error associated with it.

Copying over from the commit message -

"Please note that these timeouts are not 100% precise. The request may
take an extra FUSE_TIMEOUT_TIMER_FREQ seconds beyond the requested max
timeout due to how it's internally implemented."

For example, if a server sets the request timeout config to 10
minutes, the connection could be aborted after 11 minutes
(FUSE_TIMEOUT_TIMER_FREQ is set to 60 seconds internally) instead of
10 minutes.
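
As a quick sanity check on that arithmetic: with a scan every
FUSE_TIMEOUT_TIMER_FREQ seconds, the worst case is simply the
configured timeout plus one scan interval (the helper name below is
illustrative):

```c
/*
 * Worst-case expiry: a request queued just after a scan can live for
 * up to max_timeout + timer_freq seconds before the next scan that
 * notices it has expired.
 */
static long worst_case_expiry_secs(long max_timeout, long timer_freq)
{
	return max_timeout + timer_freq;
}
```

So a 600 second (10 minute) timeout with a 60 second scan interval
gives a worst case of 660 seconds, matching the 11 minute example
above.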


Thanks,
Joanne
>
>
> Thanks,
> Bernd




