Re: [PATCH RESEND v9 2/3] fuse: add optional kernel-enforced timeout for requests

Brian Geffon <bgeffon@xxxxxxxxxx> · Wed, 4 Dec 2024 09:51:53 -0500

On Wed, Dec 4, 2024 at 9:40 AM Tomasz Figa <tfiga@xxxxxxxxxxxx> wrote:
>
> On Tue, Dec 3, 2024 at 4:29 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> >
> > On Mon, Dec 2, 2024 at 6:43 AM Bernd Schubert
> > <bernd.schubert@xxxxxxxxxxx> wrote:
> > >
> > > On 12/2/24 10:45, Tomasz Figa wrote:
> > > > Hi everyone,
> > > >
> > > > On Thu, Nov 28, 2024 at 8:55 PM Sergey Senozhatsky
> > > > <senozhatsky@xxxxxxxxxxxx> wrote:
> > > >>
> > > >> Cc-ing Tomasz
> > > >>
> > > >> On (24/11/28 11:23), Bernd Schubert wrote:
> > > >>>> Thanks for the pointers again, Bernd.
> > > >>>>
> > > >>>>> Miklos had asked for to abort the connection in v4
> > > >>>>> https://lore.kernel.org/all/CAJfpegsiRNnJx7OAoH58XRq3zujrcXx94S2JACFdgJJ_b8FdHw@xxxxxxxxxxxxxx/raw
> > > >>>>
> > > >>>> OK, sounds reasonable. I'll try to give the series some testing in the
> > > >>>> coming days.
> > > >>>>
> > > >>>> // I still would probably prefer "seconds" timeout granularity.
> > > >>>> // Unless this also has been discussed already and Bernd has a link ;)
> > > >>>
> > > >>>
> > > >>> The issue is that is currently iterating through 256 hash lists +
> > > >>> pending + bg.
> > > >>>
> > > >>> https://lore.kernel.org/all/CAJnrk1b7bfAWWq_pFP=4XH3ddc_9GtAM2mE7EgWnx2Od+UUUjQ@xxxxxxxxxxxxxx/raw
> > > >>
> > > >> Oh, I see.
> > > >>
> > > >>> Personally I would prefer a second list to avoid the check spike and latency
> > > >>> https://lore.kernel.org/linux-fsdevel/9ba4eaf4-b9f0-483f-90e5-9512aded419e@xxxxxxxxxxx/raw
> > > >>
> > > >> That's good to know.  I like the idea of less CPU usage in general,
> > > >> our devices a battery powered so everything counts, to some extent.
> > > >>
> > > >>> What is your opinion about that? I guess android and chromium have an
> > > >>> interest low latencies and avoiding cpu spikes?
> > > >>
> > > >> Good question.
> > > >>
> > > >> Can't speak for android, in chromeos we probably will keep it at 1 minute,
> > > >> but this is because our DEFAULT_HUNG_TASK_TIMEOUT is larger than that (we
> > > >> use default value of 120 sec). There are setups that might use lower
> > > >> values, or even re-define default value, e.g.:
> > > >>
> > > >> arch/arc/configs/axs101_defconfig:CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=10
> > > >> arch/arc/configs/axs103_defconfig:CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=10
> > > >> arch/arc/configs/axs103_smp_defconfig:CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=10
> > > >> arch/arc/configs/hsdk_defconfig:CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=10
> > > >> arch/arc/configs/vdk_hs38_defconfig:CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=10
> > > >> arch/arc/configs/vdk_hs38_smp_defconfig:CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=10
> > > >> arch/powerpc/configs/mvme5100_defconfig:CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=20
> > > >>
> > > >> In those cases 1 minute fuse timeout will overshot HUNG_TASK_TIMEOUT
> > > >> and then the question is whether HUNG_TASK_PANIC is set.

In my opinion this is a good argument for having the hung task timeout
and a fuse timeout independent. The hung task timeout is for hung
kernel threads, in this situation we're potentially taking too long in
userspace but that doesn't necessarily mean the system is hung. I
think a loop which does an interruptible wait with a timeout of 1/2
the hung task timeout would make sense to ensure the hung task timeout
doesn't hit. There might be situations where we want a fuse timeout
which is larger than the hung task timeout, perhaps a file system
being read over a satellite internet connection?

> > > >>
> > > >> On the other hand, setups that set much lower timeout than
> > > >> DEFAULT_HUNG_TASK_TIMEOUT=120 will have extra CPU activities regardless,
> > > >> just because watchdogs will run more often.
> > > >>
> > > >> Tomasz, any opinions?
> > > >
> > > > First of all, thanks everyone for looking into this.
> >
> > Hi Sergey and Tomasz,
> >
> > Sorry for the late reply - I was out the last couple of days. Thanks
> > Bernd for weighing in and answering the questions!
> >
> > > >
> > > > How about keeping a list of requests in the FIFO order (in other
> > > > words: first entry is the first to timeout) and whenever the first
> > > > entry is being removed from the list (aka the request actually
> > > > completes), re-arming the timer to the timeout of the next request in
> > > > the list? This way we don't really have any timer firing unless there
> > > > is really a request that timed out.
> >
> > I think the issue with this is that we likely would end up wasting
> > more cpu cycles. For a busy FUSE server, there could be hundreds
> > (thousands?) of requests that happen within the span of
> > FUSE_TIMEOUT_TIMER_FREQ seconds.
> >
> > While working on the patch, one thing I considered was disarming the
> > timer in the timeout handler fuse_check_timeout() if no requests are
> > on the list, in order to accomodate for "quiet periods" (eg if the
> > FUSE server is inactive for a few minutes or hours) but ultimately
> > decided against it because of the overhead it'd incur per request (eg
> > check if the timer is disarmed, would most likely need to grab the
> > fc->lock as well since timer rearming would need to be synchronized
> > between background and non-background requests, etc.).
> >
> > All in all, imo I don't think having the timer trigger every 60
> > seconds (what FUSE_TIMEOUT_TIMER_FREQ is set to) is too costly.
> >
> > >
> > > Requests are in FIFO order on the list and only head is checked.
> > > There are 256 hash lists per fuse device for requests currently
> > > in user space, though.
> > >
> > > >
> > > > (In fact, we could optimize it even further by opportunistically
> > > > scheduling a timer slightly later and opportunistically handling timed
> > > > out requests when other requests are being completed, but this would
> > > > be optimizing for the slow path, so probably an overkill.)
> > > >
> > > > As for the length of the request timeout vs the hung task watchdog
> > > > timeout, my opinion is that we should make sure that the hung task
> > > > watchdog doesn't hit in any case, simply because a misbehaving
> > > > userspace process must not be able to panic the kernel. In the
> > > > blk-core, the blk_io_schedule() function [1] uses
> > > > sysctl_hung_task_timeout_secs to determine the maximum length of a
> > > > single uninterruptible sleep. I suppose we could use the same
> > > > calculation to obtain our timeout number. What does everyone think?
> > > >
> > > > [1] https://elixir.bootlin.com/linux/v6.12.1/source/block/blk-core.c#L1232
> > >
> > > I think that is a good idea.
> >
> > Btw, just something to note, the fuse request timeout has an upper
> > margin of error associated with it.
> >
> > Copying over from the commit message -
> >
> > "Please note that these timeouts are not 100% precise. The request may
> > take an extra FUSE_TIMEOUT_TIMER_FREQ seconds beyond the requested max
> > timeout due to how it's internally implemented."
> >
> > For example, if a server sets the request timeout config to 10
> > minutes, the server could be aborted after 11 minutes
> > (FUSE_TIMEOUT_TIMER_FREQ is set to 60 seconds internally) instead of
> > 10 minutes.
> >
>
> Let me add +Brian Geffon who also was thinking about the right timeout value.
>
> >
> > Thanks,
> > Joanne
> > >
> > >
> > > Thanks,
> > > Bernd