On Mon, Oct 7, 2024 at 1:02 PM Bernd Schubert <bernd.schubert@xxxxxxxxxxx> wrote:
>
> On 10/7/24 20:39, Joanne Koong wrote:
> > On Tue, Oct 1, 2024 at 10:03 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> >>
> >> On Sat, Sep 28, 2024 at 1:43 AM Bernd Schubert
> >> <bernd.schubert@xxxxxxxxxxx> wrote:
> >>>
> >>> Hi Joanne,
> >>>
> >>> On 9/27/24 21:36, Joanne Koong wrote:
> >>>> On Mon, Sep 2, 2024 at 3:38 AM Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> On Fri, 30 Aug 2024 at 18:27, Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>> There are situations where fuse servers can become unresponsive or
> >>>>>> stuck, for example if the server is in a deadlock. Currently, there's
> >>>>>> no good way to detect if a server is stuck and needs to be killed
> >>>>>> manually.
> >>>>>>
> >>>>>> This commit adds an option for enforcing a timeout (in seconds) on
> >>>>>> requests; if the timeout elapses without a reply from the server,
> >>>>>> the connection will be automatically aborted.
> >>>>>
> >>>>> Okay.
> >>>>>
> >>>>> I'm not sure what the overhead (scheduling and memory) of timers is,
> >>>>> but starting one for each request seems excessive.
> >>>>
> >>>> I ran some benchmarks on this using the passthrough_ll server and saw
> >>>> roughly a 1.5% drop in throughput (from ~775 MiB/s to ~765 MiB/s):
> >>>>
> >>>> fio --name randwrite --ioengine=sync --thread --invalidate=1
> >>>> --runtime=300 --ramp_time=10 --rw=randwrite --size=1G --numjobs=4
> >>>> --bs=4k --alloc-size 98304 --allrandrepeat=1 --randseed=12345
> >>>> --group_reporting=1 --directory=/root/fuse_mount
> >>>>
> >>>> Instead of attaching a timer to each request, I think we can instead
> >>>> do the following:
> >>>> * add a "start_time" field to each request tracking (in jiffies) when
> >>>>   the request was started
> >>>> * add a new list to the connection that all requests get enqueued
> >>>>   onto. When the request is completed, it gets dequeued from this list
> >>>> * have a timer for the connection that fires off every 10 seconds or
> >>>>   so. When this timer fires, it checks "jiffies > req->start_time +
> >>>>   fc->req_timeout" against the head of the list to see whether the
> >>>>   timeout has expired and we need to abort the request. We only need
> >>>>   to check against the head of the list because we know every other
> >>>>   request after it was started later in time. I think we could even
> >>>>   just use the fc->lock for this instead of needing a separate lock.
> >>>>   In the worst case, this grants a 10 second upper bound on the
> >>>>   timeout a user requests (eg if the user requests 2 minutes, in the
> >>>>   worst case the timeout would trigger at 2 minutes and 10 seconds).
> >>>>
> >>>> Also, now that we're aborting the connection entirely on a timeout
> >>>> instead of just aborting the request, maybe it makes sense to change
> >>>> the timeout granularity to minutes instead of seconds. I'm envisioning
> >>>> that this timeout mechanism will mostly be used as a safeguard against
> >>>> malicious or buggy servers with a high timeout configured (eg 10
> >>>> minutes), and minutes seems like a nicer interface for users than
> >>>> having to convert that to seconds.
> >>>>
> >>>> Let me know if I've missed anything with this approach but if not,
> >>>> then I'll submit v7 with this change.
> >>>
> >>> Sounds great to me. Just, could we do this per fuse_dev to avoid a
> >>> single lock for all cores?
> >>
> >> Will do! Thanks for the suggestion - in that case, I'll add its own
> >> spinlock for it as well.
> >
> > I realized while working on v7 that we can't do this per fuse device
> > because the request is only associated with a device once it's read in
> > by the server (eg fuse_dev_do_read).
> >
> > I ran some rough preliminary benchmarks with
> >
> > ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4
> > -o source=~/fstests ~/fuse_mount
> >
> > and
> >
> > fio --name randwrite --ioengine=sync --thread --invalidate=1
> > --runtime=300 --ramp_time=10 --rw=randwrite --size=1G --numjobs=4
> > --bs=4k --alloc-size 98304 --allrandrepeat=1 --randseed=12345
> > --group_reporting=1 --directory=fuse_mount
> >
> > and didn't see any noticeable difference in throughput (~37 MiB/sec on
> > my system) with vs without the timeout.
>
> That is not much - isn't your limit the backend? I wonder what would
> happen with the 25GB/s I'm testing with. Wouldn't it make sense to test
> this with large sequential IO? And possibly with the passthrough-hp
> branch that bypasses IO? And a NUMA system probably would be helpful as
> well. I.e. to test the effect on the kernel side without having an
> IO-limited system?

The preliminary benchmarks yesterday were run on a VM because I had
trouble getting consistent results between baseline runs (on origin
without my changes) on my usual test machine. I'm going to get this
sorted out and run some tests again. What are you testing on that's
giving you 25 GB/s?

>
> With the io-uring interface requests stay in queues on the incoming CPU,
> so that is easier to achieve there, although I wonder for your use case
> if it wouldn't be sufficient to start the timer only when the request is
> on the way to the fuse server? One disadvantage I see is that virtiofs
> would need to be specially handled.

Unfortunately I don't think it suffices to only start the timer when the
request is on the way to the fuse server. If there's a malicious or
deadlocked server, it might not read from /dev/fuse, but we would want to
abort the connection in those cases as well.
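
For reference, below is a very rough, untested sketch of the per-connection
check described earlier in the thread, with start_time recorded when the
request is queued rather than when the server reads it. All the new names
(start_time, timeout_entry, timeout_list, req_timeout, timeout_work,
FUSE_TIMEOUT_CHECK_PERIOD, fuse_check_timeout) are made up for illustration
and are not from the actual patch; the sketch assumes it lives in fs/fuse/
so it can use fuse_i.h, fc->lock and fuse_abort_conn():

#include <linux/jiffies.h>
#include <linux/workqueue.h>

#include "fuse_i.h"

/* check the oldest outstanding request every ~10 seconds */
#define FUSE_TIMEOUT_CHECK_PERIOD (10 * HZ)

/*
 * Hypothetical new members:
 *   struct fuse_req:  unsigned long start_time;       jiffies at queue time
 *                     struct list_head timeout_entry; entry on timeout_list
 *   struct fuse_conn: struct list_head timeout_list;  requests in start order
 *                     unsigned long req_timeout;      in jiffies, 0 = disabled
 *                     struct delayed_work timeout_work;
 */

static void fuse_check_timeout(struct work_struct *work)
{
        struct fuse_conn *fc = container_of(work, struct fuse_conn,
                                            timeout_work.work);
        struct fuse_req *req;
        bool expired = false;

        spin_lock(&fc->lock);
        /*
         * Requests sit on timeout_list in start order, so only the head
         * needs checking: if the oldest request hasn't timed out yet,
         * none of the later ones have either.
         */
        req = list_first_entry_or_null(&fc->timeout_list,
                                       struct fuse_req, timeout_entry);
        if (req && time_after(jiffies, req->start_time + fc->req_timeout))
                expired = true;
        spin_unlock(&fc->lock);

        if (expired)
                fuse_abort_conn(fc);    /* oldest request exceeded the timeout */
        else
                schedule_delayed_work(&fc->timeout_work,
                                      FUSE_TIMEOUT_CHECK_PERIOD);
}

Requests would set start_time and go onto timeout_list when they are
queued, and come off it when they complete. I used a delayed_work here
instead of a raw timer_list only because aborting the connection from
softirq context seems dubious; a timer that kicks a work item would work
just as well.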

Thanks,
Joanne

>
>
> Thanks,
> Bernd
>
>