Re: [PATCH v6 1/2] fuse: add optional kernel-enforced timeout for requests

Bernd Schubert <bernd.schubert@xxxxxxxxxxx> · Mon, 7 Oct 2024 22:02:01 +0200

On 10/7/24 20:39, Joanne Koong wrote:
> On Tue, Oct 1, 2024 at 10:03 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>>
>> On Sat, Sep 28, 2024 at 1:43 AM Bernd Schubert
>> <bernd.schubert@xxxxxxxxxxx> wrote:
>>>
>>> Hi Joanne,
>>>
>>> On 9/27/24 21:36, Joanne Koong wrote:
>>>> On Mon, Sep 2, 2024 at 3:38 AM Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
>>>>>
>>>>> On Fri, 30 Aug 2024 at 18:27, Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>>>>>>
>>>>>> There are situations where fuse servers can become unresponsive or
>>>>>> stuck, for example if the server is in a deadlock. Currently, there's
>>>>>> no good way to detect if a server is stuck and needs to be killed
>>>>>> manually.
>>>>>>
>>>>>> This commit adds an option for enforcing a timeout (in seconds) on
>>>>>> requests where if the timeout elapses without a reply from the server,
>>>>>> the connection will be automatically aborted.
>>>>>
>>>>> Okay.
>>>>>
>>>>> I'm not sure what the overhead (scheduling and memory) of timers, but
>>>>> starting one for each request seems excessive.
>>>>
>>>> I ran some benchmarks on this using the passthrough_ll server and saw
>>>> roughly a 1.5% drop in throughput (from ~775 MiB/s to ~765 MiB/s):
>>>> fio --name randwrite --ioengine=sync --thread --invalidate=1
>>>> --runtime=300 --ramp_time=10 --rw=randwrite --size=1G --numjobs=4
>>>> --bs=4k --alloc-size 98304 --allrandrepeat=1 --randseed=12345
>>>> --group_reporting=1 --directory=/root/fuse_mount
>>>>
>>>> Instead of attaching a timer to each request, I think we can instead
>>>> do the following:
>>>> * add a "start_time" field to each request tracking (in jiffies) when
>>>> the request was started
>>>> * add a new list to the connection that all requests get enqueued
>>>> onto. When the request is completed, it gets dequeued from this list
>>>> * have a timer for the connection that fires off every 10 seconds or
>>>> so. When this timer is fired, it checks if "jiffies > req->start_time
>>>> + fc->req_timeout" against the head of the list to check if the
>>>> timeout has expired and we need to abort the request. We only need to
>>>> check against the head of the list because we know every other request
>>>> after this was started later in time. I think we could even just use
>>>> the fc->lock for this instead of needing a separate lock. In the worst
>>>> case, this grants a 10 second upper bound on the timeout a user
>>>> requests (eg if the user requests 2 minutes, in the worst case the
>>>> timeout would trigger at 2 minutes and 10 seconds).
>>>>
>>>> Also, now that we're aborting the connection entirely on a timeout
>>>> instead of just aborting the request, maybe it makes sense to change
>>>> the timeout granularity to minutes instead of seconds. I'm envisioning
>>>> that this timeout mechanism will mostly be used as a safeguard against
>>>> malicious or buggy servers with a high timeout configured (eg 10
>>>> minutes), and minutes seems like a nicer interface for users than them
>>>> having to convert that to seconds.
>>>>
>>>> Let me know if I've missed anything with this approach but if not,
>>>> then I'll submit v7 with this change.
>>>
>>>
>>> sounds great to me. Just, could we do this per fuse_dev to avoid a
>>> single lock for all cores?
>>>
>>
>> Will do! thanks for the suggestion - in that case, I'll add its own
>> spinlock for it too then.
> 
> I realized while working on v7 that we can't do this per fuse device
> because the request is only associated with a device once it's read in
> by the server (eg fuse_dev_do_read).
> 
> I ran some rough preliminary benchmarks with
> ./libfuse/build/example/passthrough_ll  -o writeback -o max_threads=4
> -o source=~/fstests ~/fuse_mount
> and
> fio --name randwrite --ioengine=sync --thread --invalidate=1
> --runtime=300 --ramp_time=10 --rw=randwrite --size=1G --numjobs=4
> --bs=4k --alloc-size 98304 --allrandrepeat=1 --randseed=12345
> --group_reporting=1 --directory=fuse_mount
> 
> and didn't see any noticeable difference in throughput (~37 MiB/sec on
> my system) with vs without the timeout.
> 

That is not much, isn't your limit the backend? I wonder what would happen
with 25GB/s I'm testing with. Wouldn't it make sense for this to test with 
sequential large IO? And possibly with the passthrough-hp branch that
bypasses IO? And a NUMA system probably would be helpful as well. 
I.e. to test the effect on the kernel side without having an IO limited
system?

With the io-uring interface requests stay in queues from the in-coming CPU
so easier to achieve there, although I wonder for your use-case if it
wouldn't be sufficient to start the timer only when the request is on
the way to fuse-server? One disadvantage I see is that virtiofs would need
to be specially handled.

Thanks,
Bernd