Re: [PATCH v6 1/2] fuse: add optional kernel-enforced timeout for requests

Bernd Schubert <bernd.schubert@xxxxxxxxxxx> · Sat, 28 Sep 2024 10:43:08 +0200

Hi Joanne,

On 9/27/24 21:36, Joanne Koong wrote:
> On Mon, Sep 2, 2024 at 3:38 AM Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
>>
>> On Fri, 30 Aug 2024 at 18:27, Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>>>
>>> There are situations where fuse servers can become unresponsive or
>>> stuck, for example if the server is in a deadlock. Currently, there's
>>> no good way to detect if a server is stuck and needs to be killed
>>> manually.
>>>
>>> This commit adds an option for enforcing a timeout (in seconds) on
>>> requests where if the timeout elapses without a reply from the server,
>>> the connection will be automatically aborted.
>>
>> Okay.
>>
>> I'm not sure what the overhead (scheduling and memory) of timers, but
>> starting one for each request seems excessive.
> 
> I ran some benchmarks on this using the passthrough_ll server and saw
> roughly a 1.5% drop in throughput (from ~775 MiB/s to ~765 MiB/s):
> fio --name randwrite --ioengine=sync --thread --invalidate=1
> --runtime=300 --ramp_time=10 --rw=randwrite --size=1G --numjobs=4
> --bs=4k --alloc-size 98304 --allrandrepeat=1 --randseed=12345
> --group_reporting=1 --directory=/root/fuse_mount
> 
> Instead of attaching a timer to each request, I think we can instead
> do the following:
> * add a "start_time" field to each request tracking (in jiffies) when
> the request was started
> * add a new list to the connection that all requests get enqueued
> onto. When the request is completed, it gets dequeued from this list
> * have a timer for the connection that fires off every 10 seconds or
> so. When this timer is fired, it checks if "jiffies > req->start_time
> + fc->req_timeout" against the head of the list to check if the
> timeout has expired and we need to abort the request. We only need to
> check against the head of the list because we know every other request
> after this was started later in time. I think we could even just use
> the fc->lock for this instead of needing a separate lock. In the worst
> case, this grants a 10 second upper bound on the timeout a user
> requests (eg if the user requests 2 minutes, in the worst case the
> timeout would trigger at 2 minutes and 10 seconds).
> 
> Also, now that we're aborting the connection entirely on a timeout
> instead of just aborting the request, maybe it makes sense to change
> the timeout granularity to minutes instead of seconds. I'm envisioning
> that this timeout mechanism will mostly be used as a safeguard against
> malicious or buggy servers with a high timeout configured (eg 10
> minutes), and minutes seems like a nicer interface for users than them
> having to convert that to seconds.
> 
> Let me know if I've missed anything with this approach but if not,
> then I'll submit v7 with this change.

sounds great to me. Just, could we do this per fuse_dev to avoid a
single lock for all cores?

Thanks,
Bernd