On Tue, Oct 1, 2024 at 10:03 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > On Sat, Sep 28, 2024 at 1:43 AM Bernd Schubert > <bernd.schubert@xxxxxxxxxxx> wrote: > > > > Hi Joanne, > > > > On 9/27/24 21:36, Joanne Koong wrote: > > > On Mon, Sep 2, 2024 at 3:38 AM Miklos Szeredi <miklos@xxxxxxxxxx> wrote: > > >> > > >> On Fri, 30 Aug 2024 at 18:27, Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > >>> > > >>> There are situations where fuse servers can become unresponsive or > > >>> stuck, for example if the server is in a deadlock. Currently, there's > > >>> no good way to detect if a server is stuck and needs to be killed > > >>> manually. > > >>> > > >>> This commit adds an option for enforcing a timeout (in seconds) on > > >>> requests where if the timeout elapses without a reply from the server, > > >>> the connection will be automatically aborted. > > >> > > >> Okay. > > >> > > >> I'm not sure what the overhead (scheduling and memory) of timers, but > > >> starting one for each request seems excessive. > > > > > > I ran some benchmarks on this using the passthrough_ll server and saw > > > roughly a 1.5% drop in throughput (from ~775 MiB/s to ~765 MiB/s): > > > fio --name randwrite --ioengine=sync --thread --invalidate=1 > > > --runtime=300 --ramp_time=10 --rw=randwrite --size=1G --numjobs=4 > > > --bs=4k --alloc-size 98304 --allrandrepeat=1 --randseed=12345 > > > --group_reporting=1 --directory=/root/fuse_mount > > > > > > Instead of attaching a timer to each request, I think we can instead > > > do the following: > > > * add a "start_time" field to each request tracking (in jiffies) when > > > the request was started > > > * add a new list to the connection that all requests get enqueued > > > onto. When the request is completed, it gets dequeued from this list > > > * have a timer for the connection that fires off every 10 seconds or > > > so. When this timer is fired, it checks if "jiffies > req->start_time > > > + fc->req_timeout" against the head of the list to check if the > > > timeout has expired and we need to abort the request. We only need to > > > check against the head of the list because we know every other request > > > after this was started later in time. I think we could even just use > > > the fc->lock for this instead of needing a separate lock. In the worst > > > case, this grants a 10 second upper bound on the timeout a user > > > requests (eg if the user requests 2 minutes, in the worst case the > > > timeout would trigger at 2 minutes and 10 seconds). > > > > > > Also, now that we're aborting the connection entirely on a timeout > > > instead of just aborting the request, maybe it makes sense to change > > > the timeout granularity to minutes instead of seconds. I'm envisioning > > > that this timeout mechanism will mostly be used as a safeguard against > > > malicious or buggy servers with a high timeout configured (eg 10 > > > minutes), and minutes seems like a nicer interface for users than them > > > having to convert that to seconds. > > > > > > Let me know if I've missed anything with this approach but if not, > > > then I'll submit v7 with this change. > > > > > > sounds great to me. Just, could we do this per fuse_dev to avoid a > > single lock for all cores? > > > > Will do! thanks for the suggestion - in that case, I'll add its own > spinlock for it too then. Actually, looking at this some more, we can just put this in the "struct fuse_pqueue" and use the fpq spinlock since the check for whether any requests timed out will be very quick (eg just checking against the first entry in the list). > > Thanks, > Joanne > > > > > Thanks, > > Bernd