On Mon, Dec 16, 2024 at 1:15 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > On Sun, Dec 15, 2024 at 6:35 PM Etienne Martineau > <etmartin4313@xxxxxxxxx> wrote: > > > > On Fri, Dec 13, 2024 at 9:29 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > > > > > There are situations where fuse servers can become unresponsive or > > > stuck, for example if the server is deadlocked. Currently, there's no > > > good way to detect if a server is stuck and needs to be killed manually. > > > > > > This commit adds an option for enforcing a timeout (in seconds) for > > > requests where if the timeout elapses without the server responding to > > > the request, the connection will be automatically aborted. > > > > > > Please note that these timeouts are not 100% precise. For example, the > > > request may take roughly an extra FUSE_TIMEOUT_TIMER_FREQ seconds beyond > > > the requested timeout due to internal implementation, in order to > > > mitigate overhead. > > > > > > Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx> > > > --- > > > fs/fuse/dev.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++ > > > fs/fuse/fuse_i.h | 22 +++++++++++++ > > > fs/fuse/inode.c | 23 ++++++++++++++ > > > 3 files changed, 128 insertions(+) > > > > > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c > > > index 27ccae63495d..e97ba860ffcd 100644 > > > --- a/fs/fuse/dev.c > > > +++ b/fs/fuse/dev.c > > > @@ -45,6 +45,85 @@ static struct fuse_dev *fuse_get_dev(struct file *file) > > > return READ_ONCE(file->private_data); > > > } > > > > > > +static bool request_expired(struct fuse_conn *fc, struct fuse_req *req) > > > +{ > > > + return time_is_before_jiffies(req->create_time + fc->timeout.req_timeout); > > > +} > > > + > > > +/* > > > + * Check if any requests aren't being completed by the time the request timeout > > > + * elapses. To do so, we: > > > + * - check the fiq pending list > > > + * - check the bg queue > > > + * - check the fpq io and processing lists > > > + * > > > + * To make this fast, we only check against the head request on each list since > > > + * these are generally queued in order of creation time (eg newer requests get > > > + * queued to the tail). We might miss a few edge cases (eg requests transitioning > > > + * between lists, re-sent requests at the head of the pending list having a > > > + * later creation time than other requests on that list, etc.) but that is fine > > > + * since if the request never gets fulfilled, it will eventually be caught. > > > + */ > > > +void fuse_check_timeout(struct work_struct *work) > > > +{ > > > + struct delayed_work *dwork = to_delayed_work(work); > > > + struct fuse_conn *fc = container_of(dwork, struct fuse_conn, > > > + timeout.work); > > > + struct fuse_iqueue *fiq = &fc->iq; > > > + struct fuse_req *req; > > > + struct fuse_dev *fud; > > > + struct fuse_pqueue *fpq; > > > + bool expired = false; > > > + int i; > > > + > > > + spin_lock(&fiq->lock); > > > + req = list_first_entry_or_null(&fiq->pending, struct fuse_req, list); > > > + if (req) > > > + expired = request_expired(fc, req); > > > + spin_unlock(&fiq->lock); > > > + if (expired) > > > + goto abort_conn; > > > + > > > + spin_lock(&fc->bg_lock); > > > + req = list_first_entry_or_null(&fc->bg_queue, struct fuse_req, list); > > > + if (req) > > > + expired = request_expired(fc, req); > > > + spin_unlock(&fc->bg_lock); > > > + if (expired) > > > + goto abort_conn; > > > + > > > + spin_lock(&fc->lock); > > > + if (!fc->connected) { > > > + spin_unlock(&fc->lock); > > > + return; > > > + } > > > + list_for_each_entry(fud, &fc->devices, entry) { > > > + fpq = &fud->pq; > > > + spin_lock(&fpq->lock); > > > > Can fuse_dev_release() run concurrently to this path here? > > If yes say fuse_dev_release() comes in first, grab the fpq->lock and > > splice the > > fpq->processing[i] list into &to_end and release the fpq->lock which > > unblock this > > path. > > > > Then here we start checking req off the fpq->processing[i] list which is > > getting evicted on the other side by fuse_dev_release->end_requests(&to_end); > > > > Maybe we need a cancel_delayed_work_sync() at the beginning of > > fuse_dev_release ? > > Yes, fuse_dev_release() can run concurrently to this path here. If > fuse_dev_release() comes in first, grabs the fpq->lock and splices the > fpq->processing[i] lists into &to_end, then releases the fpq->lock, > and then this fuse_check_timeout() grabs the fpq->lock, it'll see no > requests on the fpq->processing[i] lists. When the requests are > spliced onto the to_end list in fuse_dev_release(), they are removed > from the &fpq->processing[i] list. Yes, good point about list splice. After all, I realized that the same locking sequence is present in fuse_abort_conn() which is proven to work. ( otherwise we would have heard about race issues coming from fuse_dev_release() against concurrent fuse_conn_abort_write() ) > For that reason I don't think we need a cancel_delayed_work_sync() at > the beginning of fuse_dev_release(), but also a connection can have > multiple devs associated with it and the workqueue job is > per-connection and not per-device. Ok got it. Thanks, Etienne > > > Thanks, > Joanne > > > Thanks > > Etienne > > > > > + req = list_first_entry_or_null(&fpq->io, struct fuse_req, list); > > > + if (req && request_expired(fc, req)) > > > + goto fpq_abort; > > > + > > > + for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) { > > > + req = list_first_entry_or_null(&fpq->processing[i], struct fuse_req, list); > > > + if (req && request_expired(fc, req)) > > > + goto fpq_abort; > > > + } > > > + spin_unlock(&fpq->lock); > > > + } > > > + spin_unlock(&fc->lock); > > > + > > > + queue_delayed_work(system_wq, &fc->timeout.work, > > > + secs_to_jiffies(FUSE_TIMEOUT_TIMER_FREQ)); > > > + return; > > > + > > > +fpq_abort: > > > + spin_unlock(&fpq->lock); > > > + spin_unlock(&fc->lock); > > > +abort_conn: > > > + fuse_abort_conn(fc); > > > +} > > > +