On Wed, Dec 11, 2024 at 5:04 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > On Wed, Dec 11, 2024 at 11:45 AM <etmartin4313@xxxxxxxxx> wrote: > > > > From: Etienne Martineau <etmartin4313@xxxxxxxxx> > > > > This patch abort connection if HUNG_TASK_PANIC is set and a FUSE server > > is getting stuck for too long. A slow FUSE server may tripped the > > hang check timer for legitimate reasons hence consider disabling > > HUNG_TASK_PANIC in that scenario. > > > > Without this patch, an unresponsive / buggy / malicious FUSE server can > > leave the clients in D state for a long period of time and on system where > > HUNG_TASK_PANIC is set, trigger a catastrophic reload. > > > > So, if HUNG_TASK_PANIC checking is enabled, we should wake up periodically > > to abort connections that exceed the timeout value which is define to be > > half the HUNG_TASK_TIMEOUT period, which keeps overhead low. The timer > > is per connection and runs only if there are active FUSE request pending. > > Hi Etienne, > > For your use case, does the generic request timeouts logic and > max_request_timeout systemctl implemented in [1] and [2] not suffice? > IMO I don't think we should have logic specifically checking for hung > task timeouts in fuse, if the generic solution can be used. > > Thanks, > Joanne We need a way to avoid catastrophic reloads on systems where HUNG_TASK_PANIC is set while a buggy / malicious FUSE server stops responding. I would argue that this is much needed in stable branches as well... For that reason, I believe we need to keep things simple for step #1 e.g. there is no need to introduce another knob as we already have HUNG_TASK_TIMEOUT which holds the source of truth. IMO introducing those new knobs will put an unnecessary burden on sysadmins into something that is error prone because unlike CONFIG_DETECT_HUNG_TASK=y CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120 which is built-in, the "default_request_timeout" / "max_request_timeout" needs to be set appropriately after every reboot and failure to do so may have nasty consequences. For the more generic cases then yes those timeouts make sense to me. Thanks, Etienne > > [1] https://lore.kernel.org/linux-fsdevel/20241114191332.669127-1-joannelkoong@xxxxxxxxx/ > [2] https://lore.kernel.org/linux-fsdevel/20241114191332.669127-4-joannelkoong@xxxxxxxxx/ > > > > > A FUSE client can get into D state as such ( see below scenario #1 / #2 ) > > 1) request_wait_answer() -> wait_event() is UNINTERRUPTIBLE > > OR > > 2) request_wait_answer() -> wait_event_(interruptible / killable) is head > > of line blocking for subsequent clients accessing the same file > > > > scenario #1: > > 2716 pts/2 D+ 0:00 cat > > $ cat /proc/2716/stack > > [<0>] request_wait_answer+0x22e/0x340 > > [<0>] __fuse_simple_request+0xd8/0x2c0 > > [<0>] fuse_perform_write+0x3ec/0x760 > > [<0>] fuse_file_write_iter+0x3d5/0x3f0 > > [<0>] vfs_write+0x313/0x430 > > [<0>] ksys_write+0x6a/0xf0 > > [<0>] __x64_sys_write+0x19/0x30 > > [<0>] x64_sys_call+0x18ab/0x26f0 > > [<0>] do_syscall_64+0x7c/0x170 > > [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e > > > > scenario #2: > > 2962 pts/2 S+ 0:00 cat > > 2963 pts/3 D+ 0:00 cat > > $ cat /proc/2962/stack > > [<0>] request_wait_answer+0x140/0x340 > > [<0>] __fuse_simple_request+0xd8/0x2c0 > > [<0>] fuse_perform_write+0x3ec/0x760 > > [<0>] fuse_file_write_iter+0x3d5/0x3f0 > > [<0>] vfs_write+0x313/0x430 > > [<0>] ksys_write+0x6a/0xf0 > > [<0>] __x64_sys_write+0x19/0x30 > > [<0>] x64_sys_call+0x18ab/0x26f0 > > [<0>] do_syscall_64+0x7c/0x170 > > [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e > > $ cat /proc/2963/stack > > [<0>] fuse_file_write_iter+0x252/0x3f0 > > [<0>] vfs_write+0x313/0x430 > > [<0>] ksys_write+0x6a/0xf0 > > [<0>] __x64_sys_write+0x19/0x30 > > [<0>] x64_sys_call+0x18ab/0x26f0 > > [<0>] do_syscall_64+0x7c/0x170 > > [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e > > > > Please note that this patch doesn't prevent the HUNG_TASK_WARNING. > > > > Signed-off-by: Etienne Martineau <etmartin4313@xxxxxxxxx> > > --- > > fs/fuse/dev.c | 100 +++++++++++++++++++++++++++++++++++ > > fs/fuse/fuse_i.h | 8 +++ > > fs/fuse/inode.c | 3 ++ > > include/linux/sched/sysctl.h | 8 ++- > > kernel/hung_task.c | 3 +- > > 5 files changed, 119 insertions(+), 3 deletions(-) > > > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c > > index 27ccae63495d..73d19de14e51 100644 > > --- a/fs/fuse/dev.c > > +++ b/fs/fuse/dev.c > > @@ -21,6 +21,8 @@ > > #include <linux/swap.h> > > #include <linux/splice.h> > > #include <linux/sched.h> > > +#include <linux/completion.h> > > +#include <linux/sched/sysctl.h> > > > > #define CREATE_TRACE_POINTS > > #include "fuse_trace.h" > > @@ -45,14 +47,104 @@ static struct fuse_dev *fuse_get_dev(struct file *file) > > return READ_ONCE(file->private_data); > > } > > > > +static bool request_expired(struct fuse_conn *fc, struct fuse_req *req, > > + int timeout) > > +{ > > + return time_after(jiffies, req->create_time + timeout); > > +} > > + > > +/* > > + * Prevent hung task timer from firing at us > > + * Check if any requests aren't being completed by the specified request > > + * timeout. To do so, we: > > + * - check the fiq pending list > > + * - check the bg queue > > + * - check the fpq io and processing lists > > + * > > + * To make this fast, we only check against the head request on each list since > > + * these are generally queued in order of creation time (eg newer requests get > > + * queued to the tail). We might miss a few edge cases (eg requests transitioning > > + * between lists, re-sent requests at the head of the pending list having a > > + * later creation time than other requests on that list, etc.) but that is fine > > + * since if the request never gets fulfilled, it will eventually be caught. > > + */ > > +void fuse_check_timeout(struct work_struct *wk) > > +{ > > + unsigned long hang_check_timer = sysctl_hung_task_timeout_secs * (HZ / 2); > > + struct fuse_conn *fc = container_of(wk, struct fuse_conn, work.work); > > + struct fuse_iqueue *fiq = &fc->iq; > > + struct fuse_req *req; > > + struct fuse_dev *fud; > > + struct fuse_pqueue *fpq; > > + bool expired = false; > > + int i; > > + > > + spin_lock(&fiq->lock); > > + req = list_first_entry_or_null(&fiq->pending, struct fuse_req, list); > > + if (req) > > + expired = request_expired(fc, req, hang_check_timer); > > + spin_unlock(&fiq->lock); > > + if (expired) > > + goto abort_conn; > > + > > + spin_lock(&fc->bg_lock); > > + req = list_first_entry_or_null(&fc->bg_queue, struct fuse_req, list); > > + if (req) > > + expired = request_expired(fc, req, hang_check_timer); > > + spin_unlock(&fc->bg_lock); > > + if (expired) > > + goto abort_conn; > > + > > + spin_lock(&fc->lock); > > + if (!fc->connected) { > > + spin_unlock(&fc->lock); > > + return; > > + } > > + list_for_each_entry(fud, &fc->devices, entry) { > > + fpq = &fud->pq; > > + spin_lock(&fpq->lock); > > + req = list_first_entry_or_null(&fpq->io, struct fuse_req, list); > > + if (req && request_expired(fc, req, hang_check_timer)) > > + goto fpq_abort; > > + > > + for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) { > > + req = list_first_entry_or_null(&fpq->processing[i], struct fuse_req, list); > > + if (req && request_expired(fc, req, hang_check_timer)) > > + goto fpq_abort; > > + } > > + spin_unlock(&fpq->lock); > > + } > > + /* Keep the ball rolling */ > > + if (atomic_read(&fc->num_waiting) != 0) > > + queue_delayed_work(system_wq, &fc->work, hang_check_timer); > > + spin_unlock(&fc->lock); > > + return; > > + > > +fpq_abort: > > + spin_unlock(&fpq->lock); > > + spin_unlock(&fc->lock); > > +abort_conn: > > + fuse_abort_conn(fc); > > +} > > + > > static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req) > > { > > + struct fuse_conn *fc = fm->fc; > > INIT_LIST_HEAD(&req->list); > > INIT_LIST_HEAD(&req->intr_entry); > > init_waitqueue_head(&req->waitq); > > refcount_set(&req->count, 1); > > __set_bit(FR_PENDING, &req->flags); > > req->fm = fm; > > + req->create_time = jiffies; > > + > > + if (sysctl_hung_task_panic) { > > + spin_lock(&fc->lock); > > + /* Get the ball rolling */ > > + queue_delayed_work(system_wq, &fc->work, > > + sysctl_hung_task_timeout_secs * (HZ / 2)); > > + spin_unlock(&fc->lock); > > + } > > } > > > > static struct fuse_req *fuse_request_alloc(struct fuse_mount *fm, gfp_t flags) > > @@ -200,6 +292,14 @@ static void fuse_put_request(struct fuse_req *req) > > fuse_drop_waiting(fc); > > } > > > > + if (sysctl_hung_task_panic) { > > + spin_lock(&fc->lock); > > + /* Stop the timeout check if we are the last request */ > > + if (atomic_read(&fc->num_waiting) == 0) > > + cancel_delayed_work_sync(&fc->work); > > + spin_unlock(&fc->lock); > > + } > > + > > fuse_request_free(req); > > } > > } > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > > index 74744c6f2860..aba3ffd0fb67 100644 > > --- a/fs/fuse/fuse_i.h > > +++ b/fs/fuse/fuse_i.h > > @@ -438,6 +438,9 @@ struct fuse_req { > > > > /** fuse_mount this request belongs to */ > > struct fuse_mount *fm; > > + > > + /** When (in jiffies) the request was created */ > > + unsigned long create_time; > > }; > > > > struct fuse_iqueue; > > @@ -923,6 +926,9 @@ struct fuse_conn { > > /** IDR for backing files ids */ > > struct idr backing_files_map; > > #endif > > + > > + /** Request wait timeout check */ > > + struct delayed_work work; > > }; > > > > /* > > @@ -1190,6 +1196,8 @@ void fuse_request_end(struct fuse_req *req); > > /* Abort all requests */ > > void fuse_abort_conn(struct fuse_conn *fc); > > void fuse_wait_aborted(struct fuse_conn *fc); > > +/* Connection timeout */ > > +void fuse_check_timeout(struct work_struct *wk); > > > > /** > > * Invalidate inode attributes > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c > > index 3ce4f4e81d09..ed96154df0fd 100644 > > --- a/fs/fuse/inode.c > > +++ b/fs/fuse/inode.c > > @@ -23,6 +23,7 @@ > > #include <linux/exportfs.h> > > #include <linux/posix_acl.h> > > #include <linux/pid_namespace.h> > > +#include <linux/completion.h> > > #include <uapi/linux/magic.h> > > > > MODULE_AUTHOR("Miklos Szeredi <miklos@xxxxxxxxxx>"); > > @@ -964,6 +965,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm, > > INIT_LIST_HEAD(&fc->entry); > > INIT_LIST_HEAD(&fc->devices); > > atomic_set(&fc->num_waiting, 0); > > + INIT_DELAYED_WORK(&fc->work, fuse_check_timeout); > > fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND; > > fc->congestion_threshold = FUSE_DEFAULT_CONGESTION_THRESHOLD; > > atomic64_set(&fc->khctr, 0); > > @@ -1015,6 +1017,7 @@ void fuse_conn_put(struct fuse_conn *fc) > > if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH)) > > fuse_backing_files_free(fc); > > call_rcu(&fc->rcu, delayed_release); > > + cancel_delayed_work_sync(&fc->work); > > } > > } > > EXPORT_SYMBOL_GPL(fuse_conn_put); > > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h > > index 5a64582b086b..1ed3a511060d 100644 > > --- a/include/linux/sched/sysctl.h > > +++ b/include/linux/sched/sysctl.h > > @@ -5,11 +5,15 @@ > > #include <linux/types.h> > > > > #ifdef CONFIG_DETECT_HUNG_TASK > > -/* used for hung_task and block/ */ > > +/* used for hung_task, block/ and fuse */ > > extern unsigned long sysctl_hung_task_timeout_secs; > > +extern unsigned int sysctl_hung_task_panic; > > #else > > /* Avoid need for ifdefs elsewhere in the code */ > > -enum { sysctl_hung_task_timeout_secs = 0 }; > > +enum { > > + sysctl_hung_task_timeout_secs = 0, > > + sysctl_hung_task_panic = 0, > > +}; > > #endif > > > > enum sched_tunable_scaling { > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c > > index c18717189f32..16602d3754b1 100644 > > --- a/kernel/hung_task.c > > +++ b/kernel/hung_task.c > > @@ -78,8 +78,9 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace; > > * Should we panic (and reboot, if panic_timeout= is set) when a > > * hung task is detected: > > */ > > -static unsigned int __read_mostly sysctl_hung_task_panic = > > +unsigned int __read_mostly sysctl_hung_task_panic = > > IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC); > > +EXPORT_SYMBOL_GPL(sysctl_hung_task_panic); > > > > static int > > hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr) > > -- > > 2.34.1 > >