On Tue, Dec 3, 2024 at 3:01 AM Sergey Senozhatsky <senozhatsky@xxxxxxxxxxxx> wrote:
>
> On (24/11/14 11:13), Joanne Koong wrote:
> [..]
> > +static void fuse_init_fc_timeout(struct fuse_conn *fc, struct fuse_fs_context *ctx)
> > +{
> > +	if (ctx->req_timeout) {
> > +		if (check_mul_overflow(ctx->req_timeout * 60, HZ, &fc->timeout.req_timeout))
> > +			fc->timeout.req_timeout = ULONG_MAX;
> > +		timer_setup(&fc->timeout.timer, fuse_check_timeout, 0);
> > +		mod_timer(&fc->timeout.timer, jiffies + FUSE_TIMEOUT_TIMER_FREQ);
>
> So I think this will require much bigger changes in the code.
> fuse_check_timeout() is executed from IRQ context and it takes
> the same locks that are acquired by preemptive task contexts.
> So all of those now need to disable local IRQs before they
> acquire: fc->bg_lock, fiq->lock, fc->lock. Otherwise we are
> in a deadlock scenario (in many places, unfortunately).
>
> A simple example:
>
> [ 91.466408] _raw_spin_lock+0x39/0x70
> [ 91.466420] fuse_simple_background+0x902/0x1130 [fuse]
> [ 91.466453] fuse_send_init+0x337/0x680 [fuse]
> [ 91.466485] fuse_fill_super+0x1c8/0x200 [fuse]
> [ 91.466516] get_tree_nodev+0xaf/0x140
> [ 91.466527] fuse_get_tree+0x27e/0x450 [fuse]
> [ 91.466559] vfs_get_tree+0x88/0x240
> [ 91.466569] path_mount+0xf26/0x1ed0
> [ 91.466579] __se_sys_mount+0x1c9/0x240
> [ 91.466588] do_syscall_64+0x6f/0xa0
> [ 91.466598] entry_SYSCALL_64_after_hwframe+0x73/0xdd
> [ 91.466666]
> other info that might help us debug this:
> [ 91.466672] Possible unsafe locking scenario:
>
> [ 91.466679]        CPU0
> [ 91.466684]        ----
> [ 91.466689]   lock(&fiq->lock);
> [ 91.466701]   <Interrupt>
> [ 91.466707]     lock(&fiq->lock);
> [ 91.466718]
> *** DEADLOCK ***
>
> [ 91.466724] 4 locks held by runtime_probe/5043:
> [ 91.466732] #0: ffff888005812448 (sb_writers#3){.+.+}-{0:0}, at: filename_create+0xb2/0x320
> [ 91.466762] #1: ffff8881499ea3f0 (&type->i_mutex_dir_key#5/1){+.+.}-{3:3}, at: filename_create+0x14c/0x320
> [ 91.466791] #2: ffffffff9d864ce0 (rcu_read_lock){....}-{1:2}, at: security_sid_to_context_core+0xa4/0x4f0
> [ 91.466817] #3: ffff88815c009ec0 ((&fc->timeout.timer)){+.-.}-{0:0}, at: run_timer_softirq+0x702/0x1700
> [ 91.466841]
> stack backtrace:
> [ 91.466850] CPU: 2 PID: 5043 Comm: runtime_probe Tainted: G U 6.6.63-lockdep #1 f2b045305e587e03c4766ca818bb77742f953c87
> [ 91.466864] Hardware name: HP Lantis/Lantis, BIOS Google_Lantis.13606.437.0 01/21/2022
> [ 91.466872] Call Trace:
> [ 91.466881] <IRQ>
> [ 91.466889] dump_stack_lvl+0x6d/0xb0
> [ 91.466904] print_usage_bug+0x8a4/0xbb0
> [ 91.466917] mark_lock+0x13ca/0x1940
> [ 91.466930] __lock_acquire+0xc10/0x2670
> [ 91.466944] lock_acquire+0x119/0x3a0
> [ 91.466955] ? fuse_check_timeout+0x32/0x630 [fuse c290dfa1c114772028056af34b9662cba2d155b5]
> [ 91.466992] ? fuse_check_timeout+0x32/0x630 [fuse c290dfa1c114772028056af34b9662cba2d155b5]
> [ 91.467025] _raw_spin_lock+0x39/0x70
> [ 91.467036] ? fuse_check_timeout+0x32/0x630 [fuse c290dfa1c114772028056af34b9662cba2d155b5]
> [ 91.467070] fuse_check_timeout+0x32/0x630 [fuse c290dfa1c114772028056af34b9662cba2d155b5]
> [ 91.467104] ? run_timer_softirq+0x702/0x1700
> [ 91.467114] ? run_timer_softirq+0x702/0x1700
> [ 91.467123] ? __pfx_fuse_check_timeout+0x10/0x10 [fuse c290dfa1c114772028056af34b9662cba2d155b5]
> [ 91.467156] run_timer_softirq+0x763/0x1700
> [ 91.467172] irq_exit_rcu+0x300/0x7d0
> [ 91.467183] ? sysvec_apic_timer_interrupt+0x56/0x90
> [ 91.467195] sysvec_apic_timer_interrupt+0x56/0x90
> [ 91.467205] </IRQ>
>
> Do we want to run fuse-watchdog as a preemptive task instead of
> IRQ context? Similar to the way the hung-task watchdog runs, for
> example. Yes, it now may starve and not get scheduled in corner
> cases (under some crazy pressure), but at least we don't need to
> patch the entire fuse code to use irq_safe/irq_restore locking
> variants.

Interesting! Thanks for noting this.
It looks like the choices we have here then are to either:

* have this run in a kthread like the hung-task watchdog, as you mentioned above
* have all fuse code use the irq_save/irq_restore variants for these locks
* use a separate spinlock/list for request timeouts as per the implementation in v7 [1], though that will need benchmarking to make sure performance doesn't degrade from lock contention when lots of requests are submitted in parallel

Having one extra kthread per server as overhead doesn't seem too bad to me. There's already an upper margin of imprecision with the timer implementation, so I don't see the kthread scheduling imprecision as a big deal. I'll restructure this to use a kthread in v10. If anyone has thoughts on a better or more preferred approach, though, I'd love to hear them.

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/20241007184258.2837492-3-joannelkoong@xxxxxxxxx/