Re: [PATCH RESEND v9 2/3] fuse: add optional kernel-enforced timeout for requests

Joanne Koong <joannelkoong@xxxxxxxxx> · Thu, 5 Dec 2024 16:06:37 -0800

On Tue, Dec 3, 2024 at 3:01 AM Sergey Senozhatsky
<senozhatsky@xxxxxxxxxxxx> wrote:
>
> On (24/11/14 11:13), Joanne Koong wrote:
> [..]
> > +static void fuse_init_fc_timeout(struct fuse_conn *fc, struct fuse_fs_context *ctx)
> > +{
> > +     if (ctx->req_timeout) {
> > +             if (check_mul_overflow(ctx->req_timeout * 60, HZ, &fc->timeout.req_timeout))
> > +                     fc->timeout.req_timeout = ULONG_MAX;
> > +             timer_setup(&fc->timeout.timer, fuse_check_timeout, 0);
> > +             mod_timer(&fc->timeout.timer, jiffies + FUSE_TIMEOUT_TIMER_FREQ);
>
> So I think this will require much bigger changes in the code.
> fuse_check_timeout() is executed from IRQ context and it takes
> the same locks that are acquired by preemptive task contexts.
> So all of those now need to disable local IRQs before they
> acquire: fc->bg_lock, fig->lock, fc->lock.  Otherwise we are
> in a deadlock scenario (in many places, unfortunately).
>
> A simple example:
>
> [   91.466408]   _raw_spin_lock+0x39/0x70
> [   91.466420]   fuse_simple_background+0x902/0x1130 [fuse]
> [   91.466453]   fuse_send_init+0x337/0x680 [fuse]
> [   91.466485]   fuse_fill_super+0x1c8/0x200 [fuse]
> [   91.466516]   get_tree_nodev+0xaf/0x140
> [   91.466527]   fuse_get_tree+0x27e/0x450 [fuse]
> [   91.466559]   vfs_get_tree+0x88/0x240
> [   91.466569]   path_mount+0xf26/0x1ed0
> [   91.466579]   __se_sys_mount+0x1c9/0x240
> [   91.466588]   do_syscall_64+0x6f/0xa0
> [   91.466598]   entry_SYSCALL_64_after_hwframe+0x73/0xdd
> [   91.466666]
>                other info that might help us debug this:
> [   91.466672]  Possible unsafe locking scenario:
>
> [   91.466679]        CPU0
> [   91.466684]        ----
> [   91.466689]   lock(&fiq->lock);
> [   91.466701]   <Interrupt>
> [   91.466707]     lock(&fiq->lock);
> [   91.466718]
>                 *** DEADLOCK ***
>
> [   91.466724] 4 locks held by runtime_probe/5043:
> [   91.466732]  #0: ffff888005812448 (sb_writers#3){.+.+}-{0:0}, at: filename_create+0xb2/0x320
> [   91.466762]  #1: ffff8881499ea3f0 (&type->i_mutex_dir_key#5/1){+.+.}-{3:3}, at: filename_create+0x14c/0x320
> [   91.466791]  #2: ffffffff9d864ce0 (rcu_read_lock){....}-{1:2}, at: security_sid_to_context_core+0xa4/0x4f0
> [   91.466817]  #3: ffff88815c009ec0 ((&fc->timeout.timer)){+.-.}-{0:0}, at: run_timer_softirq+0x702/0x1700
> [   91.466841]
>                stack backtrace:
> [   91.466850] CPU: 2 PID: 5043 Comm: runtime_probe Tainted: G     U             6.6.63-lockdep #1 f2b045305e587e03c4766ca818bb77742f953c87
> [   91.466864] Hardware name: HP Lantis/Lantis, BIOS Google_Lantis.13606.437.0 01/21/2022
> [   91.466872] Call Trace:
> [   91.466881]  <IRQ>
> [   91.466889]  dump_stack_lvl+0x6d/0xb0
> [   91.466904]  print_usage_bug+0x8a4/0xbb0
> [   91.466917]  mark_lock+0x13ca/0x1940
> [   91.466930]  __lock_acquire+0xc10/0x2670
> [   91.466944]  lock_acquire+0x119/0x3a0
> [   91.466955]  ? fuse_check_timeout+0x32/0x630 [fuse c290dfa1c114772028056af34b9662cba2d155b5]
> [   91.466992]  ? fuse_check_timeout+0x32/0x630 [fuse c290dfa1c114772028056af34b9662cba2d155b5]
> [   91.467025]  _raw_spin_lock+0x39/0x70
> [   91.467036]  ? fuse_check_timeout+0x32/0x630 [fuse c290dfa1c114772028056af34b9662cba2d155b5]
> [   91.467070]  fuse_check_timeout+0x32/0x630 [fuse c290dfa1c114772028056af34b9662cba2d155b5]
> [   91.467104]  ? run_timer_softirq+0x702/0x1700
> [   91.467114]  ? run_timer_softirq+0x702/0x1700
> [   91.467123]  ? __pfx_fuse_check_timeout+0x10/0x10 [fuse c290dfa1c114772028056af34b9662cba2d155b5]
> [   91.467156]  run_timer_softirq+0x763/0x1700
> [   91.467172]  irq_exit_rcu+0x300/0x7d0
> [   91.467183]  ? sysvec_apic_timer_interrupt+0x56/0x90
> [   91.467195]  sysvec_apic_timer_interrupt+0x56/0x90
> [   91.467205]  </IRQ>
>
> Do we want to run fuse-watchdog as a preemptive task instead of
> IRQ context?  Simlar to the way the hung-task watchdog runs, for
> example.  Yes, it now may starve and not get scheduled in corner
> cases (under some crazy pressure), but at least we don't need to
> patch the entire fuse code to use irq_safe/irq_restore locking
> variants.

Interesting! Thanks for noting this.

It looks like the choices we have here then are to either:

* have this run in a kthread like hung-task watchdog, as you mentioned above
* have all fuse code use irq_safe/irq_restore for locks
* use a separate spinlock/list for request timeouts as per the
implementation in v7 [1], though will need to benchmark this to make
sure performance doesn't degrade from lock contention if lots of
requests are submitted in parallel

Having 1 extra kthread per server as overhead doesn't seem too bad to
me. There's already an upper margin of imprecision with the timer
implementation, so I don't see the kthread scheduling imprecision as a
big deal.

I'll restructure this to use a kthread in v10. If anyone has thoughts
on a better or more preferred approach though, would love to hear
them.

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/20241007184258.2837492-3-joannelkoong@xxxxxxxxx/