Re: RCU stalls and GPFs in ceph/netfs

Jeff Layton <jlayton@xxxxxxxxxx> · Sun, 28 Jul 2024 07:45:27 -0400

On Sun, 2024-07-28 at 12:49 +0200, Max Kellermann wrote:
> Hi David,
> 
> in the last few days, I have been chasing a 6.10 regression. After
> updating one of our servers from 6.9.10 to 6.10.1, I found various
> problems that may or may not be caused by the same code change
> (abbreviated):
> 
> [  108.043488] WARNING: CPU: 0 PID: 2860 at fs/ceph/caps.c:3386
> ceph_put_wrbuffer_cap_refs+0x1bb/0x1f0
> [  108.043498] Modules linked in:
> [  108.043500] CPU: 0 PID: 2860 Comm: rsync Not tainted 6.10.1-cm4all1-vm+ #117
> [  108.043505] RIP: 0010:ceph_put_wrbuffer_cap_refs+0x1bb/0x1f0
> [  108.043526] Call Trace:
> [  108.043573]  ? ceph_put_wrbuffer_cap_refs+0x27/0x1f0
> [  108.043575]  ceph_invalidate_folio+0x9a/0xc0
> [  108.043577]  truncate_cleanup_folio+0x52/0x90
> [  108.043583]  truncate_inode_pages_range+0xfe/0x400
> [  108.043585]  ? __rseq_handle_notify_resume+0x25b/0x480
> [  108.043589]  ? vfs_read+0x246/0x340
>
> [  108.043705] BUG: kernel NULL pointer dereference, address: 0000000000000356
> [  108.043948] #PF: supervisor write access in kernel mode
> [  108.044166] #PF: error_code(0x0002) - not-present page
> [  108.044341] PGD 0 P4D 0
> [  108.044465] Oops: Oops: 0002 [#1] SMP PTI
> [  108.048393] Call Trace:
> [  108.050002]  ? ceph_put_snap_context+0xf/0x30
> [  108.050178]  ceph_invalidate_folio+0xa2/0xc0
> [  108.050356]  truncate_cleanup_folio+0x52/0x90
> [  108.050532]  truncate_inode_pages_range+0xfe/0x400
> [  108.050711]  ? __rseq_handle_notify_resume+0x25b/0x480
> [  108.050896]  ? vfs_read+0x246/0x340
> 
> [  104.587469] Oops: general protection fault, probably for
> non-canonical address 0xe01ffbf110207cde: 0000 [#1] SMP KASAN PTI
> [  104.588429] KASAN: maybe wild-memory-access in range
> [0x00ffff888103e6f0-0x00ffff888103e6f7]
> [  104.588663] CPU: 6 PID: 2882 Comm: php-cgi8.1 Not tainted
> 6.10.1-cm4all1-vm+ #120
> [  104.591880] Call Trace:
> [  104.592015]  <TASK>
> [  104.592207]  ? die_addr+0x3c/0xa0
> [  104.592411]  ? exc_general_protection+0x113/0x200
> [  104.592650]  ? asm_exc_general_protection+0x22/0x30
> [  104.592876]  ? netfs_alloc_request+0x761/0xbd0
> [  104.593099]  ? netfs_read_folio+0x11f/0xad0
> [  104.593318]  ? netfs_read_folio+0xf6/0xad0
> [  104.593498]  ? filemap_get_read_batch+0x2dd/0x650
> [  104.593677]  ? __pfx_netfs_read_folio+0x10/0x10
> [  104.593854]  filemap_read_folio+0xb2/0x210
> [  104.594042]  ? __pfx_filemap_read_folio+0x10/0x10
> [  104.594226]  ? __pfx_stack_trace_save+0x10/0x10
> [  104.594417]  ? stack_depot_save_flags+0x24/0x690
> [  104.594616]  filemap_get_pages+0xaf8/0x1200
> [  104.594821]  ? __pfx_filemap_get_pages+0x10/0x10
> [  104.595012]  ? _raw_spin_lock+0x7a/0xd0
> 
>  rcu: INFO: rcu_sched self-detected stall on CPU
>  rcu:         7-....: (2099 ticks this GP)
> idle=4874/1/0x4000000000000000 softirq=1503/1503 fqs=1049
>  rcu:         (t=2100 jiffies g=4053 q=344 ncpus=16)
>  CPU: 7 PID: 3443 Comm: php-was Not tainted 6.9.0-vm+ #139
>  Call Trace:
>   <IRQ>
>   ? rcu_dump_cpu_stacks+0xed/0x170
>   ? rcu_sched_clock_irq+0x558/0xbc0
>   ? __smp_call_single_queue+0x8b/0xe0
>   ? update_process_times+0x69/0xa0
>   ? tick_nohz_handler+0x87/0x120
>   ? __pfx_tick_nohz_handler+0x10/0x10
>   ? __hrtimer_run_queues+0x110/0x250
>   ? hrtimer_interrupt+0xf6/0x230
>   ? __sysvec_apic_timer_interrupt+0x51/0x120
>   ? sysvec_apic_timer_interrupt+0x60/0x80
>   </IRQ>
>   <TASK>
>   ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>   ? __xas_next+0xc6/0xe0
>   filemap_get_read_batch+0x163/0x240
>   filemap_get_pages+0x9a/0x5c0
>   filemap_read+0xd9/0x310
>   ? __ceph_get_caps+0xd5/0x580
>   ? __ceph_caps_issued_mask+0x156/0x210
> 
> These servers have Ceph mounts with fscache.
> 
> At first, it looked like these bugs could be triggered easily and I
> did a bisect; however the WARNING in Ceph code went away, and going
> back further in the git history made reproducing the problem less
> likely. I tried KASAN but it didn't give any more information.
> 
> After a few dead ends (due to some false "good" commits), the bisect
> arrived at your commit 2e9d7e4b984a61 ("mm: Remove the PG_fscache
> alias for PG_private_2"). This commit easily reproduces the RCU stalls
> on my server. The preceding commit 2ff1e97587f4d3 ("netfs: Replace
> PG_fscache by setting folio->private and marking dirty") never did,
> not even after an hour of rebooting and retrying.
> 

That is really weird. AFAICT, 2e9d7e4b984a61 is just removing some
wrapper functions and changing the names of some others. There should
be no functional changes there. I'll keep looking but I don't see how
that commit would break anything.

> This is how the RCU stall looks like on 2e9d7e4b984a61:
> 
>  rcu: INFO: rcu_sched self-detected stall on CPU
>  rcu:         6-....: (46220 ticks this GP)
> idle=4bfc/1/0x4000000000000000 softirq=1594/1594 fqs=21190
>  rcu:         (t=46221 jiffies g=2577 q=7220 ncpus=16)
>  CPU: 6 PID: 3119 Comm: wordpress-manag Not tainted
> 6.9.0-rc6-vm-00004-gae678317b95e #160
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>  RIP: 0010:filemap_get_read_batch+0x7b/0x240
>  Code: 48 c7 44 24 30 00 00 00 00 48 89 e7 e8 ee d4 99 00 48 85 c0 0f
> 84 bf 00 00 00 48 89 c7 48 81 ff 06 04 00 00 0f 84 fd 00 00 00 <48> 81
> ff 02 04 00 00 0f 84 ca 00 00 00 48 3b 6c 24 08 0f 82 97 00
>  RSP: 0018:ffffbb2500fef8f8 EFLAGS: 00000296
>  RAX: ffffefa30509c740 RBX: ffffbb2500fefa50 RCX: 0000000000000000
>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffefa30509c740
>  RBP: 0000000000000003 R08: 0000000000000000 R09: ffffbb2500fefaec
>  R10: ffff9acb4a77ec80 R11: 0000000000000000 R12: ffffbb2500fefc58
>  R13: ffff9acb5c381d00 R14: 0000000000000000 R15: ffffbb2500fefc80
>  FS:  00007f5f6b6eeb80(0000) GS:ffff9ad64ef80000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00005623dea2b3d8 CR3: 000000011b658003 CR4: 00000000001706b0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  Call Trace:
>   <IRQ>
>   ? rcu_dump_cpu_stacks+0xed/0x170
>   ? rcu_sched_clock_irq+0x558/0xbc0
>   ? __smp_call_single_queue+0x8b/0xe0
>   ? update_process_times+0x69/0xa0
>   ? tick_nohz_handler+0x87/0x120
>   ? __pfx_tick_nohz_handler+0x10/0x10
>   ? __hrtimer_run_queues+0x110/0x250
>   ? hrtimer_interrupt+0xf6/0x230
>   ? __sysvec_apic_timer_interrupt+0x51/0x120
>   ? sysvec_apic_timer_interrupt+0x60/0x80
>   </IRQ>
>   <TASK>
>   ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>   ? filemap_get_read_batch+0x7b/0x240
>   ? filemap_get_read_batch+0x163/0x240
>   filemap_get_pages+0x9a/0x5c0
>   ? __call_rcu_common.constprop.0+0x87/0x220
>   filemap_read+0xd9/0x310
>   ? __ceph_get_caps+0xd5/0x580
>   ? terminate_walk+0x95/0x100
>   ? path_openat+0xca3/0xf10
>   ceph_read_iter+0x3e1/0x620
>   vfs_read+0x23a/0x330
>   ksys_read+0x63/0xe0
>   do_syscall_64+0x66/0x100
>   ? __alloc_pages+0x176/0x2e0
>   ? __count_memcg_events+0x4e/0xb0
>   ? __pte_offset_map_lock+0x60/0xe0
>   ? __mod_memcg_lruvec_state+0x89/0x110
>   ? __lruvec_stat_mod_folio+0x41/0x70
>   ? do_anonymous_page+0x6dc/0x840
>   ? __handle_mm_fault+0x936/0x1250
>   ? __count_memcg_events+0x4e/0xb0
>   ? handle_mm_fault+0xa2/0x2a0
>   ? do_user_addr_fault+0x308/0x5d0
>   ? exc_page_fault+0x62/0x120
>   ? irqentry_exit_to_user_mode+0x40/0xf0
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> When that happens, the process is at 100% CPU usage, but gdb/strace
> cannot attach, and /proc/PID/stack is empty. I tried "perf record" and
> found the process was busy-looping inside filemap_get_pages(), calling
> filemap_get_read_batch() over and over, saw some xas_ calls but
> nothing else.
> 
> Your commit 2e9d7e4b984a61 is too obscure for me, I don't know that
> part of the kernel, and I can't imagine how it can cause such a
> regression, but maybe you have an idea?
> 
> I can get you more information or try patches with more debugging code
> if you want.
> 

-- 
Jeff Layton <jlayton@xxxxxxxxxx>