On 3/4/2025 10:11 AM, Steven Rostedt wrote:
> On Mon, 3 Mar 2025 22:57:32 -0500
> Joel Fernandes <joelagnelf@xxxxxxxxxx> wrote:
>
>>>
>>> The lock taken is from the passed in rcu_pending pointer.
>>>
>>>> [ 92.322655][ T28] rcu_pending_enqueue+0x686/0xd30
>>>> [ 92.322676][ T28] ? __pfx_rcu_pending_enqueue+0x10/0x10
>>>> [ 92.322693][ T28] ? trace_lock_release+0x11a/0x180
>>>> [ 92.322708][ T28] ? bkey_cached_free+0xa3/0x170
>>>> [ 92.322725][ T28] ? lock_release+0x13/0x180
>>>> [ 92.322744][ T28] ? bkey_cached_free+0xa3/0x170
>>>> [ 92.322760][ T28] bkey_cached_free+0xfd/0x170
>>>
>>> Which has:
>>>
>>> static void bkey_cached_free(struct btree_key_cache *bc,
>>>                              struct bkey_cached *ck)
>>> {
>>>         kfree(ck->k);
>>>         ck->k = NULL;
>>>         ck->u64s = 0;
>>>
>>>         six_unlock_write(&ck->c.lock);
>>>         six_unlock_intent(&ck->c.lock);
>>>
>>>         bool pcpu_readers = ck->c.lock.readers != NULL;
>>>         rcu_pending_enqueue(&bc->pending[pcpu_readers], &ck->rcu);
>>>         this_cpu_inc(*bc->nr_pending);
>>> }
>>>
>>> So if that bc->pending[pcpu_readers] gets corrupted in any way, that could trigger this.
>>
>> True, another thing that could corrupt it is if per-cpu global data
>> section is corrupted, because the crash is happening in this trylock per the
>> above stack:
>>
>> srcu_gp_start_if_needed ->
>>   spin_lock_irqsave_sdp_contention(sdp) ->
>>     spin_trylock(sdp->lock)
>>
>> where sdp is ssp->sda and is allocated from per-cpu storage.
>>
>> So corruption of the per-cpu global data section can also trigger this, even
>> if the rcu_pending pointer is intact.
>
> If there was corruption of the per-cpu section, you would think it would
> have a bigger impact than just this location. As most of the kernel relies
> on the per-cpu section.
>
> But it could be corruption of the per-cpu variable that was used. Caused by
> the code that uses it.
>
> That code is quite complex, and I usually try to rule out the code that is
> used in one location as being the issue before looking at something like
> per-cpu or RCU code which is used all over the place, and if that was
> buggy, it would likely blow up elsewhere outside of bcachefs.

Your strategy does make sense, as bugs are usually isolated, though FWIW, we
are in a monolithic world, which leads to some definition of "isolated" ;-)

> But who knows, perhaps the bcachefs triggered a corner case?

Yeah, could be. Anyway, let's see if the complaint comes back. ;-)

 - Joel