Re: KASAN: global-out-of-bounds Read in srcu_gp_start_if_needed

Joel Fernandes <joelagnelf@xxxxxxxxxx> · Tue, 4 Mar 2025 10:18:47 -0500

On 3/4/2025 10:11 AM, Steven Rostedt wrote:
> On Mon, 3 Mar 2025 22:57:32 -0500
> Joel Fernandes <joelagnelf@xxxxxxxxxx> wrote:
> 
>>>
>>> The lock taken is from the passed in rcu_pending pointer.
>>>   
>>>> [   92.322655][   T28]  rcu_pending_enqueue+0x686/0xd30
>>>> [   92.322676][   T28]  ? __pfx_rcu_pending_enqueue+0x10/0x10
>>>> [   92.322693][   T28]  ? trace_lock_release+0x11a/0x180
>>>> [   92.322708][   T28]  ? bkey_cached_free+0xa3/0x170
>>>> [   92.322725][   T28]  ? lock_release+0x13/0x180
>>>> [   92.322744][   T28]  ? bkey_cached_free+0xa3/0x170
>>>> [   92.322760][   T28]  bkey_cached_free+0xfd/0x170  
>>>
>>> Which has:
>>>
>>> static void bkey_cached_free(struct btree_key_cache *bc,
>>>                              struct bkey_cached *ck)
>>> {
>>>         kfree(ck->k);
>>>         ck->k           = NULL;
>>>         ck->u64s        = 0;
>>>                 
>>>         six_unlock_write(&ck->c.lock);
>>>         six_unlock_intent(&ck->c.lock);
>>>
>>>         bool pcpu_readers = ck->c.lock.readers != NULL;
>>>         rcu_pending_enqueue(&bc->pending[pcpu_readers], &ck->rcu);
>>>         this_cpu_inc(*bc->nr_pending);
>>> }
>>>
>>> So if that bc->pending[pcpu_readers] gets corrupted in any way, that could trigger this.
>>
>> True, another thing that could corrupt it is if the per-cpu global data
>> section is corrupted, because the crash is happening in this trylock per the
>> above stack:
>>
>>  srcu_gp_start_if_needed ->
>> 	spin_lock_irqsave_sdp_contention(sdp) ->
>> 		spin_trylock(sdp->lock)
>>
>> 	where sdp is ssp->sda and is allocated from per-cpu storage.
>>
>> So corruption of the per-cpu global data section can also trigger this, even
>> if the rcu_pending pointer is intact.
> 
> If there was corruption of the per-cpu section, you would think it would
> have a bigger impact than just this location. As most of the kernel relies
> on the per-cpu section.
> 
> But it could be corruption of the per-cpu variable that was used. Caused by
> the code that uses it.
> 
> That code is quite complex, and I usually try to rule out the code that is
> used in one location as being the issue before looking at something like
> per-cpu or RCU code which is used all over the place, and if that was
> buggy, it would likely blow up elsewhere outside of bcachefs.

Your strategy does make sense, as bugs are usually isolated. Though FWIW, we
are in a monolithic world, so that holds only for some definition of
"isolated" ;-)

> But who knows, perhaps bcachefs triggered a corner case?

Yeah, could be. Anyway, let's see if the complaint comes back. ;-)
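
For reference, the two theories above (a corrupted bc->pending[] entry versus
corrupted per-cpu data behind ssp->sda) can be pictured with a rough user-space
sketch of the pointer chain. Everything named model_* is hypothetical, the
srcu member inside the pending struct is an assumption about the layout rather
than something visible in the trace, and plain array indexing stands in for
the kernel's per-cpu accessors. It only shows that the lock address can end up
outside the expected area while the rcu_pending pointer itself is intact.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4 /* hypothetical CPU count for the model */

/* Stand-in for the per-cpu srcu_data (sdp); only the lock matters here. */
struct model_sdp { int lock; };

/* Stand-in for srcu_struct (ssp); sda points at the per-cpu storage. */
struct model_ssp { struct model_sdp *sda; };

/* Stand-in for struct rcu_pending (assumed to carry the srcu pointer). */
struct model_pending { struct model_ssp *srcu; };

/* Stand-in for btree_key_cache with its two pending entries. */
struct model_bc { struct model_pending pending[2]; };

/*
 * Backing store: the first MODEL_NR_CPUS entries play the role of the valid
 * per-cpu area, the rest is "some other global data" next to it.
 */
static struct model_sdp model_backing[2 * MODEL_NR_CPUS];

/* Follow the same chain as the trace: bc -> pending -> srcu -> sda -> lock. */
static inline int *model_lock_addr(struct model_bc *bc, bool pcpu_readers,
                                   int cpu)
{
        struct model_pending *p = &bc->pending[pcpu_readers];
        struct model_sdp *sdp = &p->srcu->sda[cpu];

        return &sdp->lock;
}

/* True if the lock lives inside the model's valid per-cpu area. */
static inline bool model_in_percpu_area(const int *lock)
{
        /* lock is the first member, so this cast recovers the sdp. */
        const struct model_sdp *sdp = (const struct model_sdp *)lock;

        return sdp >= model_backing && sdp < model_backing + MODEL_NR_CPUS;
}

/*
 * Returns 1 if the lock is in bounds with an intact chain and out of bounds
 * once only the sda pointer is corrupted; returns 0 otherwise.
 */
static inline int model_demo(void)
{
        struct model_ssp ssp = { .sda = model_backing };
        struct model_bc bc = { .pending = { { &ssp }, { &ssp } } };
        bool ok_before, ok_after;

        ok_before = model_in_percpu_area(model_lock_addr(&bc, true, 1));

        /* Corrupt only the per-cpu base; bc->pending[] stays untouched. */
        ssp.sda = model_backing + MODEL_NR_CPUS - 1;
        ok_after = model_in_percpu_area(model_lock_addr(&bc, true, 1));

        return ok_before && !ok_after;
}
```

Calling model_demo() from a small main() exercises both cases. In the model,
the same out-of-area lock address would also result from a bad
pending[].srcu pointer, which is why the trace alone does not tell the two
theories apart.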

 - Joel