On 2023-10-03 00:48, Song Liu wrote:
> CC Logan.
>
> On Mon, Oct 2, 2023 at 12:22 PM John Pittman <jpittman@xxxxxxxxxx> wrote:
>>
>> Thanks a lot David. Song, as a note, David's patch was tested by a
>> Red Hat customer and it indeed resolved their hit on the deadlock.
>> cc. Laurence Oberman who assisted on that case.
>>
>>
>> On Mon, Oct 2, 2023 at 2:39 PM David Jeffery <djeffery@xxxxxxxxxx> wrote:
>>>
>>> When raid5_get_active_stripe is called with a ctx containing a stripe_head in
>>> its batch_last pointer, it can cause a deadlock if the task sleeps waiting on
>>> another stripe_head to become available. The stripe_head held by batch_last
>>> can be blocking the advancement of other stripe_heads, leading to no
>>> stripe_heads being released so raid5_get_active_stripe waits forever.
>>>
>>> Like with the quiesce state handling earlier in the function, batch_last
>>> needs to be released by raid5_get_active_stripe before it waits for another
>>> stripe_head.
>>>
>>>
>>> Fixes: 3312e6c887fe ("md/raid5: Keep a reference to last stripe_head for batch")
>>> Signed-off-by: David Jeffery <djeffery@xxxxxxxxxx>

This makes sense to me. Nice catch on the difficult bug.

Reviewed-by: Logan Gunthorpe <logang@xxxxxxxxxxxx>

Logan
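
A minimal userspace sketch of the pattern the quoted commit message
describes: give up the reference held in batch_last before waiting for
another stripe_head. This is not the kernel code and not David's patch.
All toy_* names, the drop_batch_last switch and the would_sleep flag are
hypothetical, the model is single-threaded, and it simplifies the real
bug by letting the released stripe itself be the one that becomes
available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's stripe_head or ctx types. */
struct toy_stripe {
	int refcount;			/* 0 means free for reuse */
};

struct toy_pool {
	struct toy_stripe *stripes;
	size_t nr;
};

struct toy_ctx {
	struct toy_stripe *batch_last;	/* reference kept for batching */
};

/* Drop one reference on a stripe. */
static void toy_release(struct toy_stripe *sh)
{
	assert(sh->refcount > 0);
	sh->refcount--;
}

/* Return a free stripe with one reference taken, or NULL if none is free. */
static struct toy_stripe *toy_try_get(struct toy_pool *pool)
{
	for (size_t i = 0; i < pool->nr; i++) {
		if (pool->stripes[i].refcount == 0) {
			pool->stripes[i].refcount = 1;
			return &pool->stripes[i];
		}
	}
	return NULL;
}

/*
 * Get a stripe. If none is free, a real implementation would sleep here.
 * With drop_batch_last set, the reference in ctx->batch_last is released
 * first, so the caller is not itself what keeps the pool empty.
 * *would_sleep reports whether the caller would still have to block.
 */
static struct toy_stripe *toy_get_active(struct toy_pool *pool,
					 struct toy_ctx *ctx,
					 bool drop_batch_last,
					 bool *would_sleep)
{
	struct toy_stripe *sh = toy_try_get(pool);

	*would_sleep = false;
	if (sh)
		return sh;

	if (drop_batch_last && ctx && ctx->batch_last) {
		toy_release(ctx->batch_last);
		ctx->batch_last = NULL;
		sh = toy_try_get(pool);
		if (sh)
			return sh;
	}

	*would_sleep = true;	/* nothing free: the caller would block */
	return NULL;
}
```

With a pool of one stripe that is already held in ctx->batch_last,
calling toy_get_active() with drop_batch_last false reports that the
caller would sleep while still holding the only stripe. With
drop_batch_last true the same call hands the stripe back and clears
batch_last.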