Hi Song,

I'm wondering if you can help shed some light on a bug I'm trying to track down.

We're hitting the BUG_ON in handle_parity_checks5() that checks that R5_UPTODATE is set for a failed disk in a stripe[1]. We hit this in our test suite somewhat rarely when the journal is enabled, during device removal and recovery tests, and we've concocted a test that can hit it in under ten minutes.

After some debugging I've found that the stripe that hits the BUG_ON is hitting a conditional in handle_stripe_fill() for stripes that are in the journal with a failed disk[2]. This check was added in 2017 by your patch:

  07e83364845e ("md/r5cache: shift complex rmw from read path to write path")

A stripe that hits the bug has one injournal dev and one failed dev, and does not have STRIPE_R5C_CACHING set; it therefore hits the conditional and returns from handle_stripe_fill() without calling fetch_block() or doing anything else to change the flow of execution. Normally, fetch_block() would set STRIPE_COMPUTE_RUN to recompute the missing disk, but that gets skipped in this case.

After returning from handle_stripe_fill(), handle_stripe() will then call handle_parity_checks5() because STRIPE_COMPUTE_RUN was not set, and this immediately hits the BUG_ON, because nothing has computed the failed disk and set it R5_UPTODATE yet. (See the sketch of the code path below.)

I can't say I fully understand the patch that added this check, so I don't really understand why the conditional is there or what it's trying to accomplish, and thus I don't know what the correct solution might be.

Any thoughts?

Thanks,

Logan

[1] https://elixir.bootlin.com/linux/v6.0-rc1/source/drivers/md/raid5.c#L4381
[2] https://elixir.bootlin.com/linux/v6.0-rc1/source/drivers/md/raid5.c#L4050
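
P.S. For reference, the code path I'm describing looks roughly like this. This is a simplified paraphrase, not a verbatim quote; see [2] for the exact v6.0-rc1 source:

    static void handle_stripe_fill(struct stripe_head *sh,
                                   struct stripe_head_state *s,
                                   int disks)
    {
            int i;

            /* ... */

            if (s->injournal && s->failed) {
                    /* Degraded stripe with data in the journal:
                     * skip filling and let the stripe be written
                     * out to the raid disks first.
                     */
                    if (test_bit(STRIPE_R5C_CACHING, &sh->state))
                            r5c_make_stripe_write_out(sh);
                    goto out;
            }

            /* Normally we'd reach this loop and fetch_block() would
             * set STRIPE_COMPUTE_RUN to recompute the missing disk.
             */
            for (i = disks; i--; )
                    if (fetch_block(sh, s, i, disks))
                            break;
    out:
            set_bit(STRIPE_HANDLE, &sh->state);
    }

In our case STRIPE_R5C_CACHING is clear, so the r5c_make_stripe_write_out() call is skipped as well and nothing gets scheduled to bring the failed dev uptodate before handle_parity_checks5() runs.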