Hi Guoqing,

Thanks for looking at this.

On Wed, Jun 16, 2021 at 11:57:33AM +0800, Guoqing Jiang wrote:
> The above looks like the bio for sb write was throttled by wbt, which caused
> the first calltrace.
> I am wondering if there were intensive IOs happened to the
> underlying device of md5, which triggered wbt to throttle sb
> write, or can you access the underlying device directly?

Next time it occurs I can check if I am able to read from the SSDs
that make up the MD device, if that information would be helpful.

I have never been able to replicate the problem in a test
environment so it is likely that it needs to be under heavy load for
it to happen.

> And there was a report [1] for raid5 which may related to wbt throttle as
> well, not sure if the
> change [2] could help or not.
>
> [1]. https://lore.kernel.org/linux-raid/d3fced3f-6c2b-5ffa-fd24-b24ec6e7d4be@xxxxxxxxxxxx/
> [2]. https://lore.kernel.org/linux-raid/cb0f312e-55dc-cdc4-5d2e-b9b415de617f@xxxxxxxxx/

All of my MD arrays tend to be RAID-1 or RAID-10, two devices, no
journal, internal bitmap. I see the reporter of this problem was
using RAID-6 with an external write journal. I can still build a
kernel with this patch and try it out, if you think it could
possibly help. The long time between incidents obviously makes
things extra challenging.

The next step I have taken is to put the buster-backports kernel
package (5.10.24-1~bpo10+1) on two test servers, and will also boot
the production hosts into this if they should experience the problem
again.

Thanks,
Andy