Re: Lockup of (raid5 or raid6) + vdo after taking out a disk under load

Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> · Sat, 13 Jul 2024 19:06:10 +0800

Hi,

On 2024/07/12 20:11, Konstantin Kharlamov wrote:
Good news: your diff seems to have fixed the problem! I would have to
test more extensively in another environment to be completely sure, but
by following the minimal steps-to-reproduce I can no longer reproduce
the problem, so it seems to have fixed the problem.

That's good. :)

Bad news: there's a new lockup now 😄 This one seems to happen after
the disk is returned back; unless the action of returning back matches
accidentally the appearing stacktraces, which still might be possible
even though I re-tested multiple times. It's because the traces
(below) seem not to always appear. However, even when traces do not
appear, IO load on the fio that's running in the background drops to
zero, so something seems definitely wrong.

OK, I need to investigate this further. The call stack is not very
helpful.

First, can the problem be reproduced with raid1/raid10? If not, this
is probably a raid5 bug.

It would be best if I could reproduce this problem myself.
However, I don't understand step 4, turning off the JBOD slot's
power: is this only possible on a real machine, or can I do it in
my VM?

Thanks,
Kuai