Re: Lockup of (raid5 or raid6) + vdo after taking out a disk under load

Konstantin Kharlamov <Hi-Angel@xxxxxxxxx> · Sat, 13 Jul 2024 16:50:21 +0300

On Sat, 2024-07-13 at 19:06 +0800, Yu Kuai wrote:
> Hi,
> 
> On 2024/07/12 20:11, Konstantin Kharlamov wrote:
> > Good news: your diff seems to have fixed the problem! I would have to
> > test more extensively in another environment to be completely sure, but
> > by following the minimal steps-to-reproduce I can no longer reproduce
> > the problem, so it seems to be fixed.
> 
> That's good. :)
> > 
> > Bad news: there's a new lockup now 😄 This one seems to happen after
> > the disk is returned; unless returning the disk merely coincides with
> > the stack traces appearing, which is still possible even though I
> > re-tested multiple times, because the traces (below) do not always
> > appear. However, even when no traces appear, the IO load of the fio
> > that's running in the background drops to zero, so something
> > definitely seems wrong.
> 
> Ok, I need to investigate this more. The call stack is not very
> helpful.

Is it not helpful because of the missing line numbers, or in general? If
it's the missing line numbers, I'll try to fix that. We're using some
Debian scripts that create deb packages, and, well, they don't work well
with debug information (it's put into a separate package, but even when
that package is installed the kernel traces still don't have line
numbers). I haven't looked into it, but I can if that will help.

> First, can the problem be reproduced with raid1/raid10? If not, this is
> probably a raid5 bug.

This is not reproducible with raid1 (i.e. no lockups with raid1); I
tested that. I didn't test raid10; if you want, I can try (but probably
only after the weekend, because today I was asked to give the nodes
away, for the weekend at least, to someone else).

> It would be best if I could reproduce this problem myself.
> The problem is that I don't understand step 4: turning off the jbod
> slot's power. Is this only possible on a real machine, or can I do
> this in my VM?

Well, let's say that if it is possible, I don't know a way to do that.
The `sg_ses` commands that I used

	sg_ses --dev-slot-num=9 --set=3:4:1   /dev/sg26 # turning off
	sg_ses --dev-slot-num=9 --clear=3:4:1 /dev/sg26 # turning on

…set and clear the 3:4:1 bit, whose meaning is defined by the JBOD
manufacturer's datasheet. This particular 3:4:1 value is the one defined
by the manufacturer "AIC", which means the commands as written are
unlikely to work on different hardware.
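
For reference, here is a rough sketch of a wrapper around those two
commands, in case it helps to script the power cycle. The slot number,
/dev/sg26 and the 3:4:1 field are from my setup only; `slot_power` and
`DRY_RUN` are names made up for illustration and are not part of
sg3_utils.

```shell
#!/bin/sh
# Sketch: switch one JBOD slot's power via SES.
# ASSUMPTION: 3:4:1 is the power-off bit on this particular AIC enclosure;
# check the datasheet of your own enclosure before using it.
# slot_power and DRY_RUN are hypothetical helper names.

slot_power() {
    # usage: slot_power on|off SLOT SG_DEVICE
    action=$1 slot=$2 dev=$3
    case "$action" in
        off) flag="--set=3:4:1" ;;    # setting the bit turns the slot off
        on)  flag="--clear=3:4:1" ;;  # clearing the bit turns it back on
        *)   echo "usage: slot_power on|off SLOT SG_DEVICE" >&2; return 2 ;;
    esac
    if [ -n "${DRY_RUN:-}" ]; then
        # Only print the command, do not touch the hardware.
        echo "sg_ses --dev-slot-num=$slot $flag $dev"
    else
        sg_ses --dev-slot-num="$slot" "$flag" "$dev"
    fi
}
```

With `DRY_RUN=1` set, `slot_power off 9 /dev/sg26` only prints the
command it would run, which is a safe way to check the arguments first.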

Well, while we're at it, do you have any thoughts on why just running
`echo 1 > /sys/block/sdX/device/delete` doesn't reproduce it? Perhaps the
kernel doesn't emulate device disappearance well enough?