On Sat, 2024-07-13 at 19:06 +0800, Yu Kuai wrote:
> Hi,
>
> On 2024/07/12 20:11, Konstantin Kharlamov wrote:
> > Good news: your diff seems to have fixed the problem! I would have to
> > test more extensively in another environment to be completely sure, but
> > by following the minimal steps-to-reproduce I can no longer reproduce
> > the problem, so it seems to have fixed the problem.
>
> That's good. :)
> >
> > Bad news: there's a new lockup now 😄 This one seems to happen after
> > the disk is returned back; unless the action of returning back matches
> > accidentally the appearing stacktraces, which still might be possible
> > even though I re-tested multiple times. It's because the traces
> > (below) seem not to always appear. However, even when traces do not
> > appear, IO load on the fio that's running in the background drops to
> > zero, so something seems definitely wrong.
>
> Ok, I need to investigate more for this. The call stack is not much
> helpful.

Is it not helpful because of the missing line numbers, or in general? If
it's the missing line numbers, I'll try to fix that. We're using some
Debian scripts that create deb packages, and they don't work well with
debug information (it is put into a separate package, but even when that
package is installed the kernel traces still don't have line numbers). I
haven't looked into it, but I can if that will help.

> At first, can the problem reproduce with raid1/raid10? If not, this is
> probably a raid5 bug.

This is not reproducible with raid1 (i.e. no lockups for raid1), I
tested that. I didn't test raid10; if you want I can try (but probably
only after the weekend, because today I was asked to give the nodes away
to someone else, for the weekend at least).

> The best will be that if I can reproduce this problem myself.
> The problem is that I don't understand the step 4: turning off jbod
> slot's power, is this only possible for a real machine, or can I do
> this in my VM?

Well, let's say that if it is possible, I don't know a way to do that.
The `sg_ses` commands that I used

    sg_ses --dev-slot-num=9 --set=3:4:1 /dev/sg26    # turning off
    sg_ses --dev-slot-num=9 --clear=3:4:1 /dev/sg26  # turning on

…set and clear the value of the 3:4:1 bit, where the bit is defined by
the JBOD manufacturer's datasheet. The 3:4:1 bit specifically is defined
by the manufacturer "AIC". That means the commands as they are will
probably not work on different hardware.

While we're at it, do you have any thoughts on why just using
`echo 1 > /sys/block/sdX/device/delete` doesn't reproduce it? Perhaps
the kernel doesn't emulate device disappearance too well?
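
In case it is useful for scripting the reproduction, the two `sg_ses` commands above could be wrapped roughly like this. It is only a sketch: the `run` and `slot_power` helpers and the `DRY_RUN` switch are names made up for illustration, and slot 9, /dev/sg26 and the 3:4:1 bit are the values from my AIC enclosure, so they would have to be adjusted for any other hardware. By default it only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper around the two sg_ses commands from this thread.
# SLOT, SES_DEV and BIT are the values for one specific AIC JBOD;
# they are assumptions and will differ on other enclosures.
SLOT=${SLOT:-9}
SES_DEV=${SES_DEV:-/dev/sg26}
BIT=${BIT:-3:4:1}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the command, 0 = really run it

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# slot_power off -> set the bit (slot power off)
# slot_power on  -> clear the bit (slot power back on)
slot_power() {
    case "$1" in
        off) run sg_ses --dev-slot-num="$SLOT" --set="$BIT" "$SES_DEV" ;;
        on)  run sg_ses --dev-slot-num="$SLOT" --clear="$BIT" "$SES_DEV" ;;
        *)   echo "usage: slot_power on|off" >&2; return 1 ;;
    esac
}
```

With `DRY_RUN=1`, calling `slot_power off` just echoes the command line, which makes it easy to double-check the slot and device before switching to `DRY_RUN=0`.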