Re: "Power-on or device reset occurred" after a LUN resize

Jean-Louis Dupond <jean-louis@xxxxxxxxx> · Mon, 30 Jan 2023 09:39:21 +0100

Hi All,

Last week we've had the same issue again, but luckily we did some more 
debugging actions when it occurred.
And it looks like a NetApp issue but also partly a kernel issue.

Let me describe:

Jan 23 13:36:57 srv001 kernel: sd 15:0:0:0: Capacity data has changed
Jan 23 13:37:57 srv001 kernel: sd 15:0:0:0: Power-on or device reset 
occurred
We do a LUN resize, and exactly 1 minute later -> A 'Power-on or device 
reset' event.
Now what it looks like, is that the SAN here did not send some 
confirmation for a read/write, and the kernel tried to abort it, but failed.
So it ended up sending a Logical Unit Reset to recover.
-> So this seems to be clearly a SAN bug, as it should always confirm 
read/write.

But then:
When the LUR was send by a host (20 hosts are connected to the same LUN 
here), the following seems to happen:
- Client (linux host) sends a write x
- Client (linux host) sends a write y
- SAN respons with a check condition 0x29 (Power-on or device reset) on 
write y
- Client (linux host) sends a NOP Out after 30 seconds
- NetApp responds with a NOP In
- Client sends an abort for write x after 1 minute (as it was still not 
confirmed from the netapp side)
- NetApp responds with '0x01' (Task not in set)
- Client sends a LUR to the NetApp to reset again, as it still didn't 
know what happend with write x and could not abort it
- The LUR completes, and causes the same issue again on other hosts.

As the NetApp seems to skip write confirmations during a reset, we end 
up in a reset storm because write confirmations get skipped, and a new 
reset is being send again.
And as this happens on all the 20 hosts, this causes an endless reset storm.

Now while I think the NetApp should never skip confirmations of 
read/writes, I think the kernel should remove all non-confirmed 
writes/reads on a LUR event?
This is what SAM-2 
(https://www.cs.cmu.edu/afs/club/usr/jhutz/project/Archives/scsi/sam2r24.pdf) 
specification tells about it:
To process a logical unit reset the logical unit shall:
a)Abort all tasks as described in 5.7;

If the kernel on all hosts would remove the unconfirmed writes/reads on 
all hosts after a 0x29, it would never send an abort (which fails), and 
then no more LUR's would be send by the kernel.
And everything would recover correctly after the first LUR.

It would be great if we could improve this :)
There is an issue open on NetApp side also for this!

Thanks
Jean-Louis

On 23/09/2020 11:17, Jean-Louis Dupond wrote:
Hi,

Last week some action that we do regularly caused some issues.

00:50:31 CEST -> We resized a iSCSI LUN on a SAN from 3TB -> 4TB.

The clients did detect the change fine, and resized it devices:

Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: Capacity data has changed
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: Inquiry data has changed
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: supports 
implicit TPGS
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: device 
t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e9 rel port 8
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: Capacity data has changed
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: [sdf] 8589934592 
512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: [sdf] 4096-byte 
physical blocks
Sep 22 00:51:07 server001 kernel: sdf: detected capacity change from 
3298534883328 to 4398046511104
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: port group 3e9 
state A non-preferred supports TolUsNA
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: Inquiry data has changed
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: supports 
implicit TPGS
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: device 
t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e9 rel port 7
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: [sdi] 8589934592 
512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: [sdi] 4096-byte 
physical blocks
Sep 22 00:51:07 server001 kernel: sdi: detected capacity change from 
3298534883328 to 4398046511104
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: port group 3e9 
state A non-preferred supports TolUsNA
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: Capacity data has changed
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: Inquiry data has changed
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: supports 
implicit TPGS
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: device 
t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e8 rel port 6
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: [sdl] 8589934592 
512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: [sdl] 4096-byte 
physical blocks
Sep 22 00:51:12 server001 kernel: sdl: detected capacity change from 
3298534883328 to 4398046511104
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: port group 3e8 
state N non-preferred supports TolUsNA
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: Capacity data has changed
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: Inquiry data has changed
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: supports 
implicit TPGS
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: device 
t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e8 rel port 5
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: [sdc] 8589934592 
512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: [sdc] 4096-byte 
physical blocks
Sep 22 00:51:18 server001 kernel: sdc: detected capacity change from 
3298534883328 to 4398046511104
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: port group 3e8 
state N non-preferred supports TolUsNA
Sep 22 00:52:09 server001 kernel: sd 16:0:0:1: Power-on or device 
reset occurred
Sep 22 00:52:09 server001 kernel: sd 16:0:0:1: alua: port group 3e9 
state A non-preferred supports TolUsNA
Sep 22 00:52:09 server001 kernel: sd 17:0:0:1: Power-on or device 
reset occurred

But then it kept doing resets:
Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: Power-on or device 
reset occurred
Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: alua: port group 3e9 
state A non-preferred supports TolUsNA
Sep 22 00:54:39 server001 kernel: sd 17:0:0:1: Power-on or device 
reset occurred
Sep 22 00:54:39 server001 kernel: sd 17:0:0:1: alua: port group 3e9 
state A non-preferred supports TolUsNA
Sep 22 00:54:42 server001 kernel: sd 15:0:0:1: Power-on or device 
reset occurred
Sep 22 00:54:42 server001 kernel: sd 15:0:0:1: alua: port group 3e8 
state N non-preferred supports TolUsNA

This caused some multipath failovers until it stopped after ~10 minutes.

We do use ALUA multipath:
3600a098038305663785d505652713446 dm-15 NETAPP,LUN C-Mode
size=4.0T features='3 queue_if_no_path pg_init_retries 50' 
hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 16:0:0:1 sdf 8:80  active ready running
| `- 17:0:0:1 sdi 8:128 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 15:0:0:1 sdc 8:32  active ready running
  `- 18:0:0:1 sdl 8:176 active ready running

Who is sending the Power-on or device reset?
Is that the SAN?
Or does the client trigger a reset (for which reason then?)?
The LUN is attachted to multiple servers (all CentOS 8), and all 
showed the same resets.

It would be nice to find out what caused this!

Thanks for having a look :)
Jean-Louis