Re: "Power-on or device reset occurred" after a LUN resize

Hi All,

Last week we had the same issue again, but luckily we did some more debugging when it occurred.
It looks like a NetApp issue, but also partly a kernel issue.

Let me describe:

Jan 23 13:36:57 srv001 kernel: sd 15:0:0:0: Capacity data has changed
Jan 23 13:37:57 srv001 kernel: sd 15:0:0:0: Power-on or device reset occurred
We did a LUN resize, and exactly 1 minute later a 'Power-on or device reset' event followed.
It looks like the SAN did not send a confirmation for a read/write, so the kernel tried to abort that command, but the abort failed.
So the kernel ended up sending a Logical Unit Reset (LUR) to recover.
-> This clearly seems to be a SAN bug, as it should always confirm a read/write.
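
To check the "exactly 1 minute" gap on other hosts, something like the following could be run over the kernel log. It is only a rough sketch: the regular expression assumes the classic syslog line layout shown above, and ASSUMED_YEAR is a placeholder, since these timestamps carry no year.

```python
import re
from datetime import datetime

# Placeholder only: classic syslog timestamps carry no year.
ASSUMED_YEAR = 2021

LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel: "
    r"sd (?P<hctl>\d+:\d+:\d+:\d+): (?P<msg>.*)$"
)


def parse(line):
    """Return (timestamp, H:C:T:L, message) for an sd kernel line, else None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    when = datetime.strptime(f"{ASSUMED_YEAR} {m['stamp']}", "%Y %b %d %H:%M:%S")
    return when, m["hctl"], m["msg"]


def resize_to_reset_delay(lines):
    """Seconds from a capacity change to the first reset after it, per device."""
    resized = {}
    delays = {}
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        when, hctl, msg = parsed
        if msg == "Capacity data has changed":
            resized[hctl] = when
        elif (msg == "Power-on or device reset occurred"
              and hctl in resized and hctl not in delays):
            delays[hctl] = (when - resized[hctl]).total_seconds()
    return delays


log = [
    "Jan 23 13:36:57 srv001 kernel: sd 15:0:0:0: Capacity data has changed",
    "Jan 23 13:37:57 srv001 kernel: sd 15:0:0:0: Power-on or device reset occurred",
]
delays = resize_to_reset_delay(log)
print(delays)  # {'15:0:0:0': 60.0}
```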

But then:
When a LUR is sent by one host (20 hosts are connected to the same LUN here), the following seems to happen:
- Client (linux host) sends a write x
- Client (linux host) sends a write y
- SAN responds with a check condition 0x29 (Power-on or device reset) on write y
- Client (linux host) sends a NOP Out after 30 seconds
- NetApp responds with a NOP In
- Client sends an abort for write x after 1 minute (as it was still not confirmed by the NetApp)
- NetApp responds with '0x01' (Task not in set)
- Client sends a LUR to the NetApp to reset again, as it still doesn't know what happened to write x and could not abort it
- The LUR completes, and causes the same issue again on other hosts.
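
The escalation in the steps above can be written down as a small decision function. This is a toy model of the sequence as I observed it, not the kernel's actual error handler; the names TmfResponse and recover are made up for the illustration.

```python
from enum import IntEnum


class TmfResponse(IntEnum):
    """Responses to an abort; 0x01 is the value seen in the trace above."""
    FUNCTION_COMPLETE = 0x00
    TASK_NOT_IN_SET = 0x01


def recover(command_confirmed, abort_response):
    """Return the recovery actions a host takes for one command (toy model)."""
    if command_confirmed:
        # The target answered the command, so nothing times out.
        return []
    actions = ["ABORT TASK"]
    if abort_response is TmfResponse.FUNCTION_COMPLETE:
        return actions
    # The target no longer knows the task, but the host never saw a
    # completion for it, so the abort counts as failed and the host escalates.
    actions.append("LOGICAL UNIT RESET")
    return actions


steps = recover(False, TmfResponse.TASK_NOT_IN_SET)
print(steps)  # ['ABORT TASK', 'LOGICAL UNIT RESET']
```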

As the NetApp seems to skip write confirmations during a reset, the hosts time out on those writes and send a new reset, during which confirmations get skipped again.
As this happens on all 20 hosts, it causes an endless reset storm.

Now, while I think the NetApp should never skip confirmations of reads/writes, I also think the kernel should drop all unconfirmed reads/writes on a LUR event. This is what the SAM-2 specification (https://www.cs.cmu.edu/afs/club/usr/jhutz/project/Archives/scsi/sam2r24.pdf) says about it:
To process a logical unit reset the logical unit shall:
a) Abort all tasks as described in 5.7;

If the kernel on every host dropped its unconfirmed reads/writes after a 0x29, it would never send an abort (which fails), and so no more LURs would be sent by the kernel.
Everything would then recover correctly after the first LUR.
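
To illustrate the difference, here is a rough round-based simulation. It is a toy model under my own assumptions (listed in the docstring), not derived from kernel or NetApp code; simulate and drop_on_reset_ua are hypothetical names.

```python
def simulate(num_hosts, rounds, drop_on_reset_ua):
    """Return how many hosts send a LUR in each command-timeout round.

    Toy assumptions (not taken from kernel or NetApp code):
      * the first round starts with one host waiting on a write that the
        target never confirmed;
      * every LUR makes the target silently drop the in-flight writes of
        all hosts and report 0x29 to each of them;
      * a host that times out on a dropped write sends an abort, which
        fails, and then escalates to a LUR.
    """
    waiting = [False] * num_hosts
    waiting[0] = True
    lurs_per_round = []
    for _ in range(rounds):
        senders = sum(waiting)
        lurs_per_round.append(senders)
        if senders == 0:
            continue
        if drop_on_reset_ua:
            # Proposed behaviour: on 0x29 each host drops its unconfirmed
            # I/O itself, so nothing is left to time out.
            waiting = [False] * num_hosts
        else:
            # Observed behaviour: every host keeps waiting on a dropped write.
            waiting = [True] * num_hosts
    return lurs_per_round


current = simulate(20, 4, drop_on_reset_ua=False)
proposed = simulate(20, 4, drop_on_reset_ua=True)
print(current)   # [1, 20, 20, 20]
print(proposed)  # [1, 0, 0, 0]
```

In this model the observed behaviour never settles, while the proposed behaviour stops after the first LUR.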

It would be great if we could improve this :)
There is an issue open on NetApp side also for this!

Thanks
Jean-Louis

On 23/09/2020 11:17, Jean-Louis Dupond wrote:
Hi,

Last week an action that we perform regularly caused some issues.

00:50:31 CEST -> We resized an iSCSI LUN on a SAN from 3TB -> 4TB.

The clients detected the change fine and resized their devices:

Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: Capacity data has changed
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: Inquiry data has changed
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: supports implicit TPGS
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: device t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e9 rel port 8
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: Capacity data has changed
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: [sdf] 8589934592 512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: [sdf] 4096-byte physical blocks
Sep 22 00:51:07 server001 kernel: sdf: detected capacity change from 3298534883328 to 4398046511104
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: port group 3e9 state A non-preferred supports TolUsNA
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: Inquiry data has changed
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: supports implicit TPGS
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: device t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e9 rel port 7
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: [sdi] 8589934592 512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: [sdi] 4096-byte physical blocks
Sep 22 00:51:07 server001 kernel: sdi: detected capacity change from 3298534883328 to 4398046511104
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: port group 3e9 state A non-preferred supports TolUsNA
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: Capacity data has changed
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: Inquiry data has changed
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: supports implicit TPGS
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: device t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e8 rel port 6
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: [sdl] 8589934592 512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: [sdl] 4096-byte physical blocks
Sep 22 00:51:12 server001 kernel: sdl: detected capacity change from 3298534883328 to 4398046511104
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: port group 3e8 state N non-preferred supports TolUsNA
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: Capacity data has changed
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: Inquiry data has changed
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: supports implicit TPGS
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: device t10.NETAPP   LUN 80Vcx]PVRq4F        port group 3e8 rel port 5
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: [sdc] 8589934592 512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: [sdc] 4096-byte physical blocks
Sep 22 00:51:18 server001 kernel: sdc: detected capacity change from 3298534883328 to 4398046511104
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: port group 3e8 state N non-preferred supports TolUsNA
Sep 22 00:52:09 server001 kernel: sd 16:0:0:1: Power-on or device reset occurred
Sep 22 00:52:09 server001 kernel: sd 16:0:0:1: alua: port group 3e9 state A non-preferred supports TolUsNA
Sep 22 00:52:09 server001 kernel: sd 17:0:0:1: Power-on or device reset occurred

But then it kept doing resets:
Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: Power-on or device reset occurred
Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: alua: port group 3e9 state A non-preferred supports TolUsNA
Sep 22 00:54:39 server001 kernel: sd 17:0:0:1: Power-on or device reset occurred
Sep 22 00:54:39 server001 kernel: sd 17:0:0:1: alua: port group 3e9 state A non-preferred supports TolUsNA
Sep 22 00:54:42 server001 kernel: sd 15:0:0:1: Power-on or device reset occurred
Sep 22 00:54:42 server001 kernel: sd 15:0:0:1: alua: port group 3e8 state N non-preferred supports TolUsNA
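
To see which paths are hit and how often, the reset messages can be counted per SCSI address. A minimal sketch, assuming the log line layout shown above; count_resets is just an illustrative helper and the sample only contains reset lines quoted in this mail.

```python
from collections import Counter

RESET_MSG = "Power-on or device reset occurred"


def count_resets(lines):
    """Count reset messages per H:C:T:L address."""
    counts = Counter()
    for line in lines:
        if not line.endswith(RESET_MSG):
            continue
        # "... kernel: sd 16:0:0:1: Power-on ..." -> "16:0:0:1"
        hctl = line.split(" kernel: sd ", 1)[1].split(": ", 1)[0]
        counts[hctl] += 1
    return counts


sample = [
    "Sep 22 00:52:09 server001 kernel: sd 16:0:0:1: Power-on or device reset occurred",
    "Sep 22 00:52:09 server001 kernel: sd 17:0:0:1: Power-on or device reset occurred",
    "Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: Power-on or device reset occurred",
    "Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: alua: port group 3e9 state A non-preferred supports TolUsNA",
    "Sep 22 00:54:39 server001 kernel: sd 17:0:0:1: Power-on or device reset occurred",
    "Sep 22 00:54:42 server001 kernel: sd 15:0:0:1: Power-on or device reset occurred",
]
resets = count_resets(sample)
print(dict(resets))  # {'16:0:0:1': 2, '17:0:0:1': 2, '15:0:0:1': 1}
```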

This caused some multipath failovers until it stopped after ~10 minutes.

We do use ALUA multipath:
3600a098038305663785d505652713446 dm-15 NETAPP,LUN C-Mode
size=4.0T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 16:0:0:1 sdf 8:80  active ready running
| `- 17:0:0:1 sdi 8:128 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 15:0:0:1 sdc 8:32  active ready running
  `- 18:0:0:1 sdl 8:176 active ready running


Who is sending the Power-on or device reset?
Is that the SAN?
Or does the client trigger a reset (and if so, for what reason)?
The LUN is attached to multiple servers (all CentOS 8), and all showed the same resets.

It would be nice to find out what caused this!

Thanks for having a look :)
Jean-Louis




