Hi All,
Last week we've had the same issue again, but luckily we did some more
debugging actions when it occurred.
And it looks like a NetApp issue but also partly a kernel issue.
Let me describe:
Jan 23 13:36:57 srv001 kernel: sd 15:0:0:0: Capacity data has changed
Jan 23 13:37:57 srv001 kernel: sd 15:0:0:0: Power-on or device reset
occurred
We do a LUN resize, and exactly 1 minute later -> A 'Power-on or device
reset' event.
Now what it looks like, is that the SAN here did not send some
confirmation for a read/write, and the kernel tried to abort it, but failed.
So it ended up sending a Logical Unit Reset to recover.
-> So this seems to be clearly a SAN bug, as it should always confirm
read/write.
But then:
When the LUR was send by a host (20 hosts are connected to the same LUN
here), the following seems to happen:
- Client (linux host) sends a write x
- Client (linux host) sends a write y
- SAN respons with a check condition 0x29 (Power-on or device reset) on
write y
- Client (linux host) sends a NOP Out after 30 seconds
- NetApp responds with a NOP In
- Client sends an abort for write x after 1 minute (as it was still not
confirmed from the netapp side)
- NetApp responds with '0x01' (Task not in set)
- Client sends a LUR to the NetApp to reset again, as it still didn't
know what happend with write x and could not abort it
- The LUR completes, and causes the same issue again on other hosts.
As the NetApp seems to skip write confirmations during a reset, we end
up in a reset storm because write confirmations get skipped, and a new
reset is being send again.
And as this happens on all the 20 hosts, this causes an endless reset storm.
Now while I think the NetApp should never skip confirmations of
read/writes, I think the kernel should remove all non-confirmed
writes/reads on a LUR event?
This is what SAM-2
(https://www.cs.cmu.edu/afs/club/usr/jhutz/project/Archives/scsi/sam2r24.pdf)
specification tells about it:
To process a logical unit reset the logical unit shall:
a)Abort all tasks as described in 5.7;
If the kernel on all hosts would remove the unconfirmed writes/reads on
all hosts after a 0x29, it would never send an abort (which fails), and
then no more LUR's would be send by the kernel.
And everything would recover correctly after the first LUR.
It would be great if we could improve this :)
There is an issue open on NetApp side also for this!
Thanks
Jean-Louis
On 23/09/2020 11:17, Jean-Louis Dupond wrote:
Hi,
Last week some action that we do regularly caused some issues.
00:50:31 CEST -> We resized a iSCSI LUN on a SAN from 3TB -> 4TB.
The clients did detect the change fine, and resized it devices:
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: Capacity data has changed
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: Inquiry data has changed
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: supports
implicit TPGS
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: device
t10.NETAPP LUN 80Vcx]PVRq4F port group 3e9 rel port 8
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: Capacity data has changed
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: [sdf] 8589934592
512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: [sdf] 4096-byte
physical blocks
Sep 22 00:51:07 server001 kernel: sdf: detected capacity change from
3298534883328 to 4398046511104
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: port group 3e9
state A non-preferred supports TolUsNA
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: Inquiry data has changed
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: supports
implicit TPGS
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: device
t10.NETAPP LUN 80Vcx]PVRq4F port group 3e9 rel port 7
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: [sdi] 8589934592
512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: [sdi] 4096-byte
physical blocks
Sep 22 00:51:07 server001 kernel: sdi: detected capacity change from
3298534883328 to 4398046511104
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: port group 3e9
state A non-preferred supports TolUsNA
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: Capacity data has changed
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: Inquiry data has changed
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: supports
implicit TPGS
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: device
t10.NETAPP LUN 80Vcx]PVRq4F port group 3e8 rel port 6
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: [sdl] 8589934592
512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: [sdl] 4096-byte
physical blocks
Sep 22 00:51:12 server001 kernel: sdl: detected capacity change from
3298534883328 to 4398046511104
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: port group 3e8
state N non-preferred supports TolUsNA
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: Capacity data has changed
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: Inquiry data has changed
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: supports
implicit TPGS
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: device
t10.NETAPP LUN 80Vcx]PVRq4F port group 3e8 rel port 5
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: [sdc] 8589934592
512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: [sdc] 4096-byte
physical blocks
Sep 22 00:51:18 server001 kernel: sdc: detected capacity change from
3298534883328 to 4398046511104
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: port group 3e8
state N non-preferred supports TolUsNA
Sep 22 00:52:09 server001 kernel: sd 16:0:0:1: Power-on or device
reset occurred
Sep 22 00:52:09 server001 kernel: sd 16:0:0:1: alua: port group 3e9
state A non-preferred supports TolUsNA
Sep 22 00:52:09 server001 kernel: sd 17:0:0:1: Power-on or device
reset occurred
But then it kept doing resets:
Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: Power-on or device
reset occurred
Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: alua: port group 3e9
state A non-preferred supports TolUsNA
Sep 22 00:54:39 server001 kernel: sd 17:0:0:1: Power-on or device
reset occurred
Sep 22 00:54:39 server001 kernel: sd 17:0:0:1: alua: port group 3e9
state A non-preferred supports TolUsNA
Sep 22 00:54:42 server001 kernel: sd 15:0:0:1: Power-on or device
reset occurred
Sep 22 00:54:42 server001 kernel: sd 15:0:0:1: alua: port group 3e8
state N non-preferred supports TolUsNA
This caused some multipath failovers until it stopped after ~10 minutes.
We do use ALUA multipath:
3600a098038305663785d505652713446 dm-15 NETAPP,LUN C-Mode
size=4.0T features='3 queue_if_no_path pg_init_retries 50'
hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 16:0:0:1 sdf 8:80 active ready running
| `- 17:0:0:1 sdi 8:128 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
|- 15:0:0:1 sdc 8:32 active ready running
`- 18:0:0:1 sdl 8:176 active ready running
Who is sending the Power-on or device reset?
Is that the SAN?
Or does the client trigger a reset (for which reason then?)?
The LUN is attachted to multiple servers (all CentOS 8), and all
showed the same resets.
It would be nice to find out what caused this!
Thanks for having a look :)
Jean-Louis