Hi,
Last week a routine operation caused some problems.
At 00:50:31 CEST we resized an iSCSI LUN on a SAN from 3 TiB to 4 TiB.
The clients detected the change fine and resized their devices:
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: Capacity data has changed
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: Inquiry data has changed
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: supports implicit TPGS
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: device t10.NETAPP LUN 80Vcx]PVRq4F port group 3e9 rel port 8
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: Capacity data has changed
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: [sdf] 8589934592 512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: [sdf] 4096-byte physical blocks
Sep 22 00:51:07 server001 kernel: sdf: detected capacity change from 3298534883328 to 4398046511104
Sep 22 00:51:07 server001 kernel: sd 16:0:0:1: alua: port group 3e9 state A non-preferred supports TolUsNA
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: Inquiry data has changed
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: supports implicit TPGS
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: device t10.NETAPP LUN 80Vcx]PVRq4F port group 3e9 rel port 7
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: [sdi] 8589934592 512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: [sdi] 4096-byte physical blocks
Sep 22 00:51:07 server001 kernel: sdi: detected capacity change from 3298534883328 to 4398046511104
Sep 22 00:51:07 server001 kernel: sd 17:0:0:1: alua: port group 3e9 state A non-preferred supports TolUsNA
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: Capacity data has changed
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: Inquiry data has changed
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: supports implicit TPGS
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: device t10.NETAPP LUN 80Vcx]PVRq4F port group 3e8 rel port 6
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: [sdl] 8589934592 512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: [sdl] 4096-byte physical blocks
Sep 22 00:51:12 server001 kernel: sdl: detected capacity change from 3298534883328 to 4398046511104
Sep 22 00:51:12 server001 kernel: sd 18:0:0:1: alua: port group 3e8 state N non-preferred supports TolUsNA
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: Capacity data has changed
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: Inquiry data has changed
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: supports implicit TPGS
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: device t10.NETAPP LUN 80Vcx]PVRq4F port group 3e8 rel port 5
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: [sdc] 8589934592 512-byte logical blocks: (4.40 TB/4.00 TiB)
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: [sdc] 4096-byte physical blocks
Sep 22 00:51:18 server001 kernel: sdc: detected capacity change from 3298534883328 to 4398046511104
Sep 22 00:51:18 server001 kernel: sd 15:0:0:1: alua: port group 3e8 state N non-preferred supports TolUsNA
Sep 22 00:52:09 server001 kernel: sd 16:0:0:1: Power-on or device reset occurred
Sep 22 00:52:09 server001 kernel: sd 16:0:0:1: alua: port group 3e9 state A non-preferred supports TolUsNA
Sep 22 00:52:09 server001 kernel: sd 17:0:0:1: Power-on or device reset occurred
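As a quick sanity check on the sizes in the log above, here is a minimal Python sketch that confirms the block count and the two byte counts the kernel printed agree with each other. The numbers are copied from the log; the names (SECTOR_BYTES, to_tib, to_tb) are just illustrative.

```python
# Values copied from the kernel log above.
SECTOR_BYTES = 512             # "512-byte logical blocks"
LOGICAL_BLOCKS = 8589934592    # block count reported for sdf/sdi/sdl/sdc
OLD_BYTES = 3298534883328      # "detected capacity change from ..."
NEW_BYTES = 4398046511104      # "... to ..."


def to_tib(n_bytes):
    """Convert bytes to binary terabytes (TiB, 2**40 bytes)."""
    return n_bytes / 2**40


def to_tb(n_bytes):
    """Convert bytes to decimal terabytes (TB, 10**12 bytes)."""
    return n_bytes / 10**12


if __name__ == "__main__":
    # The block count times the logical block size should equal the new size.
    print("blocks * sector size matches new size:",
          LOGICAL_BLOCKS * SECTOR_BYTES == NEW_BYTES)
    print("old size in TiB:", to_tib(OLD_BYTES))
    print("new size in TiB:", to_tib(NEW_BYTES))
    print("new size in TB :", round(to_tb(NEW_BYTES), 2))
```

So the resize itself looks consistent on every path: the old size is exactly 3 TiB and the new size exactly 4 TiB (about 4.40 TB in decimal units).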
But after that, the devices kept reporting resets:
Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: Power-on or device reset occurred
Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: alua: port group 3e9 state A non-preferred supports TolUsNA
Sep 22 00:54:39 server001 kernel: sd 17:0:0:1: Power-on or device reset occurred
Sep 22 00:54:39 server001 kernel: sd 17:0:0:1: alua: port group 3e9 state A non-preferred supports TolUsNA
Sep 22 00:54:42 server001 kernel: sd 15:0:0:1: Power-on or device reset occurred
Sep 22 00:54:42 server001 kernel: sd 15:0:0:1: alua: port group 3e8 state N non-preferred supports TolUsNA
This caused several multipath failovers until the resets stopped after about 10 minutes.
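In case it helps anyone compare hosts, here is a minimal sketch for tallying the reset messages per SCSI path. It assumes the kernel log has been saved with one message per line (not wrapped as in a mail client) and that the lines follow the syslog layout shown above. The function name count_resets and the sample list are only for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: Power-on or device reset occurred
# and captures the host:channel:target:lun address of the path.
RESET_RE = re.compile(
    r"kernel: sd (?P<hctl>\d+:\d+:\d+:\d+): Power-on or device reset occurred"
)


def count_resets(lines):
    """Return a Counter mapping each SCSI address to its number of reset messages."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts[match.group("hctl")] += 1
    return counts


# A few lines taken from the log excerpt above, unwrapped.
sample = [
    "Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: Power-on or device reset occurred",
    "Sep 22 00:54:39 server001 kernel: sd 16:0:0:1: alua: port group 3e9 state A non-preferred supports TolUsNA",
    "Sep 22 00:54:39 server001 kernel: sd 17:0:0:1: Power-on or device reset occurred",
    "Sep 22 00:54:42 server001 kernel: sd 15:0:0:1: Power-on or device reset occurred",
]

if __name__ == "__main__":
    for hctl, n in sorted(count_resets(sample).items()):
        print(hctl, n)
```

Feeding the full log of each server through this would show whether every path and every host saw the same number of resets at the same times.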
We use ALUA multipath:
3600a098038305663785d505652713446 dm-15 NETAPP,LUN C-Mode
size=4.0T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 16:0:0:1 sdf 8:80 active ready running
| `- 17:0:0:1 sdi 8:128 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 15:0:0:1 sdc 8:32 active ready running
  `- 18:0:0:1 sdl 8:176 active ready running
Who is sending the "Power-on or device reset"?
Is it the SAN?
Or does the client trigger the reset (and if so, for what reason)?
The LUN is attached to multiple servers (all CentOS 8), and all of them
showed the same resets.
It would be nice to find out what caused this!
Thanks for having a look :)
Jean-Louis