Re: Problems with VMware: Detected NON_EXISTENT_LUN Access

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Mon, 01 May 2017 21:50:32 -0700

Hello,

On Tue, 2017-04-18 at 20:11 +0200, o@xxxxxxxxxx wrote:
> Hi everyone,
> 
> We have encountered a problem with LIO ISCSI and VMware ESXi 5.5 that is preventing us from using LIO in production.
> The test setup works and performs reasonably well. Unfortunately every 1 to 2 months everything hosted on the LIO target crashes and the ESX management tools become unresponsive until the storage server is rebooted.
> Even though the ESXi nodes are not responding (see logs below), I can discover, connect and mount the problematic LUN on any other Server.
> 
> Error messages on the storage server:
> [5094061.802861] iSCSI/iqn.1998-01.com.vmware:esx03-0db0bf2c: Unsupported SCSI Opcode 0x4d, sending CHECK_CONDITION.
> [5094086.496004] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000003e
> [5094087.054555] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000003e
> [5094115.472850] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000003f
> [5094116.763556] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x000000aa
> [5094130.725901] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x000000ab
> [5094144.445284] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000040
> [5094145.228081] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x000000ab
> 
> The "Unsupported SCSI Opcode 0x4d" also occurs during normal operation.
> 
> 
> Error messages on ESX(shortened):
> 2017-04-18T17:50:17.434Z cpu14:33518)WARNING: iscsi_vmk: iscsivmk_StartConnection: vmhba32:CH:0 T:6 CN:0: iSCSI connection is being marked "ONLINE"
> 2017-04-18T17:50:17.434Z cpu14:33518)WARNING: iscsi_vmk: iscsivmk_StartConnection: Sess [ISID: 00023d000002 TARGET: iqn.2016-11.local.xxx:vmstore01 TPGT: 1 TSIH: 0]
> 2017-04-18T17:50:17.434Z cpu14:33518)WARNING: iscsi_vmk: iscsivmk_StartConnection: Conn [CID: 0 L: 10.15.24.135:40650 R: 10.15.24.65:3260]
> 2017-04-18T17:50:18.976Z cpu13:104435914)HBX: 2959: Waiting for timed out [HB state abcdef02 offset 3928064 gen 55 stampUS 43972608588966 uuid 56573462-6e6fe026-3aa9-1cc1de771070 jrnl <FB 1722600> drv 14.60] on vol 'vmstore01-lun0'
> 2017-04-18T17:50:18.976Z cpu14:104107012)HBX: 2959: Waiting for timed out [HB state abcdef02 offset 3928064 gen 55 stampUS 43972608588966 uuid 56573462-6e6fe026-3aa9-1cc1de771070 jrnl <FB 1722600> drv 14.60] on vol 'vmstore01-lun0'
> 2017-04-18T17:50:28.098Z cpu10:33518)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba32:CH:0 T:6 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
> 2017-04-18T17:50:28.098Z cpu9:33518)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000002 TARGET: iqn.2016-11.local.xxx:vmstore01 TPGT: 1 TSIH: 0]
> 2017-04-18T17:50:28.098Z cpu9:33518)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 10.15.24.135:40650 R: 10.15.24.65:3260]
> 2017-04-18T17:50:28.098Z cpu11:33520)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.6001405e7e34bd545064ac286fee4e07" state in doubt; requested fast path state update...
> 2017-04-18T17:50:28.098Z cpu9:33518)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba32:CH:0 T:6 L:0 : Task mgmt "Abort Task" with itt=0x13e81f1f (refITT=0x13e81f1e) timed out.
> 2017-04-18T17:50:28.098Z cpu11:33520)ScsiDeviceIO: 2325: Cmd(0x413683ad8cc0) 0x89, CmdSN 0xa12cd4 from world 32824 to dev "naa.6001405e7e34bd545064ac286fee4e07" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x5 0x25 0x0.
> 2017-04-18T17:50:28.099Z cpu12:115955886)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:656: Path "vmhba32:C0:T6:L0" (UP) command 0xa3 failed with status Timeout. H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
> 2017-04-18T17:50:28.183Z cpu11:117253905)NMP: nmp_ThrottleLogForDevice:2349: Cmd 0x89 (0x413683ad8cc0, 32824) to dev "naa.6001405e7e34bd545064ac286fee4e07" on path "vmhba32:C0:T6:L0" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x5 0x25 0x0. Act:EVAL
> 2017-04-18T17:50:29.054Z cpu11:112429340)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.6001405e7e34bd545064ac286fee4e07" state in doubt; requested fast path state update...
> 2017-04-18T17:50:29.165Z cpu11:86245289)NMP: nmp_ThrottleLogForDevice:2349: Cmd 0x89 (0x413683ad8cc0, 32824) to dev "naa.6001405e7e34bd545064ac286fee4e07" on path "vmhba32:C0:T6:L0" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x5 0x25 0x0. Act:EVAL
> 2017-04-18T17:50:30.896Z cpu8:33518)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4109d7a54110 network resource pool netsched.pools.persist.iscsi associated
> 2017-04-18T17:50:30.896Z cpu8:33518)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4109d7a54110 network tracker id 15 tracker.iSCSI.10.15.24.65 associated
> 2017-04-18T17:50:30.987Z cpu12:86245285)NMP: nmp_ThrottleLogForDevice:2349: Cmd 0x28 (0x413684164580, 32823) to dev "naa.6001405e7e34bd545064ac286fee4e07" on path "vmhba32:C0:T6:L0" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
> 2017-04-18T17:50:31.059Z cpu12:112429340)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.6001405e7e34bd545064ac286fee4e07" state in doubt; requested fast path state update...
> 
> 
> Software:
> Ubuntu Server 16.04 kernel 4.4.0-62-generic
> backstore: iblock (LVM lv on HW raid)
> 
> 
> Hardware:
> LSI Megaraid 9271-4i
> Supermicro X9DR3-F
> 64G ECC RAM
> Intel Corporation 82599ES 10G network
> 
> 
> The storage server is still in the "failed" state. I can provide logs, remote access or anything else that might help resolve this issue. 
> 
> Other people with this problem: https://communities.vmware.com/thread/543819

The ESX HB timeouts you're observing are an ESX host-side issue known as the
'ATS heartbeat bug', which affects all versions of ESX v5.5u2 and above
with VMFS5 for all targets with VAAI (namely the AtomicTestandSet primitive)
enabled.
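
To confirm that a host is hitting these heartbeat timeouts, the HBX lines
from the ESX log quoted above can be counted per volume. The following is
only a rough sketch: it assumes the log lines look like the ones in the
original report, and the names HB_TIMEOUT and count_hb_timeouts are made up
for illustration, not part of any VMware or LIO tool.

```python
import re
from collections import Counter

# Assumption: heartbeat timeout lines follow the format seen in the quoted
# ESX log, e.g.
#   ...)HBX: 2959: Waiting for timed out [HB state ...] on vol 'vmstore01-lun0'
HB_TIMEOUT = re.compile(
    r"HBX: \d+: Waiting for timed out \[HB .*?\] on vol '([^']+)'"
)


def count_hb_timeouts(lines):
    """Return a Counter mapping volume name -> number of HB timeout lines."""
    counts = Counter()
    for line in lines:
        match = HB_TIMEOUT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines copied from the report above, plus one unrelated line.
SAMPLE_LOG = [
    "2017-04-18T17:50:18.976Z cpu13:104435914)HBX: 2959: Waiting for timed out "
    "[HB state abcdef02 offset 3928064 gen 55 stampUS 43972608588966 uuid "
    "56573462-6e6fe026-3aa9-1cc1de771070 jrnl <FB 1722600> drv 14.60] "
    "on vol 'vmstore01-lun0'",
    "2017-04-18T17:50:18.976Z cpu14:104107012)HBX: 2959: Waiting for timed out "
    "[HB state abcdef02 offset 3928064 gen 55 stampUS 43972608588966 uuid "
    "56573462-6e6fe026-3aa9-1cc1de771070 jrnl <FB 1722600> drv 14.60] "
    "on vol 'vmstore01-lun0'",
    "2017-04-18T17:50:28.098Z cpu9:33518)WARNING: iscsi_vmk: "
    "iscsivmk_StopConnection: Conn [CID: 0 L: 10.15.24.135:40650 "
    "R: 10.15.24.65:3260]",
]

if __name__ == "__main__":
    for volume, hits in count_hb_timeouts(SAMPLE_LOG).items():
        print(volume, hits)
```

A volume that keeps accumulating hits here while the target itself remains
reachable from other initiators matches the symptom described in the report.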

Here are a few pointers for background:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956
http://cormachogan.com/2015/04/17/heads-up-ats-miscompare-detected-between-test-and-set-hb-images/
http://www.thevirtualist.org/alert-application-outages-using-vaai-ats-on-vsphere-5-5-update2-vsphere-6-0/
https://www-304.ibm.com/support/docview.wss?uid=ssg1S1005201
http://h20565.www2.hpe.com/hpsc/doc/public/display?sp4ts.oid=75953&docId=mmr_sf-EN_US000005979&lang=en-us&cc=us&docLocale=en_US
https://community.emc.com/docs/DOC-52756

Note this is *not* a LIO target specific issue, but a well-known ESX
host-side issue that requires users to manually disable ATS heartbeat on
all ESX hosts using VAAI.

Disabling ATS heartbeat is absolutely required in order to get a
stable setup.  Note that most vendors with a vCenter Plugin do this
automatically.

The ESX side instructions for doing this are:

# esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5
# esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5
   Path: /VMFS3/UseATSForHBOnVMFS5
   Type: integer
   Int Value: 0
   Default Int Value: 1
   Min Value: 0
   Max Value: 1
   String Value:
   Default String Value:
   Valid Characters:
   Description: Use ATS for HB on ATS supported VMFS5 volumes
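
With more than a few hosts it may be easier to check the 'list' output
mechanically. Below is a minimal sketch that assumes the output has the
"Key: value" layout shown above and has already been captured as text (for
example over ssh). The function names parse_esxcli_option and
ats_heartbeat_disabled are hypothetical helpers, not part of esxcli.

```python
OPTION_PATH = "/VMFS3/UseATSForHBOnVMFS5"


def parse_esxcli_option(text):
    """Turn the 'Key: value' lines of the list output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon at all
            fields[key.strip()] = value.strip()
    return fields


def ats_heartbeat_disabled(text):
    """True only if the output is for the ATS HB option and Int Value is 0."""
    fields = parse_esxcli_option(text)
    return fields.get("Path") == OPTION_PATH and fields.get("Int Value") == "0"


# Output as shown above, after the 'set -i 0' command has been applied.
SAMPLE_OUTPUT = """\
   Path: /VMFS3/UseATSForHBOnVMFS5
   Type: integer
   Int Value: 0
   Default Int Value: 1
   Min Value: 0
   Max Value: 1
   String Value:
   Default String Value:
   Valid Characters:
   Description: Use ATS for HB on ATS supported VMFS5 volumes
"""

if __name__ == "__main__":
    print("ATS heartbeat disabled:", ats_heartbeat_disabled(SAMPLE_OUTPUT))
```

A host still at the default (Int Value: 1) makes ats_heartbeat_disabled()
return False, so it can be flagged for the 'set -i 0' command above.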

Beyond that, I'm not sure how old the v4.4 kernel on Ubuntu is, but I'd
recommend getting the latest v4.4.64 from below to pick up the latest
LIO bug-fixes as well:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/log/?h=linux-4.4.y
