On 02/14/2020 10:25 AM, Mike Christie wrote:
> On 02/13/2020 08:52 PM, Gesiel Galvão Bernardes wrote:
>> Hi
>>
>> On Sun, Feb 9, 2020 at 6:27 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
>>
>> On 02/08/2020 11:34 PM, Gesiel Galvão Bernardes wrote:
>> > Hi,
>> >
>> > On Thu, Feb 6, 2020 at 6:56 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
>> >
>> > On 02/05/2020 07:03 AM, Gesiel Galvão Bernardes wrote:
>> > > On Sun, Feb 2, 2020 at 12:37 AM, Gesiel Galvão Bernardes
>> > > <gesiel.bernardes@xxxxxxxxx> wrote:
>> > >
>> > >     Hi,
>> > >
>> > >     Only now was it possible to continue this. Below is the
>> > >     information required. Thanks in advance.
>> >
>> >
>> > Hey, sorry for the late reply. I just got back from PTO.
>> >
>> > > esxcli storage nmp device list -d naa.6001405ba48e0b99e4c418ca13506c8e
>> > > naa.6001405ba48e0b99e4c418ca13506c8e
>> > >    Device Display Name: LIO-ORG iSCSI Disk
>> > >    (naa.6001405ba48e0b99e4c418ca13506c8e)
>> > >    Storage Array Type: VMW_SATP_ALUA
>> > >    Storage Array Type Device Config: {implicit_support=on;
>> > >    explicit_support=off; explicit_allow=on; alua_followover=on;
>> > >    action_OnRetryErrors=on; {TPG_id=1,TPG_state=ANO}}
>> > >    Path Selection Policy: VMW_PSP_MRU
>> > >    Path Selection Policy Device Config: Current Path=vmhba68:C0:T0:L0
>> > >    Path Selection Policy Device Custom Config:
>> > >    Working Paths: vmhba68:C0:T0:L0
>> > >    Is USB: false
>> >
>> > ........
>> >
>> > > Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0xa. Act:FAILOVER
>> >
>> >
>> > Are you sure you are using tcmu-runner 1.4? Is that the actual daemon
>> > version running? Did you by any chance install the 1.4 rpm, but you/it
>> > did not restart the daemon? The error code above is returned in 1.3 and
>> > earlier.
>> >
>> > You are probably hitting a combo of 2 issues.
>> >
>> > We had only listed ESX 6.5 in the docs you probably saw, and in 6.7 the
>> > value of action_OnRetryErrors defaulted to on instead of off. You should
>> > set this back to off.
>> >
>> > You should also upgrade to the current version of tcmu-runner, 1.5.x. It
>> > should fix the issue you are hitting, so that non-IO commands like
>> > inquiry, RTPG, etc. are executed while failing over/back, and you would
>> > not hit the problem where path initialization and path-testing IO is
>> > failed, causing the path to be marked as failed.
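
For reference, both checks above can be done from the shell. A minimal
sketch, assuming an rpm-based gateway install and an ESXi 6.7 host; the
action_OnRetryErrors toggle is the per-device switch described in the
vSphere 6.7 docs, so confirm the exact option name with --help on your
build:

    # On each iSCSI gateway: confirm the installed package version and that
    # the running daemon was actually restarted after the update.
    rpm -q tcmu-runner
    systemctl status tcmu-runner      # check the "Active: ... since" timestamp
    systemctl restart tcmu-runner     # if the old daemon is still running

    # On the ESXi host: the ALUA config above shows action_OnRetryErrors=on.
    # Turn it off for this device, then re-check the device config.
    esxcli storage nmp satp generic deviceconfig set \
        -c disable_action_OnRetryErrors -d naa.6001405ba48e0b99e4c418ca13506c8e
    esxcli storage nmp device list -d naa.6001405ba48e0b99e4c418ca13506c8e
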
>> > I updated tcmu-runner to 1.5.2 and changed action_OnRetryErrors to off,
>> > but the problem continues 😭
>> >
>> > Attached is vmkernel.log.
>> >
>>
>>
>> When you stopped the iscsi gw at around 2020-02-09T01:51:25.820Z, how
>> many paths did your device have? Did:
>>
>> esxcli storage nmp path list -d your_device
>>
>> report only one path? Did
>>
>> esxcli iscsi session connection list
>>
>> show an iSCSI connection to each gw?
>>
>> Hmmm, I believe the problem may be here. I verified that I was listing
>> only one GW for each path. So I ran a "rescan HBA" on VMware on both
>> ESX hosts; now one of them lists the 3 gateways (I added one more), but
>> an ESX host with the same configuration continues to list only one
>> gateway. See the different outputs:
>>
>> [root@tcnvh7:~] esxcli iscsi session connection list
>> vmhba68,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,00023d000001,0
>>    Adapter: vmhba68
>>    Target: iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
>>    ISID: 00023d000001
>>    CID: 0
>>    DataDigest: NONE
>>    HeaderDigest: NONE
>>    IFMarker: false
>>    IFMarkerInterval: 0
>>    MaxRecvDataSegmentLength: 131072
>>    MaxTransmitDataSegmentLength: 262144
>>    OFMarker: false
>>    OFMarkerInterval: 0
>>    ConnectionAddress: 192.168.201.1
>>    RemoteAddress: 192.168.201.1
>>    LocalAddress: 192.168.201.107
>>    SessionCreateTime: 01/19/20 00:11:25
>>    ConnectionCreateTime: 01/19/20 00:11:25
>>    ConnectionStartTime: 02/13/20 23:03:10
>>    State: logged_in
>>
>> vmhba68,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,00023d000002,0
>>    Adapter: vmhba68
>>    Target: iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
>>    ISID: 00023d000002
>>    CID: 0
>>    DataDigest: NONE
>>    HeaderDigest: NONE
>>    IFMarker: false
>>    IFMarkerInterval: 0
>>    MaxRecvDataSegmentLength: 131072
>>    MaxTransmitDataSegmentLength: 262144
>>    OFMarker: false
>>    OFMarkerInterval: 0
>>    ConnectionAddress: 192.168.201.2
>>    RemoteAddress: 192.168.201.2
>>    LocalAddress: 192.168.201.107
>>    SessionCreateTime: 02/13/20 23:09:16
>>    ConnectionCreateTime: 02/13/20 23:09:16
>>    ConnectionStartTime: 02/13/20 23:09:16
>>    State: logged_in
>>
>> vmhba68,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,00023d000003,0
>>    Adapter: vmhba68
>>    Target: iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
>>    ISID: 00023d000003
>>    CID: 0
>>    DataDigest: NONE
>>    HeaderDigest: NONE
>>    IFMarker: false
>>    IFMarkerInterval: 0
>>    MaxRecvDataSegmentLength: 131072
>>    MaxTransmitDataSegmentLength: 262144
>>    OFMarker: false
>>    OFMarkerInterval: 0
>>    ConnectionAddress: 192.168.201.3
>>    RemoteAddress: 192.168.201.3
>>    LocalAddress: 192.168.201.107
>>    SessionCreateTime: 02/13/20 23:09:16
>>    ConnectionCreateTime: 02/13/20 23:09:16
>>    ConnectionStartTime: 02/13/20 23:09:16
>>    State: logged_in
>>
>> =====
>> [root@tcnvh8:~] esxcli iscsi session connection list
>> vmhba68,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,00023d000001,0
>>    Adapter: vmhba68
>>    Target: iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
>>    ISID: 00023d000001
>>    CID: 0
>>    DataDigest: NONE
>>    HeaderDigest: NONE
>>    IFMarker: false
>>    IFMarkerInterval: 0
>>    MaxRecvDataSegmentLength: 131072
>>    MaxTransmitDataSegmentLength: 262144
>>    OFMarker: false
>>    OFMarkerInterval: 0
>>    ConnectionAddress: 192.168.201.1
>>    RemoteAddress: 192.168.201.1
>>    LocalAddress: 192.168.201.108
>>    SessionCreateTime: 01/12/20 02:53:53
>>    ConnectionCreateTime: 01/12/20 02:53:53
>>    ConnectionStartTime: 02/13/20 23:06:40
>>    State: logged_in
>>
>> Is that the problem? Any ideas on how to proceed from here?
>>
>
> Yes. Normally, you would have the connection already created, and when
> one path/gateway goes down, the multipath layer will switch to another
> path. When the path/gateway comes back up, the initiator side's iSCSI
> layer will reconnect automatically and the multipath layer will re-set
> up the path structure, so it can fail back if it is a higher-priority
> path, or fail over later if other paths go down.
>
> Something happened with the automatic path connection process on that
> node. We know it works for that one gateway you brought up/down. For the
> other gateways I would check:
>
> 1. Check that all target portals are being discovered. In the GUI screen
> where you entered the discovery address, you should also see a list of
> all target portals that were found, in the static section. Do you only
> see 1 portal?
>
> See here:
>
> https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.storage.doc/GUID-66215AF3-2D81-4D1F-92D4-B9623FC1CB0E.html
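
The same portal check can also be done from the ESXi shell instead of the
GUI. A rough sketch using standard esxcli namespaces (vmhba68 is the
software iSCSI adapter from the outputs above; double-check the syntax
with "esxcli iscsi adapter discovery --help" on your build):

    # Discovery (send-target) addresses configured on the adapter:
    esxcli iscsi adapter discovery sendtarget list -A vmhba68
    # Target portals actually found from those addresses; a healthy host
    # here should list 192.168.201.1, .2 and .3 on port 3260:
    esxcli iscsi adapter target portal list -A vmhba68
    # Re-run discovery and rescan the adapter:
    esxcli iscsi adapter discovery rediscover -A vmhba68
    esxcli storage core adapter rescan -A vmhba68
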
Oh yeah, make sure you check the basics. If after a rescan you are seeing
only the one portal at 192.168.201.1, then make sure from tcnvh8 you can
ping the other addresses, 192.168.201.3 and 192.168.201.2.

> 2. If you see all the portals, then when you hit the rescan HBA button,
> do you see any errors on the target side in /var/log/messages? Maybe
> something about CHAP/login/auth errors?
>
> What about in /var/log/vmkernel.log on the initiator side? Any iscsi
> errors?
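
One way to watch both sides while pressing the rescan button is sketched
below, assuming a ceph-iscsi 3.x gateway managed by systemd (service and
log names may differ on your distro):

    # On each gateway node: follow the target-side daemons and syslog.
    journalctl -f -u tcmu-runner -u rbd-target-gw -u rbd-target-api
    tail -f /var/log/messages | grep -iE 'iscsi|chap|login|auth'
    # Also confirm that both ESXi initiator IQNs are defined with the same
    # CHAP credentials in the gateway config:
    gwcli ls

    # On the ESXi host (tcnvh8): follow the initiator-side log during the
    # rescan.
    tail -f /var/log/vmkernel.log
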