Re: Lock errors in iscsi gateway

Mike Christie <mchristi@xxxxxxxxxx> · Wed, 29 Apr 2020 10:47:24 -0500

On 4/29/20 2:11 AM, Simone Lazzaris wrote:
> On Tuesday, 28 April 2020 18:41:27 CEST, Mike Christie wrote:
> 
>  
> 
>> Could you send me:
> 
>>
> 
>> 1. The /var/log/messages for the initiator when you do IO and see those
> 
>> lock messages.
> 
>  
> 
> On the initiator (XenServer 7.1 which is based on CentOS AFAIK) the
> /var/log/messages is empty.
> 
> I (sporadically) see:
> 
> Apr 29 09:00:36 xs-n1 systemd[1]: Starting Multipath Count Service...
> 
> Apr 29 09:00:36 xs-n1 systemd[1]: Started Multipath Count Service.
> 
> Apr 29 09:00:36 xs-n1 systemd[1]: Started Session 146 of user root.
> 
> Apr 29 09:00:36 xs-n1 systemd[1]: Starting Session 146 of user root.
> 
> Apr 29 09:00:40 xs-n1 multipathd: dm-3: remove map (uevent)
> 
> Apr 29 09:00:40 xs-n1 multipathd: dm-3: devmap not registered, can't remove
> 
> Apr 29 09:00:40 xs-n1 multipathd: dm-3: remove map (uevent)
> 
> Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen]
> mpathalert=>xapi [label="PBD.get_all_records"];
> 
> Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen]
> mpathalert=>xapi [label="host.get_uuid"];
> 
> Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen]
> mpathalert=>xapi [label="host.get_name_label"];
> 
> Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen]
> mpathalert=>xapi [label="host.get_uuid"];
> 
> Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen]
> mpathalert=>xapi [label="host.get_name_label"];
> 
> Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen]
> mpathalert=>xapi [label="host.get_uuid"];
> 
> Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen]
> mpathalert=>xapi [label="host.get_name_label"];
> 
> Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen]
> mpathalert=>xapi [label="host.get_uuid"];
> 
> Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen]
> mpathalert=>xapi [label="host.get_name_label"];
> 
> Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen]
> mpathalert=>xapi [label="host.get_uuid"];
> 
> Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen]
> mpathalert=>xapi [label="host.get_name_label"];
> 
> Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen]
> mpathalert=>xapi [label="host.get_uuid"];
> 
> Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen]
> mpathalert=>xapi [label="host.get_name_label"];
> 
> Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen]
> mpathalert=>xapi [label="host.get_all_records"];
> 
>  
> 
>  
> 
>> 2. The output of
> 
>>
> 
>> From one of the gateways:
> 
>> # gwcli ls
> 
>>
> 
> Attached (gwcli.txt)
> 
>> From the initiator node whose /var/log/messages you send:
> 
>> # iscsiadm -m session -P 3
> 
>  
> 
> Attached (iscsi-session.txt)
> 
>  
> 
>> # multipath -ll
> 
>>
> 
>  
> 
> 36001405d7480e5f84b94ab19ebeebd6c dm-0 LIO-ORG ,TCMU device    
> 
> size=3.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
> 
> |-+- policy='queue-length 0' prio=50 status=active
> 
> | `- 2:0:0:0 sdc 8:32 active ready running
> 
> `-+- policy='queue-length 0' prio=10 status=enabled
> 
>   `- 3:0:0:0 sdb 8:16 active ready running
> 
>  
> 
>> 3. version info:
> 
>>
> 
>> # uname -a
> 
>  
> 
> On the Initiator:
> 
> Linux xs-n1 4.4.0+2 #1 SMP Thu Jun 15 16:38:02 UTC 2017 x86_64 x86_64
> x86_64 GNU/Linux
> 
>  
> 
> On the Target:
> 
> Linux iscsi1 4.18.0-147.8.1.el8_1.x86_64 #1 SMP Thu Apr 9 13:49:54 UTC
> 2020 x86_64 x86_64 x86_64 GNU/Linux
> 
>  
> 
>>
> 
>> If you are using rpm do:
> 
>> # rpm -q ceph-iscsi
> 
>> # rpm -q tcmu-runner
> 
>> # rpm -q python-rtslib
> 
>>
> 
> No, I've installed them from source on the target

What version of tcmu-runner did you use? Was it one of the 1.4 or 1.5
releases, or a build from the GitHub master branch?

There was a bug in the older 1.4 release: after a change on the Linux
kernel initiator side, the handling of an error code we used went from
retrying for up to 5 minutes to retrying only 5 times. Those 5 retries
were then used up in less than a second, so we could hit the issue you
are seeing.
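
To illustrate why that change matters, here is a toy simulation of the
two retry policies. It is only a sketch, not the kernel initiator or
tcmu-runner code: the 100 ms retry interval and the 2 second wait before
the operation can succeed are made-up numbers, and FakeClock,
retry_n_times and retry_until_deadline are hypothetical names. Only the
"5 retries" and "5 minutes" figures come from the description above.

```python
class FakeClock:
    """Simulated clock in integer milliseconds (no real sleeping)."""

    def __init__(self):
        self.now_ms = 0

    def advance(self, ms):
        self.now_ms += ms


def retry_n_times(op, clock, max_attempts, delay_ms):
    """Count-bounded policy: give up after max_attempts tries."""
    attempts = 0
    for _ in range(max_attempts):
        attempts += 1
        if op():
            return True, attempts
        clock.advance(delay_ms)
    return False, attempts


def retry_until_deadline(op, clock, timeout_ms, delay_ms):
    """Time-bounded policy: keep trying until timeout_ms has elapsed."""
    attempts = 0
    deadline = clock.now_ms + timeout_ms
    while True:
        attempts += 1
        if op():
            return True, attempts
        if clock.now_ms + delay_ms > deadline:
            return False, attempts
        clock.advance(delay_ms)


if __name__ == "__main__":
    READY_AT_MS = 2000   # assumption: the resource frees up after 2 s
    DELAY_MS = 100       # assumption: 100 ms between attempts

    clock = FakeClock()
    print(retry_n_times(lambda: clock.now_ms >= READY_AT_MS,
                        clock, 5, DELAY_MS), clock.now_ms)

    clock = FakeClock()
    print(retry_until_deadline(lambda: clock.now_ms >= READY_AT_MS,
                               clock, 5 * 60 * 1000, DELAY_MS), clock.now_ms)
```

With these assumed numbers the count-bounded policy gives up after 5
attempts at 500 ms of simulated time, while the time-bounded policy
succeeds on its 21st attempt at 2000 ms.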

> 
>> To map that to an iscsi gateway then you can do the following.
> 
>>
> 
>> If sdb is the AO (active/optimized) one, then run
> 
>>
> 
>> iscsiadm -m session -P 3
> 
>>
> 
>> Here you can see the sdXYZ name to iscsi session mapping. The iscsi
> 
>> session/connection's target IP address from that command should match to
> 
>> the gateway that is listed as the "owner" of the LUN in the "gwcli ls"
> 
>> output.
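
The mapping described above can be scripted. The following is a minimal
sketch that pulls the disk-to-portal mapping out of the text printed by
"iscsiadm -m session -P 3". It assumes the usual layout where each
session prints a "Current Portal:" line before its "Attached scsi disk"
lines. The function name map_disks_to_portals and the portal addresses
in the sample are hypothetical; only the disk names sdb and sdc come
from this thread.

```python
import re

# Trimmed, made-up sample in the layout iscsiadm -P 3 normally prints.
SAMPLE_OUTPUT = """\
Target: iqn.example:target (non-flash)
    Current Portal: 192.168.0.1:3260,1
    Persistent Portal: 192.168.0.1:3260,1
        Host Number: 2  State: running
        scsi2 Channel 00 Id 0 Lun: 0
            Attached scsi disk sdc      State: running
    Current Portal: 192.168.0.2:3260,2
    Persistent Portal: 192.168.0.2:3260,2
        Host Number: 3  State: running
        scsi3 Channel 00 Id 0 Lun: 0
            Attached scsi disk sdb      State: running
"""


def map_disks_to_portals(session_output):
    """Return {disk_name: portal_address} parsed from the session text."""
    disk_to_portal = {}
    current_portal = None
    for line in session_output.splitlines():
        # "Current Portal: <address>:<port>,<tpgt>" starts a new session.
        portal = re.search(r"Current Portal:\s*(\S+?):(\d+),", line)
        if portal:
            current_portal = portal.group(1)
            continue
        # Disks listed afterwards belong to the most recent portal.
        disk = re.search(r"Attached scsi disk\s+(\S+)", line)
        if disk and current_portal is not None:
            disk_to_portal[disk.group(1)] = current_portal
    return disk_to_portal


if __name__ == "__main__":
    for disk, portal in sorted(map_disks_to_portals(SAMPLE_OUTPUT).items()):
        print(disk, "->", portal)
```

The address reported for the active/optimized disk can then be compared
by hand with the gateway shown as the owner of the LUN in the "gwcli ls"
output.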
> 
>  
> 
> I see... thanks for the hint.
> 
>  
> 
> I've done a test: I've unmapped all the drives, then mapped the first
> gateway (iscsi1) on all the nodes, waited, then mapped the second
> gateway, to be sure that all the nodes would see the first node as the
> active/master.
> 
> Now things seem a little better in "normal" vm use: I only see the
> "Cannot send after transport endpoint shutdown." on the secondary target
> node.
> 
>  
> 
> I do see some hopping between the nodes when importing a disk drive, but
> at this point I'm starting to suspect some strange activity from the Xen
> infrastructure in that circumstance.
> 
>  
> 
> -- 
> 
> *Simone Lazzaris*
> 
>  *Qcom S.p.A. a socio unico*
> 
>  simone.lazzaris@xxxxxxx <mailto:simone.lazzaris@xxxxxxx> | www.qcom.it
> <https://www.qcom.it>
> 
>  * LinkedIn <https://www.linkedin.com/company/qcom-spa>* | *Facebook*
> <http://www.facebook.com/qcomspa>
> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx