Re: tcmu-runner crashing on 16.2.5

Thank you. This is exactly what I was looking for. 

If I’m understanding correctly, what gets listed as the “owner” is what gets advertised via ALUA as the primary path, while the lock owner indicates which gateway currently holds the exclusive lock for that image and is allowed to pass traffic for that LUN, correct? 
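For what it’s worth, here is a small sketch of how one could flag a divergence between the two from saved `gwcli disks/<pool>/<image> info` output. The sample values are taken from the dump quoted below in this thread; this is plain text scraping of the CLI output, not a ceph-iscsi API:

```shell
# Minimal sketch: given saved `gwcli disks/<pool>/<image> info` output,
# flag when the configured Owner differs from the current Lock Owner,
# i.e. the exclusive lock has failed over to another gateway.
info='Owner                 .. node01
Lock Owner            .. node02'

# "Owner .. node01"      -> field 3; "Lock Owner .. node02" -> field 4
owner=$(echo "$info" | awk '/^Owner /{print $3}')
lock_owner=$(echo "$info" | awk '/^Lock Owner /{print $4}')

if [ "$owner" != "$lock_owner" ]; then
  echo "failover: lock held by $lock_owner (configured owner $owner)"
fi
```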

BTW - there appears to be some other kind of bug. I’m using cephadm to bring the iSCSI gateways up and down. Right now I only have two configured, and ‘ceph orch ls’ shows only two, as expected: 

[root@cxcto-c240-j27-01 ~]# ceph orch ls
NAME                               PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager                       ?:9093,9094      1/1  8m ago     2M   count:1
crash                                             15/15  8m ago     2M   *
grafana                            ?:3000           1/1  8m ago     2M   count:1
iscsi.iscsi                                         2/2  8m ago     81m  cxcto-c240-j27-02.cisco.com;cxcto-c240-j27-03.cisco.com
mgr                                                 2/2  8m ago     2M   count:2
mon                                                 5/5  8m ago     5d   cxcto-c240-j27-01.cisco.com;cxcto-c240-j27-06.cisco.com;cxcto-c240-j27-08.cisco.com;cxcto-c240-j27-10.cisco.com;cxcto-c240-j27-12.cisco.com
node-exporter                      ?:9100         15/15  8m ago     2M   *
osd.dashboard-admin-1622750977792                  0/15  -          2M   *
osd.dashboard-admin-1622751032319               326/341  8m ago     2M   *
prometheus                         ?:9095           1/1  8m ago     2M   count:1

However, the gwcli command is still showing the other two gateways, which are no longer enabled. Where does this list of gateways get stored? It appears that the two gateways that are no longer part of the cluster still show as the owners of some of the LUNs: 

/iscsi-targets> ls
o- iscsi-targets ................................................................................. [DiscoveryAuth: CHAP, Targets: 3]
  o- iqn.2001-07.com.ceph:1622752075720 .................................................................. [Auth: CHAP, Gateways: 4]
  | o- disks ............................................................................................................ [Disks: 5]
  | | o- iscsi-pool-0001/iscsi-p0001-img-01 ........................................... [Owner: cxcto-c240-j27-02.cisco.com, Lun: 0]
  | | o- iscsi-pool-0001/iscsi-p0001-img-02 ........................................... [Owner: cxcto-c240-j27-04.cisco.com, Lun: 3]
  | | o- iscsi-pool-0003/iscsi-p0003-img-01 ........................................... [Owner: cxcto-c240-j27-03.cisco.com, Lun: 1]
  | | o- iscsi-pool-0003/iscsi-p0003-img-02 ........................................... [Owner: cxcto-c240-j27-05.cisco.com, Lun: 4]
  | | o- iscsi-pool-0005/iscsi-p0005-img-01 ........................................... [Owner: cxcto-c240-j27-02.cisco.com, Lun: 2]
  | o- gateways .............................................................................................. [Up: 2/4, Portals: 4]
  | | o- cxcto-c240-j27-02.cisco.com ......................................................................... [10.122.242.197 (UP)]
  | | o- cxcto-c240-j27-03.cisco.com ......................................................................... [10.122.242.198 (UP)]
  | | o- cxcto-c240-j27-04.cisco.com .................................................................... [10.122.242.199 (UNKNOWN)]
  | | o- cxcto-c240-j27-05.cisco.com .................................................................... [10.122.242.200 (UNKNOWN)]
  | o- host-groups .................................................................................................... [Groups : 0]
  | o- hosts ........................................................................................ [Auth: ACL_DISABLED, Hosts: 0]
  o- iqn.2001-07.com.ceph:1622752147345 .................................................................. [Auth: CHAP, Gateways: 4]
  | o- disks ............................................................................................................ [Disks: 5]
  | | o- iscsi-pool-0002/iscsi-p0002-img-01 ........................................... [Owner: cxcto-c240-j27-04.cisco.com, Lun: 0]
  | | o- iscsi-pool-0002/iscsi-p0002-img-02 ........................................... [Owner: cxcto-c240-j27-02.cisco.com, Lun: 3]
  | | o- iscsi-pool-0004/iscsi-p0004-img-01 ........................................... [Owner: cxcto-c240-j27-05.cisco.com, Lun: 1]
  | | o- iscsi-pool-0004/iscsi-p0004-img-02 ........................................... [Owner: cxcto-c240-j27-03.cisco.com, Lun: 4]
  | | o- iscsi-pool-0006/iscsi-p0006-img-01 ........................................... [Owner: cxcto-c240-j27-03.cisco.com, Lun: 2]
  | o- gateways .............................................................................................. [Up: 2/4, Portals: 4]
  | | o- cxcto-c240-j27-02.cisco.com ......................................................................... [10.122.242.197 (UP)]
  | | o- cxcto-c240-j27-03.cisco.com ......................................................................... [10.122.242.198 (UP)]
  | | o- cxcto-c240-j27-04.cisco.com .................................................................... [10.122.242.199 (UNKNOWN)]
  | | o- cxcto-c240-j27-05.cisco.com .................................................................... [10.122.242.200 (UNKNOWN)]
  | o- host-groups .................................................................................................... [Groups : 0]
  | o- hosts ........................................................................................ [Auth: ACL_DISABLED, Hosts: 0]
  o- iqn.2001-07.com.ceph:1627307422533 .................................................................. [Auth: CHAP, Gateways: 4]
    o- disks ............................................................................................................ [Disks: 1]
    | o- iscsi-pool-0007/iscsi-p0007-img-01 ........................................... [Owner: cxcto-c240-j27-04.cisco.com, Lun: 0]
    o- gateways .............................................................................................. [Up: 2/4, Portals: 4]
    | o- cxcto-c240-j27-02.cisco.com ......................................................................... [10.122.242.197 (UP)]
    | o- cxcto-c240-j27-03.cisco.com ......................................................................... [10.122.242.198 (UP)]
    | o- cxcto-c240-j27-04.cisco.com .................................................................... [10.122.242.199 (UNKNOWN)]
    | o- cxcto-c240-j27-05.cisco.com .................................................................... [10.122.242.200 (UNKNOWN)]
    o- host-groups .................................................................................................... [Groups : 0]
    o- hosts ........................................................................................ [Auth: ACL_DISABLED, Hosts: 0]


Currently only cxcto-c240-j27-02 and cxcto-c240-j27-03 are enabled, so I would not expect to see cxcto-c240-j27-04 and cxcto-c240-j27-05 owning some of the LUNs, but as you can see, they are still there. Is this a known issue, and is there a way to clean it up? Worst case, now that I know how to make sure the ESXi hosts see all the paths, I can just bring back up the two gateways that I had removed, but I was curious whether there was a way to clean this up. I’m guessing something is missing in what cephadm does to clean up when it removes a node. 
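As far as I can tell, ceph-iscsi persists its configuration (targets, gateways, disks) as a JSON blob in a RADOS object, by default named `gateway.conf` in the `rbd` pool, so that is presumably where the stale entries live. A hedged sketch follows: the `rados` dump command is shown only as a comment since it needs a live cluster, and the JSON layout below is simplified/assumed for illustration (only the hostnames are from this thread):

```shell
# On a live cluster, the stored config could be dumped with:
#   rados -p rbd get gateway.conf - | python3 -m json.tool
#
# Stand-in config with an assumed, simplified layout: a top-level
# "gateways" map keyed by gateway hostname.
cat > /tmp/gateway.conf.json <<'EOF'
{
  "gateways": {
    "cxcto-c240-j27-02.cisco.com": {},
    "cxcto-c240-j27-03.cisco.com": {},
    "cxcto-c240-j27-04.cisco.com": {},
    "cxcto-c240-j27-05.cisco.com": {}
  }
}
EOF

# Gateways that are actually deployed (per `ceph orch ls` above).
live="cxcto-c240-j27-02.cisco.com cxcto-c240-j27-03.cisco.com"

# Report config entries that no longer match a deployed gateway.
stale=$(python3 - "$live" /tmp/gateway.conf.json <<'PY'
import json, sys
live = set(sys.argv[1].split())
cfg = json.load(open(sys.argv[2]))
for gw in sorted(cfg["gateways"]):
    if gw not in live:
        print("stale:", gw)
PY
)
echo "$stale"
```

Depending on the ceph-iscsi version, gwcli may also offer a delete command under each target’s gateways node to drop a stale entry, which would be safer than editing the stored object by hand.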

-Paul



>> Is there a command that lets me view which gateway is primary for which LUN? I’m guessing when another gateway gets added, the calculation of who is primary for each LUN gets re-calculated and advertised out to the clients?
>> 
> In the `gwcli ls` output, such as:
> 
>     | o- hosts ....................................................................................... [Auth: ACL_ENABLED, Hosts: 1]
>     |   o- iqn.1994-05.com.redhat:client .................................................. [LOGGED-IN, Auth: None, Disks: 3(1026M)]
>     |     o- lun 0 ............................................................................ [datapool/block0(1G), Owner: node01]
>     |     o- lun 1 ............................................................................ [datapool/block1(1M), Owner: node02]
>     |     o- lun 2 ............................................................................ [datapool/block2(1M), Owner: node01]
> 
> The "Owner: node01" means gateway node01 is initially the primary for the LUN, but that is not always the current state, because the exclusive lock may have been lost and then acquired by node02.
> 
> Actually we should check the "Lock Owner" instead:
> 
> [root@node01 ~]# gwcli disks/datapool/block2 info
> Image                 .. block2
> Ceph Cluster          .. ceph
> Pool                  .. datapool
> Wwn                   .. 7d23f7b4-e0b3-4337-9224-5091513c9d83
> Size H                .. 1M
> Feature List          .. RBD_FEATURE_LAYERING
>                          RBD_FEATURE_EXCLUSIVE_LOCK
>                          RBD_FEATURE_OBJECT_MAP
>                          RBD_FEATURE_FAST_DIFF
>                          RBD_FEATURE_DEEP_FLATTEN
> Snapshots             ..
> Owner                 .. node01
> Lock Owner            .. node02
> State                 .. Online
> Backstore             .. user:rbd
> Backstore Object Name .. datapool.block2
> Control Values
> - hw_max_sectors .. 1024
> - max_data_area_mb .. 8
> - osd_op_timeout .. 30
> - qfull_timeout .. 5
> 
> The "Lock Owner" above shows which gateway is currently the primary. On Linux, if you test this via multipath, you will see that the "Owner" always equals the "Lock Owner", except after a path failover.
> 
> - Xiubo
> 
> 
>> -Paul
>> 
>> 
>> 
>>>> 
>>>> I did a quick test where I re-enabled a second iSCSI gateway to take a closer look at the paths on the ESXi hosts and I definitely see that when the second path becomes available, different hosts are pointing to different gateways for the Active I/O Path.
>>>> 
>>>> I was reading up on how ALUA works, and as far as I can tell, isn’t Ceph supposed to indicate to the ESXi hosts which iSCSI gateway “owns” a given LUN at any point in time, so that the hosts know which path to make active?
>>> 
>>> Yeah, the ceph-iscsi/tcmu-runner services will do that. They report this to the clients.
>>> 
>>> 
>>>> Could there be something wrong where more than one iSCSI gateway is advertising that it owns the LUN to the ESXi hosts?
>>>> 
>>> This has been tested and working well on Linux in production, and the logic has not changed for several years.
>>> 
>>> I am not sure how ESXi handles this internally, but it should be compliant with the iSCSI protocol; on Linux, multipath can successfully detect which path is active and will choose it.
>>> 
>>> 
>>>> -Paul
>>>> 
>> 
> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



