Re: tcmu-runner crashing on 16.2.5

Thanks. The problem is that when I start gwcli, I get this:

[root@cxcto-c240-j27-02 /]# gwcli
Warning: Could not load preferences file /root/.gwcli/prefs.bin.

2 gateways are inaccessible - updates will be disabled

2 gateways are inaccessible - updates will be disabled

2 gateways are inaccessible - updates will be disabled

I think that because gwcli believes the two gateways are down, it won’t let you remove them. It’s a bit of a catch-22.

I tried re-adding the two missing gateways via cephadm so that they would come back up, and then tried deleting the gateways from gwcli. That just locked up gwcli, even though the deletion does seem to have been applied to the configuration, and now I’m in a really strange state. I can’t get all the gateways up, and applying the configuration via cephadm does not seem to change the deployment of the iscsi services. If I try to deploy to all 4 servers, I end up with servers 02, 03, and 05 deployed, but 04 never deploys. If I change the configuration to deploy only to 03, it still stays deployed on 02, 03, and 05. It’s like it’s stuck somewhere, but I’m not sure where to look.

The configuration currently enables only one gateway, but ‘ceph orch ls’ shows 3/1 running (so 3 running even though there should only be 1):

[root@cxcto-c240-j27-01 ~]# ceph orch ls
NAME                               PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager                       ?:9093,9094      1/1  2m ago     3M   count:1
crash                                             15/15  4m ago     3M   *
grafana                            ?:3000           1/1  2m ago     3M   count:1
iscsi.iscsi                                         3/1  4m ago     9m   cxcto-c240-j27-03.cisco.com
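
The mismatch in the RUNNING column can be spotted mechanically. The following is a minimal sketch that only parses text shaped like the output above. The function name `find_overdeployed` is made up for illustration, and it assumes the RUNNING column is the first field of the form `<running>/<expected>`.

```python
import re


def find_overdeployed(orch_ls_output):
    """Return {service: (running, expected)} where running > expected."""
    mismatches = {}
    for line in orch_ls_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        # Assumption: RUNNING is the first field shaped like "<int>/<int>".
        for field in fields[1:]:
            m = re.fullmatch(r"(\d+)/(\d+)", field)
            if m:
                running, expected = int(m.group(1)), int(m.group(2))
                if running > expected:
                    mismatches[fields[0]] = (running, expected)
                break
    return mismatches


sample = """\
NAME                               PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager                       ?:9093,9094      1/1  2m ago     3M   count:1
crash                                             15/15  4m ago     3M   *
grafana                            ?:3000           1/1  2m ago     3M   count:1
iscsi.iscsi                                         3/1  4m ago     9m   cxcto-c240-j27-03.cisco.com
"""

print(find_overdeployed(sample))  # {'iscsi.iscsi': (3, 1)}
```

This only reports the discrepancy; it does not explain why the extra daemons are still deployed.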

And TCMU runner shows 3 hosts up:

  services:
    mon:         5 daemons, quorum cxcto-c240-j27-01.cisco.com,cxcto-c240-j27-06,cxcto-c240-j27-10,cxcto-c240-j27-08,cxcto-c240-j27-12 (age 16m)
    mgr:         cxcto-c240-j27-01.cisco.com.edgcpk(active, since 16m), standbys: cxcto-c240-j27-02.llzeit
    osd:         329 osds: 326 up (since 4m), 326 in (since 6d)
    tcmu-runner: 28 portals active (3 hosts)

Things seem to have gone from bad to worse: I can’t get back to the clean state where I had 2 gateways running properly. I was able to delete a gateway from one of the targets, but I can’t add it back because I can’t get all 4 gateways up again, and that appears to be the only way that gwcli will work (sort of).

If you have any suggestions on how to get out of this mess I’d appreciate it.

-Paul




On Sep 1, 2021, at 9:17 PM, Xiubo Li <xiubli@xxxxxxxxxx> wrote:


On 9/1/21 12:32 PM, Paul Giralt (pgiralt) wrote:


However, the gwcli command is still showing the other two gateways, which are no longer enabled. Where does this list of gateways get stored?

All of this configuration is stored in the "gateway.conf" object in the "rbd" pool.


How do I access this object? Is it a file or some kind of object store?



Just use the normal rados command:


# rados -p rbd ls
rbd_object_map.137335b78a72
rbd_header.137335b78a72
gateway.conf
rbd_directory
rbd_header.13750fee0be9ae
rbd_id.block2
rbd_object_map.13750fee0be9ae
rbd_object_map.1378bf4c6ef770
rbd_header.1378bf4c6ef770
rbd_id.block4
rbd_id.block3
# rados -p rbd get gateway.conf a.txt

But you'd better not touch this object manually; it's risky. If you want to change it, do so through the REST API or the gwcli command.
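
For read-only inspection, the dumped copy (a.txt above) can be examined offline without touching the object in the pool. The following is a minimal sketch that lists disks whose owner is not among the gateways expected to be active. It assumes the dump is JSON with a top-level "disks" map whose entries carry an "owner" field; that layout is an assumption and should be checked against the actual dump. The function name `stale_disk_owners` and the sample data are hypothetical.

```python
import json


def stale_disk_owners(config, active_gateways):
    """Map each disk to its owner when that owner is not an active gateway."""
    active = set(active_gateways)
    stale = {}
    # Assumed layout: config["disks"][<pool/image>]["owner"] = <gateway name>
    for disk_id, disk in config.get("disks", {}).items():
        owner = disk.get("owner")
        if owner and owner not in active:
            stale[disk_id] = owner
    return stale


# Hypothetical excerpt standing in for the contents of a.txt.
sample = json.loads("""
{
  "disks": {
    "iscsi-pool-0001/iscsi-p0001-img-01": {"owner": "cxcto-c240-j27-02.cisco.com"},
    "iscsi-pool-0001/iscsi-p0001-img-02": {"owner": "cxcto-c240-j27-04.cisco.com"}
  }
}
""")

active = ["cxcto-c240-j27-02.cisco.com", "cxcto-c240-j27-03.cisco.com"]
print(stale_disk_owners(sample, active))
```

To run it against a real dump, replace the sample with `json.load(open("a.txt"))`. The script never writes anything back, which keeps it in line with the advice not to edit the object by hand.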


 It appears that the two gateways that are no longer part of the cluster still appear as the owners of some of the LUNs:

/iscsi-targets> ls
o- iscsi-targets ................................................................................. [DiscoveryAuth: CHAP, Targets: 3]
  o- iqn.2001-07.com.ceph:1622752075720 .................................................................. [Auth: CHAP, Gateways: 4]
  | o- disks ............................................................................................................ [Disks: 5]
  | | o- iscsi-pool-0001/iscsi-p0001-img-01 ........................................... [Owner: cxcto-c240-j27-02.cisco.com, Lun: 0]
  | | o- iscsi-pool-0001/iscsi-p0001-img-02 ........................................... [Owner: cxcto-c240-j27-04.cisco.com, Lun: 3]
  | | o- iscsi-pool-0003/iscsi-p0003-img-01 ........................................... [Owner: cxcto-c240-j27-03.cisco.com, Lun: 1]
  | | o- iscsi-pool-0003/iscsi-p0003-img-02 ........................................... [Owner: cxcto-c240-j27-05.cisco.com, Lun: 4]
  | | o- iscsi-pool-0005/iscsi-p0005-img-01 ........................................... [Owner: cxcto-c240-j27-02.cisco.com, Lun: 2]
  | o- gateways .............................................................................................. [Up: 2/4, Portals: 4]
  | | o- cxcto-c240-j27-02.cisco.com ......................................................................... [10.122.242.197 (UP)]
  | | o- cxcto-c240-j27-03.cisco.com ......................................................................... [10.122.242.198 (UP)]
  | | o- cxcto-c240-j27-04.cisco.com .................................................................... [10.122.242.199 (UNKNOWN)]
  | | o- cxcto-c240-j27-05.cisco.com .................................................................... [10.122.242.200 (UNKNOWN)]
  | o- host-groups .................................................................................................... [Groups : 0]
  | o- hosts ........................................................................................ [Auth: ACL_DISABLED, Hosts: 0]
  o- iqn.2001-07.com.ceph:1622752147345 .................................................................. [Auth: CHAP, Gateways: 4]
  | o- disks ............................................................................................................ [Disks: 5]
  | | o- iscsi-pool-0002/iscsi-p0002-img-01 ........................................... [Owner: cxcto-c240-j27-04.cisco.com, Lun: 0]
  | | o- iscsi-pool-0002/iscsi-p0002-img-02 ........................................... [Owner: cxcto-c240-j27-02.cisco.com, Lun: 3]
  | | o- iscsi-pool-0004/iscsi-p0004-img-01 ........................................... [Owner: cxcto-c240-j27-05.cisco.com, Lun: 1]
  | | o- iscsi-pool-0004/iscsi-p0004-img-02 ........................................... [Owner: cxcto-c240-j27-03.cisco.com, Lun: 4]
  | | o- iscsi-pool-0006/iscsi-p0006-img-01 ........................................... [Owner: cxcto-c240-j27-03.cisco.com, Lun: 2]
  | o- gateways .............................................................................................. [Up: 2/4, Portals: 4]
  | | o- cxcto-c240-j27-02.cisco.com ......................................................................... [10.122.242.197 (UP)]
  | | o- cxcto-c240-j27-03.cisco.com ......................................................................... [10.122.242.198 (UP)]
  | | o- cxcto-c240-j27-04.cisco.com .................................................................... [10.122.242.199 (UNKNOWN)]
  | | o- cxcto-c240-j27-05.cisco.com .................................................................... [10.122.242.200 (UNKNOWN)]
  | o- host-groups .................................................................................................... [Groups : 0]
  | o- hosts ........................................................................................ [Auth: ACL_DISABLED, Hosts: 0]
  o- iqn.2001-07.com.ceph:1627307422533 .................................................................. [Auth: CHAP, Gateways: 4]
    o- disks ............................................................................................................ [Disks: 1]
    | o- iscsi-pool-0007/iscsi-p0007-img-01 ........................................... [Owner: cxcto-c240-j27-04.cisco.com, Lun: 0]
    o- gateways .............................................................................................. [Up: 2/4, Portals: 4]
    | o- cxcto-c240-j27-02.cisco.com ......................................................................... [10.122.242.197 (UP)]
    | o- cxcto-c240-j27-03.cisco.com ......................................................................... [10.122.242.198 (UP)]
    | o- cxcto-c240-j27-04.cisco.com .................................................................... [10.122.242.199 (UNKNOWN)]
    | o- cxcto-c240-j27-05.cisco.com .................................................................... [10.122.242.200 (UNKNOWN)]
    o- host-groups .................................................................................................... [Groups : 0]
    o- hosts ........................................................................................ [Auth: ACL_DISABLED, Hosts: 0]
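
The gateways that gwcli does not report as UP can be pulled out of a listing like the one above with a small script. The following is a minimal sketch that works purely on pasted text. The regular expression and the function name `gateways_not_up` are illustrative assumptions based on the line shape shown here, not part of gwcli.

```python
import re

# Matches gateway rows such as:
#   o- <hostname> ........ [<ipv4 address> (<STATE>)]
GATEWAY_LINE = re.compile(r"o- (\S+) \.+ \[(\d+\.\d+\.\d+\.\d+) \((\w+)\)\]")


def gateways_not_up(gwcli_ls_output):
    """Return the sorted, de-duplicated gateway names whose state is not UP."""
    down = set()
    for line in gwcli_ls_output.splitlines():
        m = GATEWAY_LINE.search(line)
        if m and m.group(3) != "UP":
            down.add(m.group(1))
    return sorted(down)


# Shortened copy of the gateway rows from the listing.
sample = """\
  | o- gateways .......... [Up: 2/4, Portals: 4]
  | | o- cxcto-c240-j27-02.cisco.com .......... [10.122.242.197 (UP)]
  | | o- cxcto-c240-j27-03.cisco.com .......... [10.122.242.198 (UP)]
  | | o- cxcto-c240-j27-04.cisco.com .......... [10.122.242.199 (UNKNOWN)]
  | | o- cxcto-c240-j27-05.cisco.com .......... [10.122.242.200 (UNKNOWN)]
"""

print(gateways_not_up(sample))
# ['cxcto-c240-j27-04.cisco.com', 'cxcto-c240-j27-05.cisco.com']
```

Because the same gateway appears under every target, the result is collected in a set before sorting.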


Currently only cxcto-c240-j27-02 and cxcto-c240-j27-03 are enabled, so I would not expect to see cxcto-c240-j27-04 and cxcto-c240-j27-05 as owning some of the LUNs, but as you can see, they are there. Is this a known issue, and is there a way to clean this up? Worst case, now that I know how to make sure the ESXi hosts see all the paths, I can just bring back up the other two that I had removed, but I was curious if there was a way to clean this up. I’m guessing something is missing in what cephadm does to clean up when it removes a node.

It seems that either cephadm or you didn't clean that up. Where did those two stale gateways come from? Were you using them before the upgrade, and did you switch to the -02 and -03 ones afterwards?


I’m not sure what you mean by “you didn’t clean that up”. Are there steps I need to take to clean up besides re-applying the configuration using ceph orch?

For these stale gateways you can remove them manually by using the REST API or gwcli command.


This cluster was initially installed on 16.2.4. The only upgrade I’ve done was to 16.2.5. All 4 gateways were present before and after the upgrade. I only recently removed the two (well, actually I had removed 3 of them leaving only one) as a workaround for this problem. After I figured out what was causing the issue on the ESXi side, I added one back.


If you remove those two gateways directly via gwcli or the REST API, and there are no errors in the ceph-iscsi logs, it should work fine. I haven't hit any issues with this so far.


The way that I’m adding / removing is through a yaml file like this:

service_type: iscsi
service_id: iscsi
placement:
  hosts:
    - cxcto-c240-j27-02.cisco.com
    - cxcto-c240-j27-03.cisco.com
spec:
  pool: iscsi-config

(I’ve removed the lines with the username / password here)

Originally the file had 4 hosts, then I switched it to 1, and now there are 2. I’m applying the configuration using "ceph orch apply -i iscsi.yaml".

ceph orch ls seems to show the correct configuration of only two gateways configured.

BTW - I’ve always had this problem from day 1 that I filed a bug for - https://tracker.ceph.com/issues/51111#change-199548 - not sure if it’s related, but it looks like tracking the tcmu-runner containers has never quite worked properly.

Yeah, if you are using cephadm, it is possibly buggy there.

- Xiubo


-Paul
