Re: tcmu-runner crashing on 16.2.5

On Sep 3, 2021, at 4:28 AM, Xiubo Li <xiubli@xxxxxxxxxx> wrote:

And TCMU runner shows 3 hosts up:

  services:
    mon:         5 daemons, quorum cxcto-c240-j27-01.cisco.com,cxcto-c240-j27-06,cxcto-c240-j27-10,cxcto-c240-j27-08,cxcto-c240-j27-12 (age 16m)
    mgr:         cxcto-c240-j27-01.cisco.com.edgcpk(active, since 16m), standbys: cxcto-c240-j27-02.llzeit
    osd:         329 osds: 326 up (since 4m), 326 in (since 6d)
    tcmu-runner: 28 portals active (3 hosts)

Could you check on all the gateway nodes whether the tcmu-runner service is still alive?

The status is reported by the tcmu-runner service itself, not by ceph-iscsi.
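
On a cephadm deployment the checks would look roughly like this (standard cephadm/podman commands; container names vary with the cluster fsid, and the grep pattern is just a guess at what your containers are called):

    # Ask the orchestrator which iscsi daemons it believes are running
    ceph orch ps --daemon-type iscsi

    # On each gateway node, look for the containers directly
    podman ps | grep -Ei 'iscsi|tcmu'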


That’s the issue I’m having now: I can’t get the iSCSI services (both the API gateway and tcmu-runner) to start on one of the 4 servers, for reasons I can’t determine. Since I’m using cephadm to orchestrate enabling and disabling services on nodes, I first used cephadm to add all 4 gateways back. They were all running, and gwcli allowed me to attempt a change to remove one portal from one target; however, gwcli locked up when I did this. The configuration change itself appears to have taken effect, but since that event cephadm no longer seems able to properly orchestrate the addition and removal of iSCSI gateways.

I’m now in a state where the service runs on 3 of the servers (02, 03, 05) no matter what I do. If I tell cephadm to run iscsi only on node 03, for example, it keeps running on 02 and 05 as well. If I tell cephadm to run it on all 4 servers, it still only runs on 02, 03, and 05; it won’t start on 04 anymore. I’m not sure how to tell whether it’s even trying, since I don’t know how cephadm orchestrates the deployment of the containers.
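
For reference, this is roughly how I’ve been inspecting what cephadm thinks it should be doing (all standard cephadm commands, as far as I know):

    # What the orchestrator believes is deployed vs. actually running
    ceph orch ls --service-type iscsi
    ceph orch ps --daemon-type iscsi

    # Recent cephadm activity, to see whether deploys are being attempted
    ceph log last 100 debug cephadm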



Things seem to have gone from bad to worse. I can’t get back to the clean state where I had 2 gateways running properly: I was able to delete a gateway from one of the targets, but I can’t add it back again, because I can’t get all 4 gateways back up, and that appears to be the only state in which gwcli will work (sort of).

If you have any suggestions on how to get out of this mess I’d appreciate it.

Since ceph-iscsi couldn’t connect to the stale gateways, it forbids you from changing anything. Could you check whether the rbd-target-api service is alive?

Then you can try editing the 'gateway.conf' directly.
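
Note that 'gateway.conf' is not a file on disk; ceph-iscsi stores it as a JSON object in RADOS, in the 'rbd' pool by default. So inspecting and modifying it would look roughly like this (back it up first; the pool may differ on your cluster):

    # Fetch and back up the current config object
    rados -p rbd get gateway.conf /tmp/gateway.conf
    cp /tmp/gateway.conf /tmp/gateway.conf.bak

    # ... edit /tmp/gateway.conf to drop the stale gateways ...

    # Write the modified object back
    rados -p rbd put gateway.conf /tmp/gateway.conf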

Let’s say that 2 of the servers were dead for some reason and there was no way to get them back online. In that case, is the only way to recover to modify gateway.conf? I’m a little nervous about doing this, since your last email said not to mess with the file, but I was able to download it and modifying it looks relatively straightforward. Who is responsible for creating that file? I’m thinking what I should probably do is:

- Shut down ESXi cluster so there are no iSCSI accesses
- Tell cephadm to undeploy all iscsi gateways. If this doesn’t work (which it probably won’t), just stop the tcmu-runner and iscsi containers on all servers so they’re not running.
- Modify gateway.conf to remove the gateways except for two
- Try to use cephadm to re-deploy on the two servers
- Bring back up the ESXi hosts.

Does this sound like a reasonable plan? A rough sketch of the commands I have in mind for steps 2 and 4 is below. I’m also not sure if there is anything else to look at on the cephadm side to understand why services are no longer being added or removed.
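
For steps 2 and 4, I’m assuming something like the following (the service and daemon names are placeholders; I’d pull the real ones from 'ceph orch ls' and 'ceph orch ps' first):

    # Step 2: ask cephadm to remove the iscsi service entirely
    ceph orch rm <iscsi-service-name>

    # Fallback if that doesn't work: stop the daemons individually
    ceph orch daemon stop <iscsi-daemon-name>

    # Step 4: export the current service spec, trim the placement down
    # to the two surviving hosts, then re-apply it
    ceph orch ls --service-type iscsi --export > iscsi.yaml
    # (edit iscsi.yaml so placement lists only the two gateway hosts)
    ceph orch apply -i iscsi.yaml

Step 3 would then be the rados get/put on gateway.conf shown earlier in the thread.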

-Paul


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



