Re: tcmu-runner crashing on 16.2.5

Thank you, Xiubo. confirm=true worked, and I was able to make the update via gwcli and get everything reset back to normal again. I'm stable for now, but I'm still hoping this fix can get in soon so the crash doesn't happen again.

Appreciate all your help on this.

-Paul


On Sep 6, 2021, at 7:29 AM, Xiubo Li <xiubli@xxxxxxxxxx> wrote:



On 9/3/21 11:32 PM, Paul Giralt (pgiralt) wrote:


On Sep 3, 2021, at 4:28 AM, Xiubo Li <xiubli@xxxxxxxxxx> wrote:

And TCMU runner shows 3 hosts up:

  services:
    mon:         5 daemons, quorum cxcto-c240-j27-01.cisco.com,cxcto-c240-j27-06,cxcto-c240-j27-10,cxcto-c240-j27-08,cxcto-c240-j27-12 (age 16m)
    mgr:         cxcto-c240-j27-01.cisco.com.edgcpk(active, since 16m), standbys: cxcto-c240-j27-02.llzeit
    osd:         329 osds: 326 up (since 4m), 326 in (since 6d)
    tcmu-runner: 28 portals active (3 hosts)

Could you check whether the tcmu-runner service is still alive on all of the gateway nodes?

That status is reported by the tcmu-runner service, not by ceph-iscsi.
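
For example, on a cephadm deployment something like this should show whether it is alive on each gateway node (just a sketch; the container id is a placeholder, and depending on the cephadm version tcmu-runner may run inside the iscsi daemon container rather than in a container of its own):

$ ceph orch ps | grep iscsi                     # which iscsi daemons cephadm knows about, and their state
$ podman ps | grep -E 'iscsi|tcmu'              # on each gateway node, the container(s) should be running
$ podman top <container-id> | grep tcmu-runner  # confirm the tcmu-runner process inside the container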


That's the issue I'm having now: I can't get the iSCSI services (both the API gateway and tcmu-runner) to start on one of the 4 servers for some reason. Since I'm using cephadm to orchestrate enabling/disabling services on nodes, I first used cephadm to add all 4 gateways back. They were all running, and gwcli let me make a change to try to remove one portal from one target, but gwcli locked up when I did so. It looks like the configuration change did take place; however, after that event cephadm no longer appears to be able to properly orchestrate the addition/removal of iSCSI gateways.

I'm in a state where it tries to run on 3 of the servers (02, 03, 05) no matter what I do. If I set cephadm to run iscsi only on node 03, for example, it keeps running on 02 and 05 as well. If I set cephadm to run on all 4 servers, it still only runs on 02, 03, and 05; it won't start on 04 anymore. I'm not really sure how to see whether it's even trying, as I'm not sure how cephadm orchestrates the deployment of the containers.
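
From what I can tell, these are the kinds of checks that should show what cephadm is trying to do, though I'm not sure how much they will reveal (just a sketch; exact flags may differ by release, and the service name is whatever cephadm created for iscsi):

$ ceph orch ls iscsi --export    # the current iscsi service spec, including its placement
$ ceph orch ps | grep iscsi      # which daemons cephadm has actually deployed, and where
$ ceph log last cephadm          # recent cephadm / orchestrator log messages
$ ceph health detail             # orchestrator errors usually surface here too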


Have you tried "confirm=true" when deleting those two stale gateways?

For example, in my setup I powered off node02:


$ gwcli

...

    | o- gateways ............................................................................................ [Up: 1/2, Portals: 2]
    | | o- node01 ............................................................................................ [172.16.219.128 (UP)]
    | | o- node02 ....................................................................................... [172.16.219.138 (UNKNOWN)]

...

/> iscsi-targets/iqn.2003-01.com.redhat.iscsi-gw:ceph-igw/gateways/ delete gateway_name=node02
Deleting gateway, node02

Could not contact node02. If the gateway is permanently down. Use confirm=true to force removal. WARNING: Forcing removal of a gateway that can still be reached by an initiator may result in data corruption.
/>
/> iscsi-targets/iqn.2003-01.com.redhat.iscsi-gw:ceph-igw/gateways/ delete gateway_name=node02 confirm=true
Deleting gateway, node02
/>


I could remove it without bringing gateway node02 back up.
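
If you later need to add it back, it should be roughly the same kind of command from the gateways node (a sketch; the gateway name and IP are from my example above, and skipchecks=true is only needed while not all gateways are reachable):

/> iscsi-targets/iqn.2003-01.com.redhat.iscsi-gw:ceph-igw/gateways/ create gateway_name=node02 ip_addresses=172.16.219.138 skipchecks=true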



Things seem to have gone from bad to worse. I can't get back to the clean state where I had 2 gateways running properly: I was able to delete a gateway from one of the targets, but I can't add it back, because I can't get all 4 gateways back up, and that appears to be the only way gwcli will (sort of) work.

If you have any suggestions on how to get out of this mess I’d appreciate it.

Since ceph-iscsi couldn't connect to the stale gateways, it forbids you from changing anything about them. Could you check whether the rbd-target-api service is alive?

If it is, you can then try changing the 'gateway.conf' object directly to fix the configuration.
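
For example (assuming the default 'rbd' pool for the ceph-iscsi configuration object; adjust the pool name if your setup uses a different one):

$ ceph orch ps | grep iscsi                        # the iscsi daemons run rbd-target-api, so they should show as running
$ rados -p rbd get gateway.conf /tmp/gateway.conf  # pull the config object out of the cluster
$ python3 -m json.tool /tmp/gateway.conf           # it is JSON, so this pretty-prints it for inspection
$ rados -p rbd put gateway.conf /tmp/gateway.conf  # write it back, only after editing and with all gateways stopped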

Let's say that 2 of the servers were dead for some reason and there was no way to get them back online. Would the only way to recover in that case be to modify gateway.conf? I'm a little nervous about doing this, since your last email said not to mess with the file, but I was able to download it and it looks like modifying it would be relatively straightforward. Who is responsible for creating that file? Here is what I'm thinking I should probably do:

- Shut down the ESXi cluster so there are no iSCSI accesses.
- Tell cephadm to undeploy all iSCSI gateways. If this doesn't work (which it probably won't), just stop the tcmu-runner and iscsi containers on all servers so they're not running.
- Modify gateway.conf to remove all gateways except for two.
- Try to use cephadm to re-deploy on the two servers.
- Bring the ESXi hosts back up.

Does this sound like a reasonable plan? I'm not sure if there is anything else worth looking at on the cephadm side to understand why services are no longer being added or removed.
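
In command form I am imagining roughly the following (placeholders throughout; I would double-check the exact 'ceph orch apply iscsi' arguments and the service name against the cephadm docs for this release before running anything):

$ ceph orch rm iscsi.<service-name>                # stop/remove the iscsi service so no gateway containers are running
$ rados -p rbd get gateway.conf /tmp/gateway.conf  # pull the config object (default 'rbd' pool)
  ... edit /tmp/gateway.conf to drop the stale gateway entries ...
$ rados -p rbd put gateway.conf /tmp/gateway.conf  # push the edited config back
$ ceph orch apply iscsi <pool> <api_user> <api_password> --placement="<host1> <host2>"   # re-deploy on the two good hosts only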

-Paul



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



