Re: Cephadm not properly adding / removing iscsi services anymore

"Paul Giralt (pgiralt)" <pgiralt@xxxxxxxxx> · Wed, 8 Sep 2021 14:13:25 +0000

Thanks for the tip. I’ve just been using ‘docker exec -it <container id> /bin/bash’ to get into the containers, but those commands sound useful. I think I’ll install cephadm on all nodes just for this. 

Thanks again, 
-Paul

> On Sep 8, 2021, at 10:11 AM, Eugen Block <eblock@xxxxxx> wrote:
> 
> Okay, I'm glad it worked!
> 
> 
>> At first I tried cephadm rm-daemon on the bootstrap node that I usually do all management from and it indicated that it could not remove the daemon:
>> 
>> [root@cxcto-c240-j27-01 ~]# cephadm rm-daemon --name iscsi.cxcto-c240-j27-04.lgqtxo --fsid 4a29e724-c4a6-11eb-b14a-5c838f8013a5
>> ERROR: Daemon not found: iscsi.cxcto-c240-j27-04.lgqtxo. See `cephadm ls`
>> 
>> When I ran ‘cephadm ls’ I only saw services running locally on that server, not the whole cluster. I’m not sure if this is expected or not.
> 
> As far as I can tell this is expected, yes. I only have a lab environment with containers (we're still hesitating to upgrade to Octopus), but all of its virtual nodes have cephadm installed. I thought that was a requirement, though I may be wrong. It definitely helps with debugging: for example, 'cephadm enter --name <daemon>' gives you a shell in that container, and 'cephadm logs --name <daemon>' lets you inspect that daemon's logs.
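
A minimal sketch of the point above, for anyone scripting around this: before calling 'cephadm rm-daemon' on a host, check whether the daemon is in that host's own 'cephadm ls' listing. It assumes 'cephadm ls' prints a JSON array of objects that each carry a "name" key. The function name and the sample data are hypothetical and not taken from this cluster.

```python
import json


def daemon_is_local(cephadm_ls_json: str, daemon_name: str) -> bool:
    """Return True if daemon_name is listed in this host's `cephadm ls` output.

    Assumption: the output is a JSON array of objects with a "name" key.
    """
    daemons = json.loads(cephadm_ls_json)
    return any(d.get("name") == daemon_name for d in daemons)


# Hypothetical, trimmed-down listing for a host that runs no iscsi daemon.
SAMPLE_LS_OUTPUT = json.dumps([
    {"style": "cephadm:v1", "name": "mon.host-a", "fsid": "example-fsid"},
    {"style": "cephadm:v1", "name": "crash.host-a", "fsid": "example-fsid"},
])

if __name__ == "__main__":
    for name in ("mon.host-a", "iscsi.host-b.xyz"):
        print(name, daemon_is_local(SAMPLE_LS_OUTPUT, name))
```

If the check comes back False, the daemon lives on another host, and the rm-daemon call has to be run there instead.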
> 
> 
> Quoting "Paul Giralt (pgiralt)" <pgiralt@xxxxxxxxx>:
> 
>> Thanks Eugen.
>> 
>> At first I tried cephadm rm-daemon on the bootstrap node that I usually do all management from and it indicated that it could not remove the daemon:
>> 
>> [root@cxcto-c240-j27-01 ~]# cephadm rm-daemon --name iscsi.cxcto-c240-j27-04.lgqtxo --fsid 4a29e724-c4a6-11eb-b14a-5c838f8013a5
>> ERROR: Daemon not found: iscsi.cxcto-c240-j27-04.lgqtxo. See `cephadm ls`
>> 
>> When I ran ‘cephadm ls’ I only saw services running locally on that server, not the whole cluster. I’m not sure if this is expected or not. I then installed cephadm on the cxcto-c240-j27-04 server and issued the rm-daemon command there, and it worked. As soon as I did this, the containers on the other two servers that were not supposed to be running the iscsi gateway were removed, and everything appeared to be back to normal. I then added one server back to the yaml file and applied it from the original bootstrap node, and it deployed properly. So somehow deleting that daemon on the 04 server got everything working again.
>> 
>> Still not exactly sure why that fixed it, but at least it’s working again. Thanks for the suggestion.
>> 
>> -Paul
>> 
>> 
>>> On Sep 8, 2021, at 4:12 AM, Eugen Block <eblock@xxxxxx> wrote:
>>> 
>>> If you only configured 1 iscsi gw but you see 3 running, have you tried to destroy them with 'cephadm rm-daemon --name ...'? On the active MGR host, run 'journalctl -f' and you'll see plenty of information, which should also include details about the iscsi deployment. Or run 'cephadm logs --name <iscsi-gw>'.
>>> 
>>> 
>>> Quoting "Paul Giralt (pgiralt)" <pgiralt@xxxxxxxxx>:
>>> 
>>>> This was working until recently and now seems to have stopped working. Running Pacific 16.2.5. When I modify the deployment YAML file for my iscsi gateways, the services are not being added or removed as requested. It’s as if the state is “stuck”.
>>>> 
>>>> At one point I had 4 iSCSI gateways: 02, 03, 04 and 05. Through some back and forth of deploying and undeploying, I ended up in a state where the services are running on servers 02, 03, and 05 no matter what I tell cephadm to do. For example, right now I have the following configuration:
>>>> 
>>>> service_type: iscsi
>>>> service_id: iscsi
>>>> placement:
>>>>   hosts:
>>>>     - cxcto-c240-j27-03.cisco.com
>>>> spec:
>>>>   pool: iscsi-config
>>>> … removed the rest of this file ….
>>>> 
>>>> However ceph orch ls shows this:
>>>> 
>>>> [root@cxcto-c240-j27-01 ~]# ceph orch ls
>>>> NAME                               PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
>>>> alertmanager                       ?:9093,9094      1/1  9m ago     3M   count:1
>>>> crash                                             15/15  10m ago    3M   *
>>>> grafana                            ?:3000           1/1  9m ago     3M   count:1
>>>> iscsi.iscsi                                         3/1  10m ago    11m  cxcto-c240-j27-03.cisco.com
>>>> mgr                                                 2/2  9m ago     3M   count:2
>>>> mon                                                 5/5  9m ago     12d  cxcto-c240-j27-01.cisco.com;cxcto-c240-j27-06.cisco.com;cxcto-c240-j27-08.cisco.com;cxcto-c240-j27-10.cisco.com;cxcto-c240-j27-12.cisco.com
>>>> node-exporter                      ?:9100         15/15  10m ago    3M   *
>>>> osd.dashboard-admin-1622750977792                  0/15  -          3M   *
>>>> osd.dashboard-admin-1622751032319               326/341  10m ago    3M   *
>>>> prometheus                         ?:9095           1/1  9m ago     3M   count:1
>>>> 
>>>> Notice it shows 3/1 because the service is still running on 3 servers even though I’ve told it to only run on one. If I configure all 4 servers and apply (ceph orch apply) then I end up with 3/4 because server 04 never deploys. It’s as if something is “stuck”.
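
The "3/1" mismatch described above can also be picked out mechanically. This is a rough sketch that scans the plain-text table from 'ceph orch ls' and reports every service whose RUNNING count differs from the expected count. It assumes the column layout shown in this thread, with the first "N/M" token on each row being the RUNNING column. The function name is hypothetical, and the sample rows are copied from the output above.

```python
import re
from typing import Dict, Tuple


def mismatched_services(orch_ls_text: str) -> Dict[str, Tuple[int, int]]:
    """Map service name -> (running, expected) for rows where the two differ."""
    mismatches = {}
    for line in orch_ls_text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        for field in fields[1:]:
            match = re.fullmatch(r"(\d+)/(\d+)", field)
            if match:
                running, expected = int(match.group(1)), int(match.group(2))
                if running != expected:
                    mismatches[fields[0]] = (running, expected)
                break  # only the first N/M token on a row is the RUNNING column
    return mismatches


SAMPLE_ORCH_LS = """\
NAME                               PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager                       ?:9093,9094      1/1  9m ago     3M   count:1
crash                                             15/15  10m ago    3M   *
iscsi.iscsi                                         3/1  10m ago    11m  cxcto-c240-j27-03.cisco.com
mgr                                                 2/2  9m ago     3M   count:2
osd.dashboard-admin-1622751032319               326/341  10m ago    3M   *
"""

if __name__ == "__main__":
    for name, (running, expected) in mismatched_services(SAMPLE_ORCH_LS).items():
        print(f"{name}: {running} running, {expected} expected")
```

A running count above the expected one, as with iscsi.iscsi here, points at daemons that were never removed. A count below it points at daemons that never deployed.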
>>>> 
>>>> Any ideas where to look / log files that might help figure out what’s happening?
>>>> 
>>>> -Paul
>>>> 
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx