Re: Cephadm not properly adding / removing iscsi services anymore

Eugen Block <eblock@xxxxxx> · Wed, 08 Sep 2021 14:11:21 +0000

Okay, I'm glad it worked!

At first I tried cephadm rm-daemon on the bootstrap node that I  
usually do all management from and it indicated that it could not  
remove the daemon:

[root@cxcto-c240-j27-01 ~]# cephadm rm-daemon --name iscsi.cxcto-c240-j27-04.lgqtxo --fsid 4a29e724-c4a6-11eb-b14a-5c838f8013a5
ERROR: Daemon not found: iscsi.cxcto-c240-j27-04.lgqtxo. See `cephadm ls`

When I would do ‘cephadm ls’ I only saw services running locally on  
that server, not the whole cluster. I’m not sure if this is expected  
or not.

As far as I can tell this is expected, yes: 'cephadm ls' only lists
the daemons on the local host. I only have a lab environment with
containers (we're still hesitant to upgrade to Octopus), but all of
my virtual nodes have cephadm installed. I thought that was a
requirement, though I may be wrong. It definitely helps with
debugging: for example, 'cephadm enter --name <daemon>' gives you a
shell in that container, and 'cephadm logs --name <daemon>' lets you
inspect that daemon's logs.
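
Since 'cephadm ls' is host-local, the removal has to be issued on the
host that actually runs the stale daemon. A minimal sketch of how one
might find the local iscsi daemon names first is below. The helper name
list_local_iscsi is hypothetical, and it assumes that 'cephadm ls'
prints JSON in which each daemon entry has a "name" key; it is a rough
text filter, not a real JSON parser.

```shell
#!/bin/sh
# list_local_iscsi (hypothetical helper): read `cephadm ls` JSON on stdin
# and print the names of iscsi daemons found in it.
# Assumption: every daemon entry carries a "name": "<type>.<id>" pair.
list_local_iscsi() {
    # Extract each `"name": "iscsi...."` pair, then keep only the value.
    grep -o '"name": *"iscsi\.[^"]*"' | cut -d'"' -f4
}

# Intended use, on the host that runs the daemon (not the bootstrap node):
#   cephadm ls | list_local_iscsi
#   cephadm rm-daemon --name <daemon> --fsid <fsid>
```

If the filter prints nothing, the daemon is not on that host, which
would match the "Daemon not found" error seen on the bootstrap node.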

Quoting "Paul Giralt (pgiralt)" <pgiralt@xxxxxxxxx>:

Thanks Eugen.

At first I tried cephadm rm-daemon on the bootstrap node that I  
usually do all management from and it indicated that it could not  
remove the daemon:

[root@cxcto-c240-j27-01 ~]# cephadm rm-daemon --name iscsi.cxcto-c240-j27-04.lgqtxo --fsid 4a29e724-c4a6-11eb-b14a-5c838f8013a5
ERROR: Daemon not found: iscsi.cxcto-c240-j27-04.lgqtxo. See `cephadm ls`

When I would do ‘cephadm ls’ I only saw services running locally on  
that server, not the whole cluster. I’m not sure if this is expected  
or not. I then installed cephadm on the cxcto-c240-j27-04 server,
issued the command there, and it worked. As soon as I did this, the
containers on the other two servers that were not supposed to be
running the iscsi gateway were removed and everything appeared to be
back to normal. I then added one server back to the yaml file and
applied it on the original bootstrap node, and it got deployed
properly, so it appears that everything is working again. Somehow
deleting that daemon on the 04 server unblocked everything.

Still not exactly sure why that fixed it, but at least it’s working  
again. Thanks for the suggestion.

-Paul

On Sep 8, 2021, at 4:12 AM, Eugen Block <eblock@xxxxxx> wrote:

If you only configured 1 iscsi gw but you see 3 running, have you
tried to destroy them with 'cephadm rm-daemon --name ...'? On the
active MGR host, run 'journalctl -f' and you'll see plenty of
information, which should also include details about the iscsi
deployment. Or run 'cephadm logs --name <iscsi-gw>'.

Quoting "Paul Giralt (pgiralt)" <pgiralt@xxxxxxxxx>:

This was working until recently and now seems to have stopped  
working. Running Pacific 16.2.5. When I modify the deployment YAML  
file for my iscsi gateways, the services are not being added or  
removed as requested. It’s as if the state is “stuck”.

At one point I had 4 iSCSI gateways: 02, 03, 04 and 05. Through  
some back and forth of deploying and undeploying, I ended up in a  
state where the services are running on servers 02, 03, and 05 no  
matter what I tell cephadm to do. For example, right now I have  
the following configuration:

service_type: iscsi
service_id: iscsi
placement:
 hosts:
   - cxcto-c240-j27-03.cisco.com
spec:
 pool: iscsi-config
… removed the rest of this file ….

However ceph orch ls shows this:

[root@cxcto-c240-j27-01 ~]# ceph orch ls
NAME                               PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager                       ?:9093,9094      1/1  9m ago     3M   count:1
crash                                             15/15  10m ago    3M   *
grafana                            ?:3000           1/1  9m ago     3M   count:1
iscsi.iscsi                                         3/1  10m ago    11m  cxcto-c240-j27-03.cisco.com
mgr                                                 2/2  9m ago     3M   count:2
mon                                                 5/5  9m ago     12d  cxcto-c240-j27-01.cisco.com;cxcto-c240-j27-06.cisco.com;cxcto-c240-j27-08.cisco.com;cxcto-c240-j27-10.cisco.com;cxcto-c240-j27-12.cisco.com
node-exporter                      ?:9100         15/15  10m ago    3M   *
osd.dashboard-admin-1622750977792                  0/15  -          3M   *
osd.dashboard-admin-1622751032319               326/341  10m ago    3M   *
prometheus                         ?:9095           1/1  9m ago     3M   count:1

Notice it shows 3/1 because the service is still running on 3  
servers even though I’ve told it to only run on one. If I  
configure all 4 servers and apply (ceph orch apply) then I end up  
with 3/4 because server 04 never deploys. It’s as if something is  
“stuck”.
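
A mismatch like the 3/1 above can be picked out of the listing
mechanically. The following is only a sketch: the function name
flag_mismatched is made up for illustration, and it assumes the
plain-text 'ceph orch ls' layout shown in this message, where the
RUNNING column is the first field of the form <running>/<expected>.

```shell
#!/bin/sh
# flag_mismatched (hypothetical helper): read `ceph orch ls` text output on
# stdin and print each service whose running count differs from the
# expected count, together with its RUNNING value.
flag_mismatched() {
    awk 'NR > 1 {                          # skip the header line
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^[0-9]+\/[0-9]+$/) { # first field shaped like N/M
                split($i, n, "/")
                if (n[1] != n[2]) print $1, $i
                break
            }
        }
    }'
}

# Intended use on a node with an admin keyring:
#   ceph orch ls | flag_mismatched
```

Against the listing in this message it would flag iscsi.iscsi (3/1)
along with the two osd.dashboard-admin services, whose counts also
differ.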

Any ideas where to look / log files that might help figure out  
what’s happening?

-Paul

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
