Re: Orchestration seems not to work

Hi,

I tried a lot of different approaches, but I haven't had any success so far.

"ceph orch ps" still doesn't get refreshed.

Some examples:

mds.mds01.ceph06.huavsw ceph06 starting - - - - <unknown> <unknown> <unknown>
mds.mds01.ceph06.rrxmks ceph06 error 4w ago 3M - - <unknown> <unknown> <unknown>
mds.mds01.ceph07.omdisd ceph07 error 4w ago 4M - - <unknown> <unknown> <unknown>
mds.mds01.ceph07.vvqyma ceph07 starting - - - - <unknown> <unknown> <unknown>
mgr.ceph04.qaexpv ceph04 *:8443,9283 running (4w) 4w ago 10M 551M - 17.2.6 9cea3956c04b 33df84e346a0
mgr.ceph05.jcmkbb ceph05 *:8443,9283 running (4w) 4w ago 4M 441M - 17.2.6 9cea3956c04b 1ad485df4399
mgr.ceph06.xbduuf ceph06 *:8443,9283 running (4w) 4w ago 4M 432M - 17.2.6 9cea3956c04b 5ba5fd95dc48
mon.ceph04 ceph04 running (4w) 4w ago 4M 223M 2048M 17.2.6 9cea3956c04b 8b6116dd216f
mon.ceph05 ceph05 running (4w) 4w ago 4M 326M 2048M 17.2.6 9cea3956c04b 70520d737f29
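
For reference, is forcing a refresh like this the right approach? Just a sketch of what I'd run (the --refresh flag is my assumption from the orch CLI help, so please correct me if that's not the right knob):

ceph orch ps --refresh              # ask cephadm to refresh the daemon inventory
ceph orch device ls --refresh       # same for the device inventory
ceph orch ps --daemon-type mds      # narrow the output down to the MDS daemons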

The debug log doesn't show anything that could help me, either.

2023-05-15T14:48:40.852088+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1376 : cephadm [INF] Schedule start daemon mds.mds01.ceph04.hcmvae
2023-05-15T14:48:43.620700+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1380 : cephadm [INF] Schedule redeploy daemon mds.mds01.ceph04.hcmvae
2023-05-15T14:48:45.124822+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1392 : cephadm [INF] Schedule start daemon mds.mds01.ceph04.krxszj
2023-05-15T14:48:46.493902+0000 mgr.ceph05.jcmkbb (mgr.83897390) 1394 : cephadm [INF] Schedule redeploy daemon mds.mds01.ceph04.krxszj
2023-05-15T15:05:25.637079+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2629 : cephadm [INF] Saving service mds.mds01 spec with placement count:2
2023-05-15T15:07:27.625773+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2780 : cephadm [INF] Saving service mds.fs_name spec with placement count:3
2023-05-15T15:07:42.120912+0000 mgr.ceph05.jcmkbb (mgr.83897390) 2795 : cephadm [INF] Saving service mds.mds01 spec with placement count:3

I can see all the commands I issue, but I don't get any more information on why they're not actually being carried out.
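
In case I'm simply watching the wrong thing: the lines above are from the cluster log. My understanding from the cephadm troubleshooting docs is that the cephadm channel can be watched live roughly like this (just a sketch; log_to_cluster_level is the option I believe is meant here):

ceph config set mgr mgr/cephadm/log_to_cluster_level debug   # send debug messages to the cluster log
ceph -W cephadm --watch-debug                                # watch the cephadm channel live
ceph log last 100 debug cephadm                              # or dump the last 100 debug messages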

I tried different scheduling mechanisms: host, tag, unmanaged, and back again. I paused orchestration and resumed it. I failed the mgr. I even had full cluster stops (in the past). I made sure all daemons run the same version. (If you remember, the upgrade failed along the way.)
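
To make sure we mean the same thing by "paused orchestration and resumed it" and "failed the mgr", this is roughly the sequence I mean (just a sketch of the commands):

ceph orch status     # shows the backend (cephadm), whether it's available, and whether it's paused
ceph orch pause      # what I used to turn orchestration off
ceph orch resume     # and to turn it back on
ceph mgr fail        # fail over to a standby mgr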

So the only way I can get daemons running is manually. I added two more hosts and tagged them, but not a single daemon has been started there.

Could you help me again with figuring out how to debug orchestration not working?


On 04.05.23 15:12, Thomas Widhalm wrote:
Thanks.

I set the log level to debug, try a few steps and then come back.

On 04.05.23 14:48, Eugen Block wrote:
Hi,

try setting debug logs for the mgr:

ceph config set mgr mgr/cephadm/log_level debug

This should hopefully provide more details about what the mgr is trying to do and where it's failing. Last week this helped me identify an issue on an older Pacific cluster. Do you see anything in the cephadm.log pointing to the mgr actually trying something?
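
And just so we're looking at the same logs, a rough sketch (paths and unit names as they usually look on a cephadm host, fill in your own fsid and mgr name):

tail -f /var/log/ceph/cephadm.log                 # host-level cephadm log, on each host
journalctl -u ceph-<fsid>@mgr.<name>.service      # systemd journal of the mgr container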


Quoting Thomas Widhalm <widhalmt@xxxxxxxxxxxxx>:

Hi,

I'm in the process of upgrading my cluster from 17.2.5 to 17.2.6, but the following problem already existed when I was still on 17.2.5 everywhere.

I had a major issue in my cluster which could be solved with a lot of your help and even more trial and error. Right now it seems that most of it is already fixed, but I can't rule out that there's still some hidden problem. The very issue I'm asking about started during the repair.

When I want to orchestrate the cluster, it logs the command but doesn't do anything, no matter whether I use the Ceph dashboard or "ceph orch" in a "cephadm shell". I don't get any error message when I try to deploy new services, redeploy them, etc. The log only says "scheduled" and that's it. The same happens when I change placement rules. Usually I use tags, but since they don't work anymore either, I tried host placement and unmanaged. No success. The only way I can actually start and stop containers is via systemctl on the host itself.

When I run "ceph orch ls" or "ceph orch ps" I see services I deployed for testing that have been stuck being deleted for weeks now. And especially a lot of old MDS daemons are listed as "error" or "starting". The list doesn't match reality at all, because I had to start them by hand.

I tried "ceph mgr fail" and even a complete shutdown of the whole cluster with all nodes including all mgs, mds even osd - everything during a maintenance window. Didn't change anything.

Could you help me? To be honest, I'm still rather new to Ceph, and since I didn't find anything in the logs that caught my eye, I would be thankful for hints on how to debug this.

Cheers,
Thomas
--
http://www.widhalm.or.at
GnuPG : 6265BAE6 , A84CB603
Threema: H7AV7D33
Telegram, Signal: widhalmt@xxxxxxxxxxxxx



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
