IP 192.168.23.62 belongs to one of my OSDs that was still booting when the reconnect attempts happened. What makes me wonder is that it's the only one listed, even though there are a few similar ones in the cluster.
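(In case it helps with the mapping: the commands below are what I'm using to tie that address back to a daemon. Nothing fancy, just the OSD map; the grep pattern is obviously specific to my cluster.)

  # find which osd.N announces this address
  ceph osd dump | grep 192.168.23.62

  # then check whether that OSD is actually up/in
  ceph osd tree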
On 04.05.23 16:55, Adam King wrote:
What specifically does `ceph log last 200 debug cephadm` spit out? The log lines you've posted so far are, I think, not generated by the orchestrator, so I'm curious what the last actions it took were (and how long ago).

On Thu, May 4, 2023 at 10:35 AM Thomas Widhalm <widhalmt@xxxxxxxxxxxxx> wrote:

To completely rule out hung processes, I managed to get another short shutdown. Now I'm seeing lots of log lines like:

  mgr.server handle_open ignoring open from mds.mds01.ceph01.usujbi v2:192.168.23.61:6800/2922006253; not ready for session (expect reconnect)
  mgr finish mon failed to return metadata for mds.mds01.ceph02.otvipq: (2) No such file or directory

Seems like it now realises that some of this information is stale. But it looks like it's just waiting for it to come back and not doing anything about it.

On 04.05.23 14:48, Eugen Block wrote:

Hi,

try setting debug logs for the mgr:

  ceph config set mgr mgr/cephadm/log_level debug

This should provide more details about what the mgr is trying and where it's failing, hopefully. Last week this helped me identify an issue on a lower Pacific version. Do you see anything in the cephadm.log pointing to the mgr actually trying something?

Zitat von Thomas Widhalm <widhalmt@xxxxxxxxxxxxx>:

Hi,

I'm in the process of upgrading my cluster from 17.2.5 to 17.2.6, but the following problem already existed when I was still on 17.2.5 everywhere.

I had a major issue in my cluster which could be solved with a lot of your help and even more trial and error. Right now it seems that most of it is already fixed, but I can't rule out that there's still some hidden problem. The very issue I'm asking about started during the repair.

When I want to orchestrate the cluster, it logs the command but doesn't do anything, no matter if I use the Ceph dashboard or "ceph orch" in "cephadm shell". I don't get any error message when I try to deploy new services, redeploy them etc. The log only says "scheduled" and that's it. Same when I change placement rules. Usually I use tags, but since they don't work anymore either, I tried host and unmanaged placements. No success. The only way I can actually start and stop containers is via systemctl on the host itself.

When I run "ceph orch ls" or "ceph orch ps" I see services I deployed for testing being deleted (for weeks now). And especially a lot of old MDS are listed as "error" or "starting". The list doesn't match reality at all because I had to start them by hand.

I tried "ceph mgr fail" and even a complete shutdown of the whole cluster with all nodes, including all mgr, mds and even osd daemons, everything during a maintenance window. Didn't change anything.

Could you help me? To be honest, I'm still rather new to Ceph, and since I didn't find anything in the logs that caught my eye, I would be thankful for hints on how to debug.

Cheers,
Thomas

--
http://www.widhalm.or.at
GnuPG: 6265BAE6, A84CB603
Threema: H7AV7D33
Telegram, Signal: widhalmt@xxxxxxxxxxxxx
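Putting the suggestions from this thread together, the whole debug sequence in one place (the final revert via "ceph config rm" and the log_to_cluster_level note are additions for completeness, not quoted from anyone above):

  # raise the cephadm module's log level, as Eugen suggested
  ceph config set mgr mgr/cephadm/log_level debug

  # restart the active mgr so the module comes back with debug logging
  ceph mgr fail

  # after giving the orchestrator a few minutes, dump its log, as Adam suggested
  ceph log last 200 debug cephadm

  # if the cephadm channel stays quiet, the cluster-log level is a separate option:
  ceph config set mgr mgr/cephadm/log_to_cluster_level debug

  # drop the overrides again when done
  ceph config rm mgr mgr/cephadm/log_level
  ceph config rm mgr mgr/cephadm/log_to_cluster_level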
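Independent of the log level, since the orchestrator acknowledges commands but never acts on them, a basic sanity check might be worthwhile (generic commands, nothing assumed about this particular cluster):

  # is an orchestrator backend configured and reported as available?
  ceph orch status

  # is the cephadm mgr module actually enabled?
  ceph mgr module ls

  # force a refresh of the cached inventory instead of trusting stale state
  ceph orch ps --refresh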
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx