Actually, the cluster is in an error state, due (I think) to these problems:
ceph -s
  cluster:
    id:     lksdjf
    health: HEALTH_ERR
            18 failed cephadm daemon(s)
            2 filesystems are degraded
            1 filesystem has a failed mds daemon
            1 filesystem is offline
            1 mds daemon damaged
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum ceph02-hn02,ceph02-hn03,ceph02-hn04 (age 7m)
    mgr: ceph02-hn02.ofencx(active, since 29m), standbys: ceph02-hn03.dxswor
    mds: 0/2 daemons up (1 failed)
    osd: 264 osds: 264 up (since 24h), 263 in (since 2m)

  data:
    volumes: 0/2 healthy, 1 recovering, 1 failed; 1 damaged
    pools:   16 pools, 2177 pgs
    objects: 253.57M objects, 154 TiB
    usage:   338 TiB used, 3.1 PiB / 3.4 PiB avail
    pgs:     2177 active+clean

  io:
    client: 1.2 KiB/s rd, 39 MiB/s wr, 0 op/s rd, 19 op/s wr
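
To drill into the MDS and filesystem errors, these are the next commands I
plan to run (all standard Ceph/cephadm commands, nothing cluster-specific):

  ceph health detail   # expands each HEALTH_ERR item above
  ceph fs status       # per-filesystem ranks and MDS states
  ceph mds stat        # compact summary of the MDS map
  cephadm ls           # run directly on a host; lists that host's daemons
                       # without going through the (hanging) orchestrator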
Below are some entries I found in the logs:
Nov 1 09:47:12 ceph02-hn02 podman[645065]: 2023-11-01 09:47:12.812391169 +0100 CET m=+0.017512707 container died 3158b09500dc0ef3a8ce3282c87c5b4b5aae8d490343cae569752e69776c7683 (image=quay.io/ceph/ceph@sha256:673b48521fd53e1b4bc7dda96335505c4d4b2e13d7bb92bf2e7782e2083094c9, name=ceph-<ceph-id>-mgr-ceph02-hn02-ofencx, GIT_BRANCH=HEAD, ceph=True, GIT_CLEAN=True, org.label-schema.build-date=20230622, org.label-schema.name=CentOS Stream 8 Base Image, GIT_REPO=https://github.com/ceph/ceph-container.git, org.label-schema.schema-version=1.0, CEPH_POINT_RELEASE=-17.2.6, org.label-schema.vendor=CentOS, org.label-schema.license=GPLv2, GIT_COMMIT=e0efdfe8a55d4257c30bd4991364ca6f2fc7e58e, io.buildah.version=1.19.8, maintainer=Guillaume Abrioux <gabrioux@xxxxxxxxxx>, RELEASE=HEAD)
Nov 1 09:47:12 ceph02-hn02 podman[645065]: 2023-11-01 09:47:12.822679389 +0100 CET m=+0.027800907 container remove 3158b09500dc0ef3a8ce3282c87c5b4b5aae8d490343cae569752e69776c7683 (image=quay.io/ceph/ceph@sha256:673b48521fd53e1b4bc7dda96335505c4d4b2e13d7bb92bf2e7782e2083094c9, name=ceph-<ceph-id>-mgr-ceph02-hn02-ofencx, GIT_CLEAN=True, org.label-schema.license=GPLv2, org.label-schema.vendor=CentOS, maintainer=Guillaume Abrioux <gabrioux@xxxxxxxxxx>, GIT_BRANCH=HEAD, GIT_REPO=https://github.com/ceph/ceph-container.git, org.label-schema.name=CentOS Stream 8 Base Image, ceph=True, io.buildah.version=1.19.8, org.label-schema.schema-version=1.0, GIT_COMMIT=e0efdfe8a55d4257c30bd4991364ca6f2fc7e58e, RELEASE=HEAD, CEPH_POINT_RELEASE=-17.2.6, org.label-schema.build-date=20230622)
Nov 1 09:47:12 ceph02-hn02 systemd[1]: ceph-<ceph-id>@mgr.ceph02-hn02.ofencx.service: Main process exited, code=exited, status=137/n/a
Nov 1 09:47:13 ceph02-hn02 systemd[1]: ceph-<ceph-id>@mgr.ceph02-hn02.ofencx.service: Failed with result 'exit-code'.
Nov 1 09:47:13 ceph02-hn02 systemd[1]: ceph-<ceph-id>@mgr.ceph02-hn02.ofencx.service: Consumed 25.730s CPU time.
Nov 1 09:47:23 ceph02-hn02 systemd[1]: ceph-<ceph-id>@mgr.ceph02-hn02.ofencx.service: Scheduled restart job, restart counter is at 3.
Nov 1 09:47:23 ceph02-hn02 systemd[1]: Stopped Ceph mgr.ceph02-hn02.ofencx for <ceph-id>.
Nov 1 09:47:23 ceph02-hn02 systemd[1]: ceph-<ceph-id>@mgr.ceph02-hn02.ofencx.service: Consumed 25.730s CPU time.
Nov 1 09:47:23 ceph02-hn02 systemd[1]: Starting Ceph mgr.ceph02-hn02.ofencx for <ceph-id>...
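
Status 137 is 128+9, i.e. the mgr container is being SIGKILLed and then
restarted by systemd; that often points at the kernel OOM killer or at
something stopping the container from outside. To check that theory I am
looking at the following (standard tools; the unit/daemon names are taken
from the log above, <ceph-id> is the masked fsid):

  # any OOM-killer activity around 09:47?
  dmesg -T | grep -i -e oom -e "killed process"
  # container log of the failing mgr
  cephadm logs --name mgr.ceph02-hn02.ofencx
  # recent journal of the mgr unit
  journalctl -u "ceph-<ceph-id>@mgr.ceph02-hn02.ofencx.service" -n 200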
I am still looking for more information.
Thank you for your answer.
On Wed, Nov 1, 2023 at 5:25 PM Eugen Block <eblock@xxxxxx> wrote:
Hi,
please provide more details about your cluster, especially the 'ceph -s'
output. Is the cluster healthy? Apparently other ceph commands work, but
you could share the mgr logs anyway; maybe the hive mind will find
something. ;-)
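For example, the standard cephadm troubleshooting steps (adjust as needed):

  # raise the cephadm log level and watch the messages live
  ceph config set mgr mgr/cephadm/log_to_cluster_level debug
  ceph -W cephadm --watch-debug
  # or dump the most recent cephadm cluster log entries
  ceph log last cephadm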
Don't forget to mask sensitive data.
Regards,
Eugen
Quoting Dario Graña <dgrana@xxxxxx>:
> Hi everyone!
>
> I have a Ceph cluster running AlmaLinux 9, Podman and Ceph Quincy
> (17.2.6). Since yesterday I have been having some problems; the latest
> one is that the ceph orch command hangs. I have looked through the logs
> but found nothing relevant that would help fix the problem.
> Podman shows the daemons as running, and if I stop one daemon it comes
> back after a few seconds.
> I also tried the *ceph mgr fail* command, and the active mgr changes,
> but ceph orch still does not work. ceph orch pause/resume do not work
> either, and disabling and re-enabling the *cephadm* module didn't help.
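> For reference, these are the exact commands I tried (assuming I am
> reading them back correctly from my shell history):
>
>   ceph mgr fail                      # fail over to the standby mgr
>   ceph orch pause
>   ceph orch resume
>   ceph mgr module disable cephadm
>   ceph mgr module enable cephadm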
>
> Any help in understanding what's going on would be welcome.
>
> Thanks in advance.
>
> --
> Dario Graña
> PIC (Port d'Informació Científica)
> Campus UAB, Edificio D
> E-08193 Bellaterra, Barcelona
> http://www.pic.es
> Avis - Aviso - Legal Notice: http://legal.ifae.es
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx