I moved some servers to a new rack, and now that things are back up I'm having major issues with Ceph. I believe the problem may be related to the Ceph nodes coming back up with different IPs before the VLANs were configured. That's just a guess, because I can't think of any other reason this would happen.

Current state:

Every 2.0s: ceph -s                    cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022

  cluster:
    id:     bfa2ad58-c049-11eb-9098-3c8cf8ed728d
    health: HEALTH_WARN
            1 filesystem is degraded
            2 MDSs report slow metadata IOs
            2/5 mons down, quorum cn02,cn03,cn01
            9 osds down
            3 hosts (17 osds) down
            Reduced data availability: 97 pgs inactive, 9 pgs down
            Degraded data redundancy: 13860144/30824413 objects degraded (44.965%), 411 pgs degraded, 482 pgs undersized

  services:
    mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: cn05, cn04
    mgr: cn02.arszct(active, since 5m)
    mds: 2/2 daemons up, 2 standby
    osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped pgs

  data:
    volumes: 1/2 healthy, 1 recovering
    pools:   8 pools, 545 pgs
    objects: 7.71M objects, 6.7 TiB
    usage:   15 TiB used, 39 TiB / 54 TiB avail
    pgs:     0.367% pgs unknown
             17.431% pgs not active
             13860144/30824413 objects degraded (44.965%)
             1137693/30824413 objects misplaced (3.691%)
             280 active+undersized+degraded
              67 undersized+degraded+remapped+backfilling+peered
              57 active+undersized+remapped
              45 active+clean+remapped
              44 active+undersized+degraded+remapped+backfilling
              18 undersized+degraded+peered
              10 active+undersized
               9 down
               7 active+clean
               3 active+undersized+remapped+backfilling
               2 active+undersized+degraded+remapped+backfill_wait
               2 unknown
               1 undersized+peered

  io:
    client:   170 B/s rd, 0 op/s rd, 0 op/s wr
    recovery: 168 MiB/s, 158 keys/s, 166 objects/s

I have to disable and re-enable the dashboard just to use it; it gets bogged down after a few moments. Ceph has marked the three servers that were moved to the new rack as "down", but if I run a cephadm host-check, every host seems to pass:

************************ ceph ************************
--------- cn01.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn02.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn03.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn04.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn05.ceph.---------
podman|docker (/usr/bin/podman) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn06.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK

The cluster seems to be recovering with what it has left, but a large number of OSDs are down.
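Since the address change is my only theory, the next thing I'm planning to do is compare the addresses the cluster expects with what the moved hosts actually have now. Nothing exotic, just the standard commands from a cephadm shell on one of the mons still in quorum:

  # addresses the monmap still expects for each mon, including cn04/cn05
  ceph mon dump

  # the public network the daemons are supposed to bind to
  ceph config get mon public_network

  # the address cephadm has on record for each host
  ceph orch host ls

  # and, on each moved host, the addresses it actually has right now
  ip -4 addr show

If the monmap still lists the old addresses for cn04/cn05, I assume that would explain why those two are out of quorum, but I'd welcome a sanity check on that.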
When I try to restart one of the downed OSDs, I see a huge dump:

Jul 25 03:19:38 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:19:38.532+0000 7fce14a6c080 0 osd.34 30689 done with init, starting boot process
Jul 25 03:19:38 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:19:38.532+0000 7fce14a6c080 1 osd.34 30689 start_boot
Jul 25 03:20:10 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:20:10.655+0000 7fcdfd12d700 1 osd.34 30689 start_boot
Jul 25 03:20:41 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:20:41.159+0000 7fcdfd12d700 1 osd.34 30689 start_boot
Jul 25 03:21:11 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:21:11.662+0000 7fcdfd12d700 1 osd.34 30689 start_boot

At this point it just keeps printing start_boot, yet the dashboard shows the OSD as "in" but "down". On the three hosts that moved, a bunch of OSDs are marked "out" and "down", and some "in" but "down".

I'm not sure where to go from here. For now I'm letting the recovery continue and hoping that the 4x replication on these pools saves me. Any help is very much appreciated; this Ceph cluster holds all of our CloudStack images, and it would be terrible to lose that data.
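In case it helps with diagnosis, this is what I'm planning to look at next for the stuck OSDs, using osd.34 from the log above as the example. Again, just the standard tooling, nothing clever:

  # which OSDs and hosts the cluster still considers down
  ceph osd tree down

  # the addresses osd.34 last registered with the mons (these may be stale from before the move)
  ceph osd metadata 34 | grep addr

  # the networks the OSDs are expected to use
  ceph config get osd public_network
  ceph config get osd cluster_network

  # the full daemon log on cn06 itself, beyond the snippet above
  cephadm logs --name osd.34

My working assumption is that an OSD looping on start_boot isn't getting its boot acknowledged by the mons, which would fit the address/VLAN theory, but I'd appreciate confirmation from anyone who has seen this before.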