Update: I did a systematic fail of all MDSes that didn't report, starting with the stand-by daemons and continuing from high to low ranks. One by one they started showing up again with version and stats, and the fail went as usual, with one exception: rank 0. The moment I failed rank 0, it took 5 other MDSes down with it. This is, in fact, the second time I have seen such an event, where failing one MDS crashes others. Given the weird observation in my previous e-mail, together with what I saw when restarting everything, does this indicate a problem with data integrity, or is this an annoying yet harmless bug?

Thanks for any help!

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Sunday, September 10, 2023 12:39 AM
To: ceph-users@xxxxxxx
Subject: MDS daemons don't report any more

Hi all,

I am making a weird observation: 8 out of 12 MDS daemons no longer seem to report to the cluster:

# ceph fs status
con-fs2 - 1625 clients
=======
RANK  STATE    MDS     ACTIVITY      DNS    INOS
 0    active  ceph-16  Reqs:    0 /s     0      0
 1    active  ceph-09  Reqs:  128 /s  4251k  4250k
 2    active  ceph-17  Reqs:    0 /s     0      0
 3    active  ceph-15  Reqs:    0 /s     0      0
 4    active  ceph-24  Reqs:  269 /s  3567k  3567k
 5    active  ceph-11  Reqs:    0 /s     0      0
 6    active  ceph-14  Reqs:    0 /s     0      0
 7    active  ceph-23  Reqs:    0 /s     0      0
        POOL            TYPE      USED   AVAIL
    con-fs2-meta1      metadata   2169G  7081G
    con-fs2-meta2        data        0   7081G
    con-fs2-data         data     1248T  4441T
con-fs2-data-ec-ssd      data      705G  22.1T
    con-fs2-data2        data     3172T  4037T
STANDBY MDS
  ceph-08
  ceph-10
  ceph-12
  ceph-13
                                    VERSION                                        DAEMONS
                                      None                                         ceph-16, ceph-17, ceph-15, ceph-11, ceph-14, ceph-23, ceph-10, ceph-12
ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)   ceph-09, ceph-24, ceph-08, ceph-13

The version is "None" for these daemons and there are no stats. "ceph versions" reports only 4 of the 12 MDSes; the other 8 are not shown at all:

[root@gnosis ~]# ceph versions
{
    "mon": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "mgr": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "osd": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1282
    },
    "mds": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 4
    },
    "overall": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1296
    }
}

"ceph status" reports everything as up and OK:

[root@gnosis ~]# ceph status
  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 2w)
    mgr: ceph-03(active, since 61s), standbys: ceph-25, ceph-01, ceph-02, ceph-26
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1284 osds: 1282 up (since 31h), 1282 in (since 33h); 567 remapped pgs

  data:
    pools:   14 pools, 25065 pgs
    objects: 2.14G objects, 3.7 PiB
    usage:   4.7 PiB used, 8.4 PiB / 13 PiB avail
    pgs:     79908208/18438361040 objects misplaced (0.433%)
             23063 active+clean
             1225  active+clean+snaptrim_wait
             317   active+remapped+backfill_wait
             250   active+remapped+backfilling
             208   active+clean+snaptrim
             2     active+clean+scrubbing+deep

  io:
    client:   596 MiB/s rd, 717 MiB/s wr, 4.16k op/s rd, 3.04k op/s wr
    recovery: 8.7 GiB/s, 3.41k objects/s

My first thought was that the status module had failed. However, I cannot restart it, because it is an always-on module, and an MGR fail-over did not help either. Any ideas what is going on here?
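For completeness, what I tried was roughly the following (the exact invocations may have differed slightly, and <active mgr> stands for whichever MGR was active at the time). The status module shows up under "always_on_modules" in "ceph mgr module ls", so disabling and re-enabling it to force a restart is refused, and the MGR fail-over did not bring the missing versions and stats back either:

[root@gnosis ~]# ceph mgr module ls
[root@gnosis ~]# ceph mgr module disable status
[root@gnosis ~]# ceph mgr fail <active mgr>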
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx