Hi Patrick,

I'm not sure that it's exactly the same issue. I observed that "ceph tell mds.xyz session ls" showed all counters at 0. On the Friday before, we had a power loss on a rack that took out a JBOD with a few metadata disks, and I suspect that the reporting of zeroes started after this crash. No hard evidence, though.

I uploaded all logs with a bit of explanation to tag 1c022c43-04a7-419d-bdb0-e33c97ef06b8. I don't have any more detail than that recorded. It was 3 other MDSes that restarted on the fail of rank 0, not 5 as I wrote before. The readme.txt contains pointers to find the info in the logs.

We didn't have any user complaints. Therefore, I'm reasonably confident that the file system was actually accessible the whole time (from Friday afternoon until Sunday night, when I restarted everything).

I hope you can find something useful.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
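P.S.: For illustration only (I don't have the exact loop recorded), a check along the following lines is what I mean by "all counters 0". It sums the per-session counter fields num_caps and num_leases for each active MDS; it assumes jq is installed, and the daemon names are the ones from the "ceph fs status" output quoted further down:

  # sketch, not the recorded command: sum session counters per active MDS
  for mds in ceph-16 ceph-09 ceph-17 ceph-15 ceph-24 ceph-11 ceph-14 ceph-23; do
      echo -n "$mds: "
      # "session ls" returns a JSON array with one object per client session
      ceph tell mds.$mds session ls 2>/dev/null \
          | jq '[.[] | .num_caps + .num_leases] | add'
  done

The daemons with zeroed counters stood out immediately against the ones that still reported.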
________________________________________
From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: Monday, September 11, 2023 7:51 PM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: Re: MDS daemons don't report any more

Hello Frank,

On Mon, Sep 11, 2023 at 8:39 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Update: I did a systematic fail of all MDSes that didn't report, starting with the stand-by daemons and continuing from high to low ranks. One by one they started showing up again with version and stats, and each fail went as usual, with one exception: rank 0.

It might be https://tracker.ceph.com/issues/24403

> The moment I failed rank 0, it took 5 other MDSes down with it.

Can you be more precise about what happened? Can you share logs?

> This is, in fact, the second time I have seen such an event where failing an MDS crashes others. Given the weird observation in my previous e-mail, together with what I saw when restarting everything, does this indicate a problem with data integrity, or is it an annoying yet harmless bug?

It sounds like an annoyance, but certainly one we'd like to track down.

Keep in mind that the "fix" for https://tracker.ceph.com/issues/24403
is not going to Octopus. You need to upgrade, and Pacific will soon be EOL.

> Thanks for any help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: Sunday, September 10, 2023 12:39 AM
> To: ceph-users@xxxxxxx
> Subject: MDS daemons don't report any more
>
> Hi all, I am seeing something weird: 8 out of 12 MDS daemons seem not to report to the cluster any more:
>
> # ceph fs status
> con-fs2 - 1625 clients
> =======
> RANK  STATE    MDS      ACTIVITY      DNS    INOS
>  0    active  ceph-16  Reqs:    0 /s     0      0
>  1    active  ceph-09  Reqs:  128 /s  4251k  4250k
>  2    active  ceph-17  Reqs:    0 /s     0      0
>  3    active  ceph-15  Reqs:    0 /s     0      0
>  4    active  ceph-24  Reqs:  269 /s  3567k  3567k
>  5    active  ceph-11  Reqs:    0 /s     0      0
>  6    active  ceph-14  Reqs:    0 /s     0      0
>  7    active  ceph-23  Reqs:    0 /s     0      0
>
>         POOL            TYPE      USED   AVAIL
>    con-fs2-meta1       metadata  2169G   7081G
>    con-fs2-meta2         data        0   7081G
>     con-fs2-data         data    1248T   4441T
> con-fs2-data-ec-ssd      data     705G   22.1T
>    con-fs2-data2         data    3172T   4037T
>
> STANDBY MDS
>   ceph-08
>   ceph-10
>   ceph-12
>   ceph-13
>
> VERSION                                                                            DAEMONS
> None                                                                               ceph-16, ceph-17, ceph-15, ceph-11, ceph-14, ceph-23, ceph-10, ceph-12
> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)   ceph-09, ceph-24, ceph-08, ceph-13
>
> Version is "None" for these and there are no stats. "ceph versions" reports only 4 of the 12 MDSes; 8 are not shown at all:
>
> [root@gnosis ~]# ceph versions
> {
>     "mon": {
>         "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
>     },
>     "mgr": {
>         "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
>     },
>     "osd": {
>         "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1282
>     },
>     "mds": {
>         "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 4
>     },
>     "overall": {
>         "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1296
>     }
> }
>
> Ceph status reports everything as up and OK:
>
> [root@gnosis ~]# ceph status
>   cluster:
>     id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
>     health: HEALTH_OK
>
>   services:
>     mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 2w)
>     mgr: ceph-03(active, since 61s), standbys: ceph-25, ceph-01, ceph-02, ceph-26
>     mds: con-fs2:8 4 up:standby 8 up:active
>     osd: 1284 osds: 1282 up (since 31h), 1282 in (since 33h); 567 remapped pgs
>
>   data:
>     pools:   14 pools, 25065 pgs
>     objects: 2.14G objects, 3.7 PiB
>     usage:   4.7 PiB used, 8.4 PiB / 13 PiB avail
>     pgs:     79908208/18438361040 objects misplaced (0.433%)
>              23063 active+clean
>              1225  active+clean+snaptrim_wait
>              317   active+remapped+backfill_wait
>              250   active+remapped+backfilling
>              208   active+clean+snaptrim
>              2     active+clean+scrubbing+deep
>
>   io:
>     client:   596 MiB/s rd, 717 MiB/s wr, 4.16k op/s rd, 3.04k op/s wr
>     recovery: 8.7 GiB/s, 3.41k objects/s
>
> My first thought was that the status module had failed. However, I cannot restart it (it is an always-on module), and an MGR fail-over did not help.
>
> Any ideas what is going on here?
>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14

--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx