Hi Patrick,

I'm not sure that it's exactly the same issue. I observed that "ceph tell mds.xyz session ls" showed all counters at 0. On the Friday before, we had a power loss on a rack that took out a JBOD with a few metadata disks, and I suspect that the reporting of zeroes started after this crash. No hard evidence, though.

I uploaded all logs with a bit of explanation to tag 1c022c43-04a7-419d-bdb0-e33c97ef06b8. I don't have any more detail than that recorded. It was 3 other MDSes that restarted on the fail of rank 0, not 5 as I wrote before. The readme.txt contains pointers to find the info in the logs.

We didn't have any user complaints. Therefore, I'm reasonably confident that the file system was actually accessible the whole time (from Friday afternoon until Sunday night, when I restarted everything).

I hope you can find something useful.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
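P.S.: For illustration only (I don't have the exact loop recorded), a check along the following lines is what I mean by "all counters 0". It sums the per-session counter fields num_caps and num_leases for each active MDS; it assumes jq is installed, and the daemon names are the ones from the "ceph fs status" output quoted further down:

  # sketch, not the recorded command: sum session counters per active MDS
  for mds in ceph-16 ceph-09 ceph-17 ceph-15 ceph-24 ceph-11 ceph-14 ceph-23; do
      echo -n "$mds: "
      # "session ls" returns a JSON array with one object per client session
      ceph tell mds.$mds session ls 2>/dev/null \
          | jq '[.[] | .num_caps + .num_leases] | add'
  done

The daemons with zeroed counters stood out immediately against the ones that still reported.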
________________________________________
From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: Monday, September 11, 2023 7:51 PM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: Re: MDS daemons don't report any more

Hello Frank,

On Mon, Sep 11, 2023 at 8:39 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Update: I did a systematic fail of all MDSes that didn't report, starting with the stand-by daemons and continuing from high to low ranks. One by one they started showing up again with version and stats, and each fail went as usual, with one exception: rank 0.

It might be https://tracker.ceph.com/issues/24403

> The moment I failed rank 0, it took 5 other MDSes down with it.

Can you be more precise about what happened? Can you share logs?

> This is, in fact, the second time I have seen such an event where failing an MDS crashes others. Given the weird observation in my previous e-mail, together with what I saw when restarting everything, does this indicate a problem with data integrity, or is it an annoying yet harmless bug?

It sounds like an annoyance, but certainly one we'd like to track down.

Keep in mind that the "fix" for https://tracker.ceph.com/issues/24403
is not going to Octopus. You need to upgrade, and Pacific will soon be EOL.

> Thanks for any help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: Sunday, September 10, 2023 12:39 AM
> To: ceph-users@xxxxxxx
> Subject: MDS daemons don't report any more
>
> Hi all, I am seeing something weird: 8 out of 12 MDS daemons seem not to report to the cluster any more:
>
> # ceph fs status
> con-fs2 - 1625 clients
> =======
> RANK  STATE    MDS      ACTIVITY      DNS    INOS
>  0    active  ceph-16  Reqs:    0 /s     0      0
>  1    active  ceph-09  Reqs:  128 /s  4251k  4250k
>  2    active  ceph-17  Reqs:    0 /s     0      0
>  3    active  ceph-15  Reqs:    0 /s     0      0
>  4    active  ceph-24  Reqs:  269 /s  3567k  3567k
>  5    active  ceph-11  Reqs:    0 /s     0      0
>  6    active  ceph-14  Reqs:    0 /s     0      0
>  7    active  ceph-23  Reqs:    0 /s     0      0
>
>         POOL            TYPE      USED   AVAIL
>    con-fs2-meta1       metadata  2169G   7081G
>    con-fs2-meta2         data        0   7081G
>     con-fs2-data         data    1248T   4441T
> con-fs2-data-ec-ssd      data     705G   22.1T
>    con-fs2-data2         data    3172T   4037T
>
> STANDBY MDS
>   ceph-08
>   ceph-10
>   ceph-12
>   ceph-13
>
> VERSION                                                                            DAEMONS
> None                                                                               ceph-16, ceph-17, ceph-15, ceph-11, ceph-14, ceph-23, ceph-10, ceph-12
> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)   ceph-09, ceph-24, ceph-08, ceph-13
>
> Version is "None" for these and there are no stats. "ceph versions" reports only 4 of the 12 MDSes; 8 are not shown at all:
>
> [root@gnosis ~]# ceph versions
> {
>     "mon": {
>         "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
>     },
>     "mgr": {
>         "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
>     },
>     "osd": {
>         "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1282
>     },
>     "mds": {
>         "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 4
>     },
>     "overall": {
>         "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1296
>     }
> }
>
> Ceph status reports everything as up and OK:
>
> [root@gnosis ~]# ceph status
>   cluster:
>     id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
>     health: HEALTH_OK
>
>   services:
>     mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 2w)
>     mgr: ceph-03(active, since 61s), standbys: ceph-25, ceph-01, ceph-02, ceph-26
>     mds: con-fs2:8 4 up:standby 8 up:active
>     osd: 1284 osds: 1282 up (since 31h), 1282 in (since 33h); 567 remapped pgs
>
>   data:
>     pools:   14 pools, 25065 pgs
>     objects: 2.14G objects, 3.7 PiB
>     usage:   4.7 PiB used, 8.4 PiB / 13 PiB avail
>     pgs:     79908208/18438361040 objects misplaced (0.433%)
>              23063 active+clean
>              1225  active+clean+snaptrim_wait
>              317   active+remapped+backfill_wait
>              250   active+remapped+backfilling
>              208   active+clean+snaptrim
>              2     active+clean+scrubbing+deep
>
>   io:
>     client:   596 MiB/s rd, 717 MiB/s wr, 4.16k op/s rd, 3.04k op/s wr
>     recovery: 8.7 GiB/s, 3.41k objects/s
>
> My first thought was that the status module had failed. However, I cannot restart it (it is an always-on module), and an MGR fail-over did not help.
>
> Any ideas what is going on here?
>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14

--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx