Update: I did a systematic fail of all MDSes that didn't report, starting with the stand-by daemons and continuing from high to low ranks. One by one they started showing up again with version and stats, and the fail went as usual, with one exception: rank 0. The moment I failed rank 0, it took 5 other MDSes down with it. This is, in fact, the second time I have seen such an event, where failing one MDS crashes others. Given the weird observation in my previous e-mail, together with what I saw when restarting everything, does this indicate a problem with data integrity, or is this an annoying yet harmless bug?

Thanks for any help!

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Sunday, September 10, 2023 12:39 AM
To: ceph-users@xxxxxxx
Subject: MDS daemons don't report any more

Hi all,

I am making a weird observation: 8 out of 12 MDS daemons no longer seem to report to the cluster:

# ceph fs status
con-fs2 - 1625 clients
=======
RANK  STATE    MDS     ACTIVITY      DNS    INOS
 0    active  ceph-16  Reqs:    0 /s     0      0
 1    active  ceph-09  Reqs:  128 /s  4251k  4250k
 2    active  ceph-17  Reqs:    0 /s     0      0
 3    active  ceph-15  Reqs:    0 /s     0      0
 4    active  ceph-24  Reqs:  269 /s  3567k  3567k
 5    active  ceph-11  Reqs:    0 /s     0      0
 6    active  ceph-14  Reqs:    0 /s     0      0
 7    active  ceph-23  Reqs:    0 /s     0      0
        POOL            TYPE      USED   AVAIL
    con-fs2-meta1      metadata   2169G  7081G
    con-fs2-meta2        data        0   7081G
    con-fs2-data         data     1248T  4441T
con-fs2-data-ec-ssd      data      705G  22.1T
    con-fs2-data2        data     3172T  4037T
STANDBY MDS
  ceph-08
  ceph-10
  ceph-12
  ceph-13
                                    VERSION                                        DAEMONS
                                      None                                         ceph-16, ceph-17, ceph-15, ceph-11, ceph-14, ceph-23, ceph-10, ceph-12
ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)   ceph-09, ceph-24, ceph-08, ceph-13

The version is "None" for these daemons and there are no stats. "ceph versions" reports only 4 of the 12 MDSes; the other 8 are not shown at all:

[root@gnosis ~]# ceph versions
{
    "mon": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "mgr": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "osd": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1282
    },
    "mds": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 4
    },
    "overall": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1296
    }
}

"ceph status" reports everything as up and OK:

[root@gnosis ~]# ceph status
  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 2w)
    mgr: ceph-03(active, since 61s), standbys: ceph-25, ceph-01, ceph-02, ceph-26
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1284 osds: 1282 up (since 31h), 1282 in (since 33h); 567 remapped pgs

  data:
    pools:   14 pools, 25065 pgs
    objects: 2.14G objects, 3.7 PiB
    usage:   4.7 PiB used, 8.4 PiB / 13 PiB avail
    pgs:     79908208/18438361040 objects misplaced (0.433%)
             23063 active+clean
             1225  active+clean+snaptrim_wait
             317   active+remapped+backfill_wait
             250   active+remapped+backfilling
             208   active+clean+snaptrim
             2     active+clean+scrubbing+deep

  io:
    client:   596 MiB/s rd, 717 MiB/s wr, 4.16k op/s rd, 3.04k op/s wr
    recovery: 8.7 GiB/s, 3.41k objects/s

My first thought was that the status module had failed. However, I cannot restart it, because it is an always-on module, and an MGR fail-over did not help either. Any ideas what is going on here?
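For completeness, what I tried was roughly the following (the exact invocations may have differed slightly, and <active mgr> stands for whichever MGR was active at the time). The status module shows up under "always_on_modules" in "ceph mgr module ls", so disabling and re-enabling it to force a restart is refused, and the MGR fail-over did not bring the missing versions and stats back either:

[root@gnosis ~]# ceph mgr module ls
[root@gnosis ~]# ceph mgr module disable status
[root@gnosis ~]# ceph mgr fail <active mgr>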
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx