Hi all,

on ceph fs kernel clients we see a lot of messages like these in bursts:

...
[Mon Dec 19 09:43:15 2022] libceph: osd1258 weight 0x10000 (in)
[Mon Dec 19 09:43:15 2022] libceph: osd1258 up
[Mon Dec 19 09:43:15 2022] libceph: osd1259 weight 0x10000 (in)
[Mon Dec 19 09:43:15 2022] libceph: osd1259 up
[Mon Dec 19 09:43:16 2022] libceph: osd0 down
[Mon Dec 19 09:43:16 2022] libceph: osd0 up
[Mon Dec 19 09:43:16 2022] libceph: osd0 down
[Mon Dec 19 09:43:16 2022] libceph: osd0 up
...

However, no OSDs actually went up or down; there have been no OSD flaps since Wednesday last week. What is libceph actually reporting here?

The cluster has been rebalancing since last Wednesday, when we added new disks. No daemon has been down since then:

# ceph status
  cluster:
    id:     ###
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 12d)
    mgr: ceph-25(active, since 11w), standbys: ceph-03, ceph-02, ceph-01, ceph-26
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1260 up (since 6d), 1260 in (since 6d); 2342 remapped pgs

  task status:

  data:
    pools:   14 pools, 25065 pgs
    objects: 1.53G objects, 2.8 PiB
    usage:   3.4 PiB used, 9.7 PiB / 13 PiB avail
    pgs:     1158282360/13309135349 objects misplaced (8.703%)
             22704 active+clean
             2261  active+remapped+backfill_wait
             81    active+remapped+backfilling
             16    active+clean+snaptrim
             3     active+clean+scrubbing+deep

  io:
    client:   120 MiB/s rd, 175 MiB/s wr, 1.46k op/s rd, 2.23k op/s wr
    recovery: 2.3 GiB/s, 850 objects/s

We are investigating why some jobs on our HPC cluster get stuck after the job ends. These messages look somewhat suspicious and we wonder whether they have anything to do with the ceph client/fs. The cluster has been healthy the whole time.

Best regards and thanks for pointers!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
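P.S. In case it helps with answering: what we plan to check next is the osdmap epoch each client is at versus the cluster's current epoch, plus any stuck in-flight requests on the clients. This is only a rough sketch; it assumes the standard kernel-client debugfs layout under /sys/kernel/debug/ceph/<fsid>.client<id>/ (debugfs must be mounted, and the exact file contents vary a bit between kernel versions):

# on a kernel client
for d in /sys/kernel/debug/ceph/*.client*; do
    echo "== $d =="
    head -n 1 "$d/osdmap"      # osdmap epoch this client has processed
    echo "--- in-flight OSD requests (osdc) ---"
    cat "$d/osdc"              # entries that never clear would point at hung OSD I/O
    echo "--- in-flight MDS requests (mdsc) ---"
    cat "$d/mdsc"
done

# on an admin node, for comparison
ceph osd dump | head -n 1      # first line shows the cluster's current osdmap epoch

If the clients lag far behind on the epoch or show old entries in osdc/mdsc, that would at least tell us whether the stuck jobs are waiting on ceph.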