Hi Eugen, thanks! I think this explains our observation. Thanks and merry Christmas!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: 21 December 2022 14:03:06
To: ceph-users@xxxxxxx
Subject: Re: libceph: osdXXX up/down all the time

Hi Frank,

I asked the same question 4 years ago [1]. Basically, Greg's response was:

> So, this is actually just noisy logging from the client processing
> an OSDMap. That should probably be turned down, as it's not really
> an indicator of...anything...as far as I can tell.

IIRC, clients sometimes notice changes in the osdmap with some delay (if they didn't need to update the osdmap earlier), so I just ignore these messages if the cluster is otherwise healthy and the clients work as expected. My conclusion back then is in [2].

[1] https://www.spinics.net/lists/ceph-users/msg47279.html
[2] https://www.spinics.net/lists/ceph-users/msg47502.html

Quoting Frank Schilder <frans@xxxxxx>:

> Hi all,
>
> on ceph fs kernel clients we see a lot of messages of this kind, in bursts:
>
> ...
> [Mon Dec 19 09:43:15 2022] libceph: osd1258 weight 0x10000 (in)
> [Mon Dec 19 09:43:15 2022] libceph: osd1258 up
> [Mon Dec 19 09:43:15 2022] libceph: osd1259 weight 0x10000 (in)
> [Mon Dec 19 09:43:15 2022] libceph: osd1259 up
> [Mon Dec 19 09:43:16 2022] libceph: osd0 down
> [Mon Dec 19 09:43:16 2022] libceph: osd0 up
> [Mon Dec 19 09:43:16 2022] libceph: osd0 down
> [Mon Dec 19 09:43:16 2022] libceph: osd0 up
> ...
>
> However, no OSDs have actually gone up or down since Wednesday last
> week. What is libceph actually reporting here?
>
> The cluster has been rebalancing since last Wednesday, when we added
> new disks.
> There has not been any daemon down since then:
>
> # ceph status
>   cluster:
>     id:     ###
>     health: HEALTH_OK
>
>   services:
>     mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 12d)
>     mgr: ceph-25(active, since 11w), standbys: ceph-03, ceph-02,
>          ceph-01, ceph-26
>     mds: con-fs2:8 4 up:standby 8 up:active
>     osd: 1260 osds: 1260 up (since 6d), 1260 in (since 6d); 2342 remapped pgs
>
>   task status:
>
>   data:
>     pools:   14 pools, 25065 pgs
>     objects: 1.53G objects, 2.8 PiB
>     usage:   3.4 PiB used, 9.7 PiB / 13 PiB avail
>     pgs:     1158282360/13309135349 objects misplaced (8.703%)
>              22704 active+clean
>              2261  active+remapped+backfill_wait
>              81    active+remapped+backfilling
>              16    active+clean+snaptrim
>              3     active+clean+scrubbing+deep
>
>   io:
>     client:   120 MiB/s rd, 175 MiB/s wr, 1.46k op/s rd, 2.23k op/s wr
>     recovery: 2.3 GiB/s, 850 objects/s
>
> We are investigating why some jobs on our HPC cluster get stuck after
> the job ends. These messages look somewhat suspicious, and we wonder
> whether they have anything to do with the ceph client/fs.
>
> The cluster has been healthy the whole time.
>
> Best regards and thanks for any pointers!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
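[Editor's note: when triaging bursts like the ones quoted above, it can help to summarize how often each OSD toggled state in the client's kernel log before concluding nothing actually flapped. A minimal sketch: on a real client you would pipe `dmesg | grep libceph:` into the awk filter; here the sample lines from the thread are inlined so the snippet is self-contained.]

```shell
# Count libceph up/down transitions per OSD as seen by this client.
# The printf block stands in for `dmesg | grep 'libceph:'` output.
printf '%s\n' \
  '[Mon Dec 19 09:43:16 2022] libceph: osd0 down' \
  '[Mon Dec 19 09:43:16 2022] libceph: osd0 up' \
  '[Mon Dec 19 09:43:16 2022] libceph: osd0 down' \
  '[Mon Dec 19 09:43:16 2022] libceph: osd0 up' |
awk '/libceph: osd[0-9]+ (up|down)$/ {
       # key on "osdN up" / "osdN down" and tally occurrences
       n[$(NF-1) " " $NF]++
     }
     END { for (k in n) print k, n[k] }' | sort
```

If the counts here are large while `ceph osd dump` shows no recent `down_at` epochs for those OSDs, that supports the explanation above: the messages are client-side osdmap processing noise, not real daemon flaps.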