Hi,

This type of incident is often resolved by setting the public_network
option at the "global" scope in the configuration:

ceph config set global public_network a:b:c:d::/64

A quick way to verify what the daemons actually picked up is sketched
below, after the quoted thread.

On Fri, Jun 21, 2024 at 03:36, Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> this is only a theory, not a proven answer or anything, but the
> orchestrator does automatically reconfigure daemons depending on the
> circumstances. So my theory is that some of the OSD nodes no longer
> responded via the public network, and Ceph tried to use the cluster
> network as a fallback. The other way around is more common: if you
> don't have a cluster network configured at all, you see logs stating
> "falling back to public interface" (or similar). If the orchestrator
> did reconfigure the daemons, it would have been logged in the active
> mgr, and the result would be a different ceph.conf for the daemons in
> /var/lib/ceph/{FSID}/osd.{OSD_ID}/config. If you still have the mgr
> logs from after the outage, you might find some clues.
>
> Regards,
> Eugen
>
> Quoting mailing-lists <mailing-lists@xxxxxxxxx>:
>
> > Dear Cephers,
> >
> > after a succession of unfortunate events, we suffered a complete
> > datacenter blackout today.
> >
> > Ceph came back up _nearly_ perfectly. The health was OK and all
> > services were online, but we were having weird problems. Weird as
> > in: we could sometimes map RBDs and sometimes not, and sometimes we
> > could use CephFS and sometimes we could not...
> >
> > Turns out, some OSDs (I'd say 5%) came back with the cluster IP
> > address as their public IP and thus were not reachable.
> >
> > I do not see any pattern in why some OSDs are faulty and others
> > are not; the fault is spread over nearly all nodes. This is an
> > example:
> >
> > osd.45 up in weight 1 up_from 184143 up_thru 184164 down_at
> > 184142 last_clean_interval [182655,184103)
> > [v2:192.168.222.20:6834/1536394698,v1:192.168.222.20:6842/1536394698]
> > [v2:192.168.222.20:6848/1536394698,v1:192.168.222.20:6853/1536394698]
> > exists,up 002326c9
> >
> > The first pair of brackets [] should contain a public IP. Our
> > cluster network is 192.168.222.0/24, which is of course only
> > available on the Ceph-internal switch.
> >
> > Simply restarting the affected OSDs solved the problem, so I am not
> > really asking for help troubleshooting this; I would just like to
> > understand whether there is a reasonable explanation.
> >
> > My guess would be some kind of race condition when the interfaces
> > came up, but then again, why on only ~5% of all OSDs? ... Anyway,
> > I'm tired; I hope this mail is somewhat understandable.
> >
> > We are running Ceph 17.2.7, deployed with cephadm on Docker.
> >
> > If you have any ideas about the cause of this, please let me know.
> > I have not seen this issue when gracefully rebooting the nodes.
> >
> > Best
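
For anyone checking this after a similar outage, a minimal sketch of
how one might compare the cluster-wide setting with what a given
containerized OSD was actually started with. This assumes a
cephadm-managed cluster like the one described above; the {FSID} and
{OSD_ID} placeholders are the same ones Eugen mentions and need to be
filled in for your deployment:

# Cluster-wide option as stored in the config database
# (the "global" scope referred to above also shows up here):
ceph config get osd public_network
ceph config dump | grep -E 'public_network|cluster_network'

# Per-daemon config the orchestrator rendered for one OSD,
# which is what the daemon actually read at startup:
cat /var/lib/ceph/{FSID}/osd.{OSD_ID}/config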
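
And a rough way to spot OSDs that registered a cluster-network address
as their public address, based on the osd dump line quoted above: the
public address is the first bracketed field on each "osd.N" line, so
an awk filter on that column works if your output has the same layout.
The 192.168.222 prefix and osd.45 are taken from the original post,
so adjust both for your own cluster:

# OSDs whose advertised public address falls inside the cluster network
# ($14 is the first [v2:...,v1:...] block in the example output above):
ceph osd dump | awk '$1 ~ /^osd\./ && $14 ~ /192\.168\.222\./ {print $1, $14}'

# Restart an affected OSD through the orchestrator, e.g. osd.45:
ceph orch daemon restart osd.45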