Hi all,

We are running a Ceph Nautilus cluster. Today, due to UPS work, we shut down the whole cluster. After we started it again, many OSDs went down, and they appear to be doing their heartbeat_check over the public network. For example, we see the following logs:

---
2023-05-16 19:35:29.254 7efcd4ce7700 -1 osd.101 42916 heartbeat_check: no reply from 131.174.45.223:6825 osd.185 ever on either front or back, first ping sent 2023-05-16 19:34:48.593701 (oldest deadline 2023-05-16 19:35:08.593701)
---

I expected the heartbeat to go through the cluster network, i.e. instead of 131.174.45.223 it should use 172.20.128.223.

One complication: when we started up the cluster, we had no DNS available to resolve the IP addresses, and for a short while all OSDs were located under a new host called "localhost.localdomain". At that point, I fixed it by setting the static hostname with `hostnamectl set-hostname xxx`.

Now we cannot bring the cluster back to a healthy state. We are stuck at:

---
  cluster:
    id:     86c9bc85-b7f3-49a1-9e1f-8c9f2b31fca8
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem has a failed mds daemon
            1 filesystem is offline
            insufficient standby MDS daemons available
            pauserd,pausewr,noout,nobackfill,norebalance,norecover flag(s) set
            88 osds down
            Reduced data availability: 2544 pgs inactive, 2369 pgs down, 159 pgs peering, 294 pgs stale
            Degraded data redundancy: 870424/2714593746 objects degraded (0.032%), 30 pgs degraded, 9 pgs undersized
            8631 slow ops, oldest one blocked for 803 sec, mon.ceph-mon01 has slow ops

  services:
    mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03 (age 2h)
    mgr: ceph-mon02(active, since 2h), standbys: ceph-mon01, ceph-mon03
    mds: cephfs:0/1, 1 failed
    osd: 191 osds: 103 up (since 1.46827s), 191 in (since 4w)
         flags pauserd,pausewr,noout,nobackfill,norebalance,norecover

  data:
    pools:   2 pools, 2560 pgs
    objects: 456.13M objects, 657 TiB
    usage:   1.1 PiB used, 638 TiB / 1.7 PiB avail
    pgs:     100.000% pgs not active
             870424/2714593746 objects degraded (0.032%)
             2087 down
             282  stale+down
             129  peering
             32   stale+peering
             30   undersized+degraded+peered
---

Any idea how we could fix this and get the OSDs to use the cluster network for their heartbeat checks? Any help would be highly appreciated. Thank you very much.

Cheers, Hong

--
Hurng-Chun (Hong) Lee, PhD
ICT manager

Donders Institute for Brain, Cognition and Behaviour
Centre for Cognitive Neuroimaging
Radboud University Nijmegen

e-mail: h.lee@xxxxxxxxxxxxx
tel: +31(0)631132518
web: http://www.ru.nl/donders/
pgp: 3AC505B2B787A8ABE2C551B1362976D838ABF09E
* Mon, Tue and Thu at Trigon; Wed and Fri working from home
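P.S. A few concrete details in case they help with suggestions. This is how I have been checking which networks the OSDs are configured with and where they actually bound their addresses (a sketch; the /24 masks are my assumption, and osd.101 is simply the OSD from the log above):

---
# What the cluster thinks the two networks are
ceph config get osd public_network     # expect 131.174.45.0/24 (mask assumed)
ceph config get osd cluster_network    # expect 172.20.128.0/24 (mask assumed)

# Where one OSD actually bound its front (public) and back (cluster)
# addresses, including the heartbeat ones, and the hostname it registered
ceph osd metadata 101 | grep -E '"(hb_)?(front|back)_addr"|"hostname"'
---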
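To take DNS out of the startup path entirely, I am also considering pinning the names locally on each OSD host; a hypothetical sketch (the hostnames below are placeholders, not our real entries):

---
# /etc/hosts on each OSD node, one pair of entries per host
172.20.128.223  osd-host-xxx.cluster   # cluster-network address (placeholder name)
131.174.45.223  osd-host-xxx           # public-network address (placeholder name)
---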
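And once the OSDs manage to peer again, my understanding is that the flags shown in the status above would need to be cleared roughly in this order (`pause` covers both pauserd and pausewr):

---
ceph osd unset pause         # clears pauserd and pausewr together
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset noout         # last, once the OSDs stay up
---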