Ceph OSDs suddenly use public network for heartbeat_check

Hi all,

We are running a Nautilus cluster.  Today, due to UPS work, we shut
down the whole cluster.

After starting the cluster back up, many OSDs go down, and they seem
to be doing the heartbeat_check over the public network.  For example,
we see the following logs:

---
2023-05-16 19:35:29.254 7efcd4ce7700 -1 osd.101 42916 heartbeat_check: no reply from 131.174.45.223:6825 osd.185 ever on either front or back, first ping sent 2023-05-16 19:34:48.593701 (oldest deadline 2023-05-16 19:35:08.593701)
---

I would expect the heartbeat to go through the cluster network, i.e.
instead of 131.174.45.223 it should use 172.20.128.223.
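
For what it is worth, this is roughly how I have been checking which
addresses an OSD actually registered (a sketch using the standard ceph
CLI; osd.185 is just the example from the log above, and the `ceph
daemon` command has to be run on the host where that OSD lives):

---
# Addresses osd.185 registered with the monitors, including the
# front (public) and back (cluster) heartbeat addresses.
ceph osd metadata 185 | grep -E 'addr|hostname'

# Effective network settings as seen by the daemon itself
# (run on the OSD host, via the admin socket).
ceph daemon osd.185 config show | grep -E 'cluster_network|public_network'
---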

In fact, when we started the cluster up, we did not have DNS available
to resolve the IP addresses, and for a short while all the OSDs
appeared under a new host called "localhost.localdomain".  At that
point I fixed it by setting the static hostname with `hostnamectl
set-hostname xxx`.
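
For completeness, this is roughly the sequence I used on the affected
hosts (the hostname below is a placeholder, and the restart step is
only my assumption about how an OSD re-registers under the correct
host bucket):

---
# Restore the static hostname that DNS would normally provide.
hostnamectl set-hostname <real-hostname>

# Check whether any OSDs are still parked under the bogus host bucket.
ceph osd tree | grep -A 5 localhost.localdomain

# After fixing the hostname, restarting an OSD should let it
# re-register under the correct host bucket, since
# osd_crush_update_on_start is true by default.
systemctl restart ceph-osd@101
---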

Now we cannot bring the cluster back to a healthy state.  We are stuck
at:

---
  cluster:
    id:     86c9bc85-b7f3-49a1-9e1f-8c9f2b31fca8
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem has a failed mds daemon
            1 filesystem is offline
            insufficient standby MDS daemons available
            pauserd,pausewr,noout,nobackfill,norebalance,norecover flag(s) set
            88 osds down
            Reduced data availability: 2544 pgs inactive, 2369 pgs down, 159 pgs peering, 294 pgs stale
            Degraded data redundancy: 870424/2714593746 objects degraded (0.032%), 30 pgs degraded, 9 pgs undersized
            8631 slow ops, oldest one blocked for 803 sec, mon.ceph-mon01 has slow ops

  services:
    mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03 (age 2h)
    mgr: ceph-mon02(active, since 2h), standbys: ceph-mon01, ceph-mon03
    mds: cephfs:0/1, 1 failed
    osd: 191 osds: 103 up (since 1.46827s), 191 in (since 4w)
         flags pauserd,pausewr,noout,nobackfill,norebalance,norecover

  data:
    pools:   2 pools, 2560 pgs
    objects: 456.13M objects, 657 TiB
    usage:   1.1 PiB used, 638 TiB / 1.7 PiB avail
    pgs:     100.000% pgs not active
             870424/2714593746 objects degraded (0.032%)
             2087 down
             282  stale+down
             129  peering
             32   stale+peering
             30   undersized+degraded+peered
---
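
(Once the OSDs are back up and talking over the cluster network, I
assume the pauserd/pausewr/noout/nobackfill/norebalance/norecover
flags can be cleared with the usual commands, e.g.:)

---
# Clear the shutdown flags once the OSDs are stable again;
# unsetting "pause" clears both pauserd and pausewr.
ceph osd unset pause
ceph osd unset noout
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset norecover
---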

Any idea how we could fix this and get the OSDs to use the cluster
network for heartbeat checks?  Any help would be highly appreciated.
Thank you very much.

Cheers, Hong

-- 
Hurng-Chun (Hong) Lee, PhD
ICT manager

Donders Institute for Brain, Cognition and Behaviour, 
Centre for Cognitive Neuroimaging
Radboud University Nijmegen

e-mail: h.lee@xxxxxxxxxxxxx
tel: +31(0)631132518
web: http://www.ru.nl/donders/
pgp: 3AC505B2B787A8ABE2C551B1362976D838ABF09E

* Mon, Tue and Thu at Trigon; Wed and Fri working from home








