On 1/08/24 12:56, Mark Kirkwood wrote:
This second post is about the next type of flapping OSDs we encountered
after upgrading. We started to see OSDs going down with this in 'ceph
-w':
2024-08-01 12:02:57.437135 mon.cat-hlz-stor001 [INF] osd.479 marked
down after no beacon for 902.637005 seconds
2024-08-01 12:02:57.468372 mon.cat-hlz-stor001 [WRN] Health check
failed: 1 osds down (OSD_DOWN)
We have the beacon interval set to 300 seconds. To fix this we tried:
- restarting osds
- restarting mons
- ntp tidyup
- restarting mgrs
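For reference, the 902-second figure in the log message is consistent with the mon's default report timeout of 900 seconds. A sketch of the two settings involved, as they might appear in ceph.conf (the values shown are assumptions based on the defaults and the interval mentioned above; check your own cluster's configuration):

```ini
[osd]
# how often each OSD sends a beacon to the mons (seconds)
osd_beacon_report_interval = 300

[mon]
# how long the mons wait without a beacon before marking an OSD down (seconds)
mon_osd_report_timeout = 900
```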
However, it is still happening. Poking around in the osd and mon logs,
we did see some lines hinting that the mon might be listening for
beacons using the v1 messenger - which could be broken (see part 1) -
hence we restarted them again. This had no effect.
Apart from enabling the v2 messenger, we have not altered our Luminous
config for Nautilus. Are we missing something?
Regards
Mark
We noticed that collectd was regularly running 'ceph perf dump' on the
Ceph hosts, and it appears we are seeing a version of this bug:
https://tracker.ceph.com/issues/25211
We've stopped collectd from doing this, and so far there are no flapping
OSDs and no 'no beacon for...' log messages.
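Since the trigger was collectd polling stats too aggressively, an alternative to disabling the collection entirely is to rate-limit it. This is a generic Python sketch of such a throttle (all names here are hypothetical, not part of collectd or Ceph); the wrapped function would be whatever issues the stats query:

```python
import time


class Throttle:
    """Call-through wrapper that enforces a minimum interval between
    invocations of an expensive function, returning the cached result
    when called again too soon."""

    def __init__(self, func, min_interval_s, clock=time.monotonic):
        self.func = func
        self.min_interval_s = min_interval_s
        self.clock = clock          # injectable for testing
        self._last_ts = None        # time of last real call
        self._cached = None         # result of last real call

    def __call__(self, *args, **kwargs):
        now = self.clock()
        if self._last_ts is None or now - self._last_ts >= self.min_interval_s:
            # enough time has passed: do the real (expensive) call
            self._cached = self.func(*args, **kwargs)
            self._last_ts = now
        return self._cached
```

A collector would wrap its query once, e.g. `poll = Throttle(query_osd_stats, 60)`, and then call `poll()` as often as it likes; the underlying query runs at most once per minute.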
Looking at the Nautilus branch code, it appears to already contain the
fix for issue 25211, but we'll dig a bit more.
Regards
Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx