constant increase in osdmap epoch

Frank Schilder <frans@xxxxxx> · Mon, 18 Nov 2024 14:55:28 +0000

Hi all,

We observe a problem that has been reported before, but I can't find a resolution for it. It is related to an earlier thread, "failed to load OSD map for epoch 2898146, got 0 bytes" (https://www.spinics.net/lists/ceph-users/msg84485.html).

We run a cluster on the latest Octopus release and observe the osdmap epoch increasing every few seconds. Apart from the epoch number and the modification timestamp, nothing changes between two successive epochs:

# diff map.3075085.txt map.3075086.txt
1c1
< epoch 3075085
---
> epoch 3075086
4c4
< modified 2024-11-18T15:38:45.512100+0100
---
> modified 2024-11-18T15:38:47.858092+0100
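
For anyone who wants to repeat this check over many epochs, here is a minimal Python sketch of the comparison. It assumes the maps were dumped to text as above and that "epoch" and "modified" are the only lines expected to change on an otherwise empty epoch bump. The function names are my own and not part of any Ceph tooling.

```python
import difflib

# Header lines that change on every epoch bump. This list is an assumption
# based on the diff shown above, not an exhaustive rule.
VOLATILE_PREFIXES = ("epoch ", "modified ")


def substantive_lines(dump_text):
    """Return the lines of an osdmap text dump without the volatile headers."""
    return [
        line
        for line in dump_text.splitlines()
        if not line.startswith(VOLATILE_PREFIXES)
    ]


def substantive_diff(old_text, new_text):
    """Unified diff of two dumps, ignoring the epoch and modified lines.

    An empty list means the two epochs differ only in those headers.
    """
    return list(
        difflib.unified_diff(
            substantive_lines(old_text),
            substantive_lines(new_text),
            lineterm="",
        )
    )
```

Reading map.3075085.txt and map.3075086.txt and passing their contents to substantive_diff() should return an empty list if the epochs really are content-identical.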

This is exactly what others have reported too, for example in "steady increasing of osd map epoch since octopus" (https://www.spinics.net/lists/ceph-users/msg69443.html). It's a real problem, since it dramatically shortens the time window an OSD can be down before its latest OSD map is purged from the cluster. This, in turn, leads to serious follow-up problems with OSD restarts, as reported in the thread I referred to at the beginning.
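
To put a rough number on that window, here is a back-of-the-envelope sketch in Python. It takes the two "modified" timestamps from the diff above and multiplies the interval by a number of retained epochs. The value 500 is only an illustrative assumption (I believe it matches the default of mon_min_osdmap_epochs, but please check your own cluster), and actual trimming also depends on cluster state, so treat the result as an order-of-magnitude estimate.

```python
from datetime import datetime

# Timestamp layout as it appears in the "modified" lines of the map dumps.
STAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def epoch_interval_seconds(modified_old, modified_new):
    """Seconds elapsed between the 'modified' stamps of two epochs."""
    old = datetime.strptime(modified_old, STAMP_FORMAT)
    new = datetime.strptime(modified_new, STAMP_FORMAT)
    return (new - old).total_seconds()


def retention_window_seconds(interval_seconds, retained_epochs):
    """Time span covered by the retained epochs at a constant bump rate."""
    return interval_seconds * retained_epochs


# Timestamps taken from the diff of epochs 3075085 and 3075086.
interval = epoch_interval_seconds(
    "2024-11-18T15:38:45.512100+0100",
    "2024-11-18T15:38:47.858092+0100",
)

# ASSUMPTION: number of old epochs the monitors keep; illustrative only.
RETAINED_EPOCHS = 500

window = retention_window_seconds(interval, RETAINED_EPOCHS)
print(f"{interval:.3f} s per epoch, about {window / 60:.0f} minutes of map history")
```

Under these assumptions the retained history covers on the order of 20 minutes, which is very little headroom for an OSD that is down for maintenance.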

Related to that, I also see the mgrs incrementing the pgmap version every 2 seconds. However, I believe this is intentional.

I do not see the redundant setting of pgp_num_actual by the mgrs that is reported here: https://tracker.ceph.com/issues/51433 .

I can't find a resolution anywhere. Any help would be very much appreciated.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14