MONs unresponsive for excessive amount of time

Frank Schilder <frans@xxxxxx> · Wed, 18 Nov 2020 14:53:10 +0000

Hi all,

one of our MONs was down for maintenance for ca. 45 minutes. After this time I started it up again and it joined the cluster.

Unfortunately, things did not go as expected. The MON sub-cluster became unresponsive for a bit more than 10 minutes. Admin commands would hang, even if issued directly to a specific monitor via "ceph tell mon.xxx". In addition, our MDS lost connection to the MONs and reported a laggy connection. Consequently, all ceph fs access was frozen for a bit more than 10 minutes as well.

From the little I could get out with "ceph daemon mon.xxx mon_status", I could see that the restarted MON was in state "synchronizing" (or similar; this is from memory), while the other MONs were in quorum.
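For anyone who wants to record this state during a restart instead of relying on memory, here is a minimal sketch in Python. It assumes it runs on the host of the monitor in question (the admin socket is local) and that the JSON printed by "mon_status" contains the "name", "state" and "quorum" fields. The helper names summarize_mon_status and query_mon_status are made up for illustration and are not part of any Ceph tooling.

```python
import json
import subprocess


def summarize_mon_status(raw):
    """Extract a few fields from the JSON text printed by mon_status.

    The keys "name", "state" and "quorum" are assumed to be present;
    missing keys fall back to None or an empty list.
    """
    status = json.loads(raw)
    return {
        "name": status.get("name"),
        "state": status.get("state"),
        "quorum": status.get("quorum", []),
    }


def query_mon_status(mon_id, timeout=10):
    """Run `ceph daemon mon.<id> mon_status` locally and summarize it.

    Hypothetical helper: must run on the monitor's own host, and raises
    if the command fails or does not return within `timeout` seconds.
    """
    result = subprocess.run(
        ["ceph", "daemon", f"mon.{mon_id}", "mon_status"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return summarize_mon_status(result.stdout)
```

Calling query_mon_status("xxx") in a loop with a timestamp would show how long the restarted MON stays in "synchronizing" and when it appears in the quorum list again.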

Our cluster is mimic-12.2.8. This observation does not fit the intended HA of the MON cluster; there should not be any stall at all.

My questions: Why do the MONs become unresponsive for such a long time? What are the MONs doing during this time frame? Are there any config options I should look at? Are there any log messages I should hunt for?

Any hint is appreciated.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx