After upgrading my Ceph cluster from Luminous to Nautilus 14.2.6, "ceph health detail" occasionally complains about "Long heartbeat ping times on front/back interface seen". As far as I understand (after reading https://docs.ceph.com/docs/nautilus/rados/operations/monitoring/), this means that a ping from one OSD to another exceeded 1 s.

I have some questions about these network performance checks:

1) What exactly is meant by the front and back interfaces?

2) I can see the involved OSDs only in the output of "ceph health detail" (while the problem is present), but I can't find this information in the log files. In the mon log I can only see messages such as:

    2020-01-28 11:14:07.641 7f618e644700 0 log_channel(cluster) log [WRN] : Health check failed: Long heartbeat ping times on back interface seen, longest is 1416.618 msec (OSD_SLOW_PING_TIME_BACK)

but the involved OSDs are not reported there. Do I just need to increase the verbosity of the mon log?

3) Is 1 s a reasonable value for this threshold? How can it be changed, and which configuration variable is relevant?

4) https://docs.ceph.com/docs/nautilus/rados/operations/monitoring/ suggests using the dump_osd_network command. I think there is an error on that page: it says the command should be issued against ceph-mgr.x.asok, while I believe ceph-osd.x.asok should be used instead.

I also have another Ceph cluster (running Nautilus 14.2.6 as well) where there are no OSD_SLOW_PING_* messages in the mon logs, and yet:

    ceph daemon /var/run/ceph/ceph-osd..asok dump_osd_network 1

reports a lot of entries (i.e. pings that exceeded 1 s). How can this be explained?

(Some sketches of what I currently assume are appended in the P.S. below.)

Thanks, Massimo
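P.S. Regarding question 1: my assumption is that "front" means the OSD's public network (client traffic) and "back" means the cluster network (replication traffic), i.e. the interfaces selected by the standard ceph.conf options below (the subnets are placeholders, not my real ones):

    [global]
    # "front": public network, carries client and heartbeat traffic
    public network = 192.168.1.0/24
    # "back": cluster network, carries replication and heartbeat traffic
    cluster network = 192.168.2.0/24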
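Regarding question 2: if the per-OSD detail cannot be coaxed into the mon log, a crude workaround I am considering (an ad-hoc capture of my own, not a built-in option) is to record the health output periodically:

    # e.g. from cron every few minutes; the log path is arbitrary
    ceph health detail | grep -i ping >> /var/log/ceph/slow-ping-osds.log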
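Regarding question 3: from my reading of the 14.2.5 release notes, the 1 s default comes from mon_warn_on_slow_ping_ratio * osd_heartbeat_grace (0.05 * 20 s = 1000 ms), and mon_warn_on_slow_ping_time, when non-zero, overrides it. If that is right, raising the threshold would look something like this (2000 ms is just an example value):

    # override the computed threshold with a fixed 2000 ms
    ceph config set global mon_warn_on_slow_ping_time 2000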
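Regarding question 4 and the second cluster: this is how I am running the command on the OSD host (the socket name is a placeholder). Could it be that the trailing argument is a threshold in milliseconds rather than seconds? If so, passing 1 would list every ping slower than 1 ms, which would explain the many entries:

    # without an argument, entries above the default threshold
    # (1000 ms, i.e. 1 s) should be reported
    ceph daemon /var/run/ceph/ceph-osd.<id>.asok dump_osd_network
    # with "1", everything above 1 ms would be reported
    ceph daemon /var/run/ceph/ceph-osd.<id>.asok dump_osd_network 1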