Quoting Massimo Sgaravatto (massimo.sgaravatto@xxxxxxxxx):

> After having upgraded my ceph cluster from Luminous to Nautilus 14.2.6,
> from time to time "ceph health detail" reports "Long heartbeat ping
> times on front/back interface seen".
>
> As far as I can understand (after having read
> https://docs.ceph.com/docs/nautilus/rados/operations/monitoring/), this
> means that the ping from one OSD to another one exceeded 1 s.
>
> I have some questions on these network performance checks.
>
> 1) What is meant exactly by front and back interface?

Do you have a "public" and a "cluster" network? I would expect the
"back" interface to be a "cluster" network interface.

> 2) I can see the involved OSDs only in the output of "ceph health detail"
> (when there is the problem), but I can't find this information in the
> log files. In the mon log file I can only see messages such as:
>
> 2020-01-28 11:14:07.641 7f618e644700  0 log_channel(cluster) log [WRN] :
> Health check failed: Long heartbeat ping times on back interface seen,
> longest is 1416.618 msec (OSD_SLOW_PING_TIME_BACK)
>
> but the involved OSDs are not reported in this log.
> Do I just need to increase the verbosity of the mon log?
>
> 3) Is 1 s a reasonable value for this threshold? How could this value be
> changed? What is the relevant configuration variable?

Not sure how much priority Ceph gives to this ping check. But if you're
on a 10 Gb/s network I would start complaining when things take longer
than 1 ms ... a ping should not take much longer than 0.05 ms, so if it
takes an order of magnitude longer than expected, latency is not optimal.

For Gigabit networks I would bump the above values by an order of
magnitude.

Gr. Stefan

--
| BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                    +31 318 648 688 / info@xxxxxx
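
On question 1: OSD heartbeats are sent over both networks when both are
defined, so the "front" pings travel over the public network and the
"back" pings over the cluster network; with only a public network
configured, both should end up on the same interface. A minimal sketch,
with made-up example subnets:

    [global]
        # "front" heartbeat pings use the public network
        public_network  = 192.168.10.0/24
        # "back" heartbeat pings use the cluster network (if defined)
        cluster_network = 192.168.20.0/24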
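
On question 2: raising the mon log verbosity probably won't help much;
if I remember the Nautilus monitoring page correctly, the per-OSD
detail is exposed through the dump_osd_network admin socket command
instead. The argument is a threshold in milliseconds and 0 should list
every recorded entry; the daemon names below are only examples:

    # ask the active mgr for all OSD pairs with recorded ping times
    ceph daemon mgr.<id> dump_osd_network 0

    # or ask a single OSD about its own peers
    ceph daemon osd.<nnn> dump_osd_network 0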
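
On question 3: as far as I can tell the 1 s default is not a standalone
option but osd_heartbeat_grace (20 s) multiplied by
mon_warn_on_slow_ping_ratio (0.05), and setting
mon_warn_on_slow_ping_time (in milliseconds) to a non-zero value
overrides the computed threshold. A rough, untested sketch with example
values:

    # warn at 100 ms instead of the computed 1000 ms
    ceph config set global mon_warn_on_slow_ping_time 100

    # or keep the ratio-based threshold but tighten it to 1% of the grace time
    ceph config set global mon_warn_on_slow_ping_ratio 0.01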