Re: Network performance checks

Massimo Sgaravatto <massimo.sgaravatto@xxxxxxxxx> · Thu, 30 Jan 2020 09:42:27 +0100

Thanks for your answer.

MON-MGR hosts have a mgmt network and a public network.
The OSD nodes instead have a mgmt network, a public network, and a cluster
network.
This is what I have in ceph.conf:

public network = 192.168.61.0/24
cluster network = 192.168.222.0/24

public and cluster networks are 10 Gbps networks (actually there is a
single 10 Gbps NIC on each node used for both the public and the cluster
networks).
The mgmt network is a 1 Gbps network, but this one shouldn't be used for
such pings among the OSDs ...
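
To make the mapping concrete, here is a minimal sketch (Python standard library only) that reports which of the two networks from ceph.conf an address falls into. It assumes, as suggested in the quoted reply below, that "front" corresponds to the public network and "back" to the cluster network. The function name and the sample addresses are made up for illustration. The mgmt subnet is not listed here, so anything outside the two ranges is simply reported as "other".

```python
import ipaddress

# Values copied from the ceph.conf excerpt in this message.
PUBLIC_NETWORK = ipaddress.ip_network("192.168.61.0/24")
CLUSTER_NETWORK = ipaddress.ip_network("192.168.222.0/24")


def classify(address: str) -> str:
    """Return which configured Ceph network an address belongs to.

    Assumption: "front" maps to the public network and "back" to the
    cluster network.
    """
    ip = ipaddress.ip_address(address)
    if ip in PUBLIC_NETWORK:
        return "public (front)"
    if ip in CLUSTER_NETWORK:
        return "cluster (back)"
    return "other"


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    for addr in ("192.168.61.10", "192.168.222.10", "10.0.0.5"):
        print(addr, "->", classify(addr))
```

Running it against the addresses an OSD is actually bound to would show whether any heartbeat traffic could end up outside the two 10 Gbps networks.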

Cheers, Massimo

On Thu, Jan 30, 2020 at 9:26 AM Stefan Kooman <stefan@xxxxxx> wrote:

> Quoting Massimo Sgaravatto (massimo.sgaravatto@xxxxxxxxx):
> > After having upgraded my ceph cluster from Luminous to Nautilus 14.2.6,
> > from time to time "ceph health detail" complains about "Long heartbeat
> > ping times on front/back interface seen".
> >
> > As far as I can understand (after having read
> > https://docs.ceph.com/docs/nautilus/rados/operations/monitoring/), this
> > means that the ping from one OSD to another exceeded 1 s.
> >
> > I have some questions on these network performance checks:
> >
> > 1) What exactly is meant by front and back interface?
>
> Do you have a "public" and a "cluster" network? I would expect that the
> "back" interface is a "cluster" network interface.
>
> > 2) I can see the involved OSDs only in the output of "ceph health detail"
> > (when there is the problem) but I can't find this information in the log
> > files. In the mon log file I can only see messages such as:
> >
> >
> > 2020-01-28 11:14:07.641 7f618e644700  0 log_channel(cluster) log [WRN] :
> > Health check failed: Long heartbeat ping times on back interface seen,
> > longest is 1416.618 msec (OSD_SLOW_PING_TIME_BACK)
> >
> > but the involved OSDs are not reported in this log.
> > Do I just need to increase the verbosity of the mon log?
> >
> > 3) Is 1 s a reasonable value for this threshold? How could this value be
> > changed? What is the relevant configuration variable?
>
> Not sure how much priority Ceph gives to this ping check. But if you're
> on a 10 Gb/s network I would start complaining when things take longer
> than 1 ms ... a ping should not take much longer than 0.05 ms, so if it
> takes an order of magnitude longer than expected, latency is not
> optimal.
>
> For Gigabit networks I would bump above values by an order of magnitude.
>
> Gr. Stefan
>
> --
> | BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
> | GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx
>