Re: how to judge the results? - rados bench comparison

Quoting Lars Täuber (taeuber@xxxxxxx):
> > > This is something I was told to do, because a reconstruction of failed
> > > OSDs/disks would have a heavy impact on the backend network.
> > 
> > Opinions vary on running "public" only versus "public" / "backend".
> > Having a separate "backend" network might lead to difficult-to-debug
> > issues when the "public" network is working fine, but the "backend" is
> > having issues and OSDs can't peer with each other, while the clients can
> > talk to all OSDs. You will get slow requests and OSDs marking each other
> > down while they are still running, etc.
> 
> This I was not aware of.

It's real. I've been bitten by this several times in a PoC cluster while
playing around with networking ... make sure you have proper monitoring checks on
all network interfaces when running this setup.
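
For reference, the "public" / "backend" split being discussed usually
comes down to two settings in ceph.conf; the subnets below are made-up
placeholders, not values from this thread:

    [global]
    # client / monitor traffic
    public_network  = 192.168.10.0/24
    # OSD replication and recovery ("backend") traffic
    cluster_network = 192.168.20.0/24

Leaving out cluster_network (or pointing both at the same subnet) gives
you the "public only" variant, which avoids the failure mode above.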

> > In your case with only 6 spinners max per server there is no way you
> > will ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
> > (for large spinners) should be just enough to fill a 10 Gb/s link. A
> > redundant 25 Gb/s link would provide 50 Gb/s of bandwidth, enough for
> > both OSD replication traffic and client IO.
> 
> The reason for choosing the 25 GBit network was a remark from someone
> that the latency of this Ethernet is way below that of 10 GBit. I never
> double-checked this.

This is probably true. 25 Gb/s is a single lane (SerDes), which is also
used in 50 Gb/s / 100 Gb/s / 200 Gb/s connections. It operates at ~ 2.5
times the clock rate of 10 Gb/s / 40 Gb/s. But for clients to fully
benefit from this lower latency, they should be on 25 Gb/s as well. If
you can afford to redesign your cluster (and low latency is important),
it might be worth it. Then again ... the latency your spinners introduce
is a few orders of magnitude higher than the network latency ... I would
then (also) invest in NVMe drives for (at least) metadata ... and switch
to 3x replication ... but that might be too much to ask for.
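
To put rough numbers on the latency argument (the figures below are
ballpark assumptions, not measurements from this cluster or thread):

    # Back-of-envelope: HDD access latency vs. Ethernet hop latency.
    # All figures are assumed, typical values.
    hdd_access_s = 8e-3    # ~8 ms seek + rotational latency on a 7.2k spinner
    rtt_10g_s    = 50e-6   # ~tens of microseconds per hop on 10 GbE
    rtt_25g_s    = 30e-6   # somewhat lower on 25 GbE

    print(f"HDD vs 10 GbE: ~{hdd_access_s / rtt_10g_s:.0f}x slower")   # ~160x
    print(f"HDD vs 25 GbE: ~{hdd_access_s / rtt_25g_s:.0f}x slower")   # ~267x

Whatever you win on the wire disappears behind the seek time of the
spinners, which is why NVMe (at least for metadata) pays off first.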

TL;DR: when designing clusters, try to think about the "weakest" link
(bottleneck) ... most probably this will be disk speed / Ceph overhead.
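
As a quick sanity check on the numbers quoted above (a sketch using the
figures from this thread; adjust for your own hardware):

    # Rough per-node bottleneck estimate: disk throughput vs. network capacity.
    spinners_per_node = 6
    mb_s_per_spinner  = 250                                       # optimistic sequential rate
    disk_gbit = spinners_per_node * mb_s_per_spinner * 8 / 1000   # = 12 Gb/s

    net_gbit = 2 * 25                                             # redundant 25 Gb/s links

    print(f"disks: {disk_gbit:.0f} Gb/s, network: {net_gbit} Gb/s")
    print("bottleneck:", "disks" if disk_gbit < net_gbit else "network")

The spinners top out around 12 Gb/s while the network offers ~50 Gb/s,
so the disks (plus Ceph overhead on top of them) are the weakest link.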

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



