Re: CephFS in the wild

On Thu, 2 Jun 2016 21:13:41 -0500 Brady Deetz wrote:

> On Thu, Jun 2, 2016 at 8:58 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> > On Thu, 2 Jun 2016 11:11:19 -0500 Brady Deetz wrote:
> >
> > > On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer <chibi@xxxxxxx>
> > > wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:
> > > >
[snip]
> > > > > Planned Architecture:
> > > > >
> > > > Well, we talked about this 2 months ago, you seem to have changed
> > > > only a few things.
> > > > So let's dissect this again...
> > > >
> > > > > Storage Interconnect:
> > > > > Brocade VDX 6940 (40 gig)
> > > > >
> > > > Is this a flat (single) network for all the storage nodes?
> > > > And then from these 40Gb/s switches links to the access switches?
> > > >
> > >
> > > This will start as a single 40Gb/s switch with a single link to each
> > > node (upgraded in the future to dual-switch + dual-link). The 40Gb/s
> > > switch will also be connected to several 10Gb/s and 1Gb/s access
> > > switches with dual 40Gb/s uplinks.
> > >
> > So initially 80Gb/s, and with the 2nd switch probably 160Gb/s, for your
> > clients.
> > Network-wise, your 8 storage servers outstrip that; actual storage
> > bandwidth and IOPS wise, you're looking at 8x2GB/s aka 160Gb/s best-case
> > writes, so a match.
> >
> > > We do intend to segment the public and private networks using VLANs
> > > untagged at the node. There are obviously many subnets on our
> > > network. The 40Gb/s switch will handle routing for those networks.
> > >
> > > You can see the list discussion in "Public and Private network over 1
> > > interface", May 23, 2016, regarding some of this.
> > >
> > And I did comment in that thread, the final one actually. ^o^
> >
> > Unless you can come up with a _very_ good reason not covered in that
> > thread, I'd keep it to one network.
> >
> > Once the 2nd switch is in place and running vLAG (LACP on your servers)
> > your network bandwidth per host VASTLY exceeds that of your storage.
> >
> >
> My theory is that with a single switch, I can QoS traffic for the private
> network in case we see massive client I/O at the same time that a
> re-weight or something like that is happening.
> But... I think you're right. KISS
> 
Let's run with this example:

1. You just lost an NVMe (cosmic rays, dontcha know) and 12 of your OSDs
are toast.

2. Ceph does its thing and kicks off all that recovery and backfill magic
(terror would be a better term).

3. Your clients at this point also would like to read (just reads for
simplicity, R/W would of course make it worse) at the max speed of your
initial network layout, that is 8GB/s.

4. As stated, your nodes can't write more than 2GB/s, which in turn also
means that recovery/backfill traffic from another node (reads) can't
exceed this value (wrongly assuming an equal distribution of activity
per node, but this will be correct cluster-wide).
That leaves 2GB/s per node (or 16GB/s total) of read bandwidth.
So from a network perspective you should have no need for QoS at all, ever.
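
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python; the 10-bits-per-byte conversion (to loosely cover protocol
overhead) and the per-node figures are my assumptions, picked to match
the figures used above:

# Rough recovery-scenario arithmetic for the layout discussed above.
# Assumptions: 8 storage nodes, ~2 GB/s usable storage bandwidth per
# node, dual 40Gb/s uplinks towards the clients, and ~10 bits per byte
# to account for protocol overhead.
NODES = 8
NODE_STORAGE_GBS = 2            # GB/s per node, best case
CLIENT_UPLINK_GBIT = 2 * 40     # initial layout: dual 40Gb/s uplinks

client_read_gbs = CLIENT_UPLINK_GBIT / 10          # ~8 GB/s clients can pull
cluster_storage_gbs = NODES * NODE_STORAGE_GBS     # ~16 GB/s the disks can serve

# Even with clients reading flat out, client reads plus recovery reads
# still fit inside what the storage can deliver, so the network never
# becomes the thing you'd need to QoS.
print("clients: ~%d GB/s, storage ceiling: ~%d GB/s"
      % (client_read_gbs, cluster_storage_gbs))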

This of course leaves out the pertinent detail that all this activity will
result in severely degraded performance due to the thrashing HDDs with
default parameters.
So your clients will be hobbled by your storage, not your network.

And if you tune things down so that recovery/backfill has the least
possible impact on your client I/O, that in turn also means vastly
reduced network needs.
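
For reference, the knobs people usually turn down for that are along these
lines; illustrative ceph.conf values only, not a recommendation for your
cluster, and the defaults have moved around between releases, so check
what your version ships with:

[osd]
# keep recovery/backfill polite so client I/O retains priority
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
osd client op priority = 63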


> My initial KISS thought was that a single network was the opposite, due
> to it being the alternate and maybe less tested Ceph configuration.
> Perhaps multi-netting is a better compromise: we still run 2 networks,
> but not over separate VLANs.
> 
If you look at that other thread, you will find that many people run and
prefer single networks.
Just because it's in the documentation and an option doesn't mean it's the
best/correct approach.
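
In ceph.conf terms the single-network setup is simply a public network and
no cluster network at all; a minimal sketch (the subnet below is made up):

[global]
# one flat network: client and replication traffic share it
public network = 10.10.0.0/24
# note: no "cluster network" line at all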

I use split networks in exactly one cluster, my shitty test one, which has
2x 1Gb/s ports per node and _more_ IO bandwidth per node than a single
link.

> Terrible idea?
> 
More to the tune of pointless.

Christian

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


