Re: 800TB - Ceph Physical Architecture Proposal

[re-added the ML]
On Fri, 8 Apr 2016 08:30:21 -0500 Brady Deetz wrote:

> On Thu, Apr 7, 2016 at 9:47 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> >
> > Hello,
> >
> > On Thu, 7 Apr 2016 09:24:46 -0500 Brady Deetz wrote:
> >
> > > I'd appreciate any critique on the following plan.
> > >
> > > Before I detail the plan, here are my current questions.
> > Traditionally they come below the plan/details. ^o^
> >
> > > -----------------------------------------------------------
> > > 1) Am I under-powering the CPU on the proposed OSD node
> > > configuration?
> > >
> > Massively.
> >
> > > 2) Will latency of roughly 300 microseconds introduced by 10g-base-t
> > > deliver appreciably worse write performance than the approximate 850
> > > nanoseconds of latency introduced by 10 or 40g fiber?
> > >
> > Where's that number coming from?
> > I was under the impression that the physical cable plays a minor role
> > when compared to the capabilities of the NICs/switches.
> > That said, latency is bad and to be avoided at "all" cost.
> >
> 
> The latency numbers came from the documentation for the various switches
> I had picked, in this case Brocade VDX 67xx and 69xx. Mellanox Ethernet
> and Infiniband are also enticing. Sounds like 10g-base-t is probably
> ruled out due to more than twice as much switch latency.
> 
Brocade has some nice stuff, we use quite a bit of their gear (that's why I
mentioned vLAG).
Alas, it is massively pricey compared to Cumulus-based gear like the
Penguin Computing switches, which we will test drive later this month.
Those claim 400ns latency, btw. ^o^

> 
> >
> > > 3) I currently don't have a cache tier built into this design. What
> > > would be the best approach to adding an SSD cache tier at this scale?
> > >
> > Winning the lottery first.
> > Failing that, cache tiering with Jewel and beyond can significantly
> > improve things, as long as your truly hot objects (those that get
> > written to "constantly") can fit in your SSD cache pool.
> > This will be helped by working read-recency or the readforward cache
> > mode (only writes will promote objects into the cache pool, reads come
> > from the base pool if the object isn't already in the cache pool).
> > From what you write below I would aim for at least 8TB net data
> > capacity in such a cache pool.
> > If you were to use dedicated storage nodes for this, THEN a
> > significantly faster uplink and interconnect compared to your HDD OSD
> > nodes would be advisable.
> >
> 
> My question was more along the lines of: Should I build my cache tier
> into my existing OSD nodes or should I have dedicated nodes for caching?
> I was thinking about replacing 4 of the 6TB spindles in each OSD node
> with 4x 400GB Intel DC S3710.
> 
It really depends, as I mentioned several times in the past.

Dedicated nodes have the advantage that you can build them more to spec
(fast CPUs mostly) and you don't need to change your config to allow for 2
CRUSH roots on the same HW.
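(For the shared-HW case, that config looks roughly like the sketch below
in the decompiled crushmap; names, IDs and weights are all made up:

  host osd01-ssd {
          id -11
          alg straw
          hash 0  # rjenkins1
          item osd.36 weight 0.364
  }
  root ssd {
          id -21
          alg straw
          hash 0  # rjenkins1
          item osd01-ssd weight 0.364
  }
  rule ssd {
          ruleset 1
          type replicated
          min_size 1
          max_size 10
          step take ssd
          step chooseleaf firstn 0 type host
          step emit
  }

with the SSD OSDs kept out of the default root so the HDD rules never
touch them.)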
But as mentioned above, at the very least all/most client writes will go
through the cache pool and thus you need to give them enough network
bandwidth to handle the expected traffic for your whole cluster.
Not an issue for my use case (and many others) where IOPS are infinitely
more important than throughput (which at 2-4GB/s max isn't shabby either),
but YMMV.
Lastly, dedicated cache nodes can also be grown more easily in case you
don't really need more HDD-based OSDs.

The biggest advantage of shared nodes is of course that they scale out
much better (accumulated bandwidth).
Alas that comes at the price of having to put more/faster cores into them
and give them all faster network gear than an HDD-based node would need.
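
Whichever way you go, the tiering setup itself looks about the same.
A minimal sketch with Jewel-era commands (pool and ruleset names made up,
sizes illustrative, and depending on your release readforward may want a
--yes-i-really-mean-it):

  ceph osd pool create cache-pool 2048 2048 replicated ssd
  ceph osd tier add cephfs-data cache-pool
  ceph osd tier cache-mode cache-pool readforward
  ceph osd tier set-overlay cephfs-data cache-pool
  ceph osd pool set cache-pool hit_set_type bloom
  ceph osd pool set cache-pool hit_set_count 1
  ceph osd pool set cache-pool hit_set_period 3600
  ceph osd pool set cache-pool min_read_recency_for_promote 1
  ceph osd pool set cache-pool target_max_bytes 8796093022208
  ceph osd pool set cache-pool cache_target_dirty_ratio 0.5
  ceph osd pool set cache-pool cache_target_full_ratio 0.8

target_max_bytes above is the ~8TB I mentioned, tune the ratios to how
bursty your writes are.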


[snip]
> > > 11 OSD nodes:
> > > -SuperMicro 6047R-E1R36L
> > > --2x E5-2603v2
> > Vastly underpowered for 36 OSDs.
> > > --128GB RAM
> > > --36x 6TB OSD
> > > --2x Intel P3700 (journals)
> > Which exact model?
> > If it's the 400GB one, that's 2GB/s maximum write speed combined.
> > Slightly below what I'd expect your 36 HDDs to be able to write,
> > about 2.5GB/s (36*70MB/s), but not unreasonably so.
> > However your initial network thoughts are massively overspec'ed for
> > this kind of performance.
> >
> 
> I'll add a 3rd 400GB P3700 to the build. Somebody else on IRC suggested
> that I'd need more throughput as well.
> 
At that rate I'd look at your combined PCIe bandwidth and see if there is
enough of it.
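
Back-of-the-envelope, assuming the x4 AIC versions of the P3700:

  3x P3700 (PCIe 3.0 x4) = 12 lanes =~ 11.8GB/s raw, ~3.2GB/s writes
  dual-port 40GbE NIC    =  8 lanes =~ 10GB/s
  SAS HBA                =  8 lanes

Each E5-26xx brings 40 PCIe 3.0 lanes, so a dual-socket board has enough
of them on paper; the question is how the physical slots are laid out and
which socket they hang off (QPI hops cost you latency).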

> 
> >
> > >
> > > 3 MDS nodes:
> > > -SuperMicro 1028TP-DTR (one node from scale-out chassis)
> > > --2x E5-2630v4
> > > --128GB RAM
> > > --2x 120GB SSD (RAID 1 for OS)
> > Not using CephFS, but if the MDS are like all the other Ceph bits
> > (MONs in particular) they are likely to do happy writes to leveldbs or
> > the likes, do verify that.
> > If that's the case, fast and durable SSDs will be needed.
> >
> >
> Would a 200GB Intel DC S3710 be more what you'd expect? Would you even
> bother to RAID 1 the drive?
> 
That or even a 3610 (at 3 DWPD).
I RAID1 all my SSDs (and Ceph is a RAID of sorts), simply so I don't have
to bother setting things up again, potentially losing some configs or
keys that aren't backed up anywhere.
But then again, not one Intel SSD has died on me yet and most of them
are years away from doing so, endurance wise.
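
For perspective, rated endurance at that size (from memory, so verify
against the Intel specs):

  200GB S3610:  3 DWPD x 200GB x 365 x 5 years =~ 1.1PB written
  200GB S3710: 10 DWPD x 200GB x 365 x 5 years =~ 3.6PB written

Even a constant 5MB/s of leveldb traffic only amounts to ~158TB/year.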

Christian

> 
> > >
> > > 5 MON nodes:
> > > -SuperMicro 1028TP-DTR (one node from scale-out chassis)
> > > --2x E5-2630v4
> > > --128GB RAM
> > > --2x 120GB SSD (RAID 1 for OS)
> > >
> > Total overkill, are you sure you didn't mix up the CPUs for the OSDs
> > with the ones for the MONs?
> > Also, while dedicated MONs are nice, they really can live rather
> > frugally, except for the lust for fast, durable storage.
> > If I were you, I'd get 2 dedicated MON nodes (with few, fastish cores)
> > and 32-64GB RAM, then put the other 3 on your MDS nodes which seem to
> > have plenty resources to go around.
> > You will want the dedicated MONs to have the lowest IPs in your
> > network, the monitor leader is chosen by that.
> >
> > Christian
> > > We'd use our existing Zabbix deployment for monitoring and ELK for
> > > log aggregation.
> > >
> > > Provisioning would be through puppet-razor (PXE) and puppet.
> > >
> > > Again, thank you for any information you can provide
> > >
> > > --Brady
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


