Re: design guidance

Hello,

On Tue, 6 Jun 2017 20:59:40 -0400 Daniel K wrote:

> Christian,
> 
> Thank you for the tips -- I certainly googled my eyes out for a good while
> before asking -- maybe my google-fu wasn't too good last night.
> 
> > I love using IB, alas with just one port per host you're likely best off
> > ignoring it, unless you have a converged network/switches that can make
> > use of it (or run it in Ethernet mode).  
> 
> I've always heard people speak fondly of IB, but I've honestly never dealt
> with it. I'm mostly a network guy at heart, so I'm perfectly comfortable
> aggregating 10GB/s connections till the cows come home. What are some of
> the virtues of IB, over ethernet? (not ethernet over IB)
>
IB natively is very low latency and until not so long ago was also
significantly cheaper than respective Ethernet offerings.

With IPoIB you lose some of that latency advantage, but it's still quite
good. And with the advent of "cheap" whitebox as well as big-brand
switches, that IB cost advantage has been eroding, too.

Native IB support for Ceph has been in development for years, so don't
hold your breath there, though. 
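If you do end up playing with IPoIB, a quick sanity check of link state and
latency versus your 10GbE goes a long way. A rough sketch (assuming the usual
infiniband-diags/qperf tools, an ib0 interface and a <peer> host running
"qperf" as the server; adjust names to your setup):

  ibstat                        # HCA state, rate and LID
  cat /sys/class/net/ib0/mode   # datagram vs. connected mode (connected allows a larger MTU)
  qperf <peer> tcp_lat tcp_bw   # TCP latency/bandwidth over IPoIB, compare with the same run over 10GbE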
 
> > Bluestore doesn't have journals per se and unless you're going to wait for
> > Luminous I wouldn't recommend using Bluestore in production.
> > Hell, I won't be using it any time soon, but anything pre L sounds
> > like outright channeling Murphy to smite you  
> 
> I do like to play with fire often, but not normally with other people's
> data. I suppose I will stay away from Bluestore for now, unless Luminous is
> released within the next few weeks. I am using it on  Kraken in my small
> test-cluster so far without a visit from Murphy.
> 
If you look at the ML archives, there seem to be plenty of problems
cropping up, some more serious than others.
And expect another slew when it goes mainstream and/or becomes the default.

> > That said, what SSD is it?
> > Bluestore WAL needs are rather small.
> > OTOH, a single SSD isn't something I'd recommend either, SPOF and all.  
> 
> > I'm guessing you have no budget to improve on that gift horse?  
> 
> It's a Micron 1100 256Gb, rated for 120TBW, which works out to about
> 100GB/day for 3 years, so not even .5DWPD. I doubt it has the endurance to
> journal 36 1TB drives.
> 
Yeah, no go with that.
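To spell that out (assuming the 120TBW rating is meant to be spread over the
3-year warranty period):

  echo "scale=1; 120*1000/(3*365)" | bc   # ~110 GB/day of rated writes
  echo "scale=2; 110/256" | bc            # well under 0.5 drive writes per day (DWPD)

Journal traffic for 36 OSDs plus write amplification would chew through that
in no time.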

> I do have some room in the budget, and NVMe journals have been on the back
> of my mind. These servers have 6 PCIe x8 slots in them, so tons of room.
> But then I'm going to get asked about a cache tier, which everyone seems to
> think is the holy grail (and probably would be, if they could 'just work')
> 
> But from what I read, they're an utter nightmare to manage, particularly
> without a well defined workload, and often would hurt more than they help.
> 
I'm very happy with them, but I do have a perfect match in terms of
workload, use case and experience. ^o^

But any way you slice it, those servers are underpowered CPU-wise, and
putting more things into them won't improve matters.
Building a cache-tier on dedicated nodes (that's what I do) is another
story.
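The plumbing itself is only a handful of commands; the hard part is sizing
and tuning it for the workload. A minimal sketch (pool names "rbd" and
"cache" and all values here are just placeholders, not recommendations):

  ceph osd tier add rbd cache
  ceph osd tier cache-mode cache writeback
  ceph osd tier set-overlay rbd cache
  ceph osd pool set cache hit_set_type bloom
  ceph osd pool set cache target_max_bytes 1099511627776    # ~1TB, the agent flushes/evicts from here
  ceph osd pool set cache cache_target_dirty_ratio 0.4
  ceph osd pool set cache cache_target_full_ratio 0.8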

To give you something to correlate CPU usage/needs against: I'm literally in
the middle of setting up a new cluster (small, as usual).

The 3 HDD storage nodes have 1 E5-1650 v3 @ 3.50GHz CPU (6 core/SMT = 12
linux cores), 64GB RAM, IB, 12 3TB SAS HDDs and 2 400GB DC S3710 SSDs for
OS and journals.

The 3 cache-tier nodes have 2 E5-2623 v3 @ 3.00GHz CPUs (4 cores/SMT), 64GB
RAM, IB and 5 800GB DC S3610 SSDs.

If you run this fio job against a kernel-mounted RBD image (much the same
from a VM using userspace Ceph):

"fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=32"

you wind up with about 50% CPU usage (half a pseudo-core) per OSD process
on the HDD storage nodes, because the HDDs are at 100% utilization and are
the bottleneck. The journal SSDs are bored at around 20%.
Still, that's half of one of these 3.5GHz cores (yes, it ramps up to full
speed) gone per OSD; now relate that to your three times as many OSDs per
node and slower CPUs...
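(Those numbers are simply from watching the nodes during the fio run;
something along these lines will show you the same on your hardware, with
"ceph-osd" being the usual OSD process name:)

  iostat -x 5                          # per-device %util, await, etc.
  pidstat -p $(pgrep -d, ceph-osd) 5   # CPU per OSD process; "top -H" for per-thread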

On the cache-tier pool the same fio job results in about 50% SSD utilization,
but with each OSD process now consuming about 210%, a full real core.
So this one is neither CPU nor storage limited; the latency/RTT is the
limiting factor here.

Fun fact, for large sequential writes the cache-tier is actually slightly
slower, due to the co-location of the journals on the OSD SSDs. 
For small, random IOPS that of course is not true. ^.^
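If you want to reproduce that comparison, the sequential variant is just the
same fio job with a different pattern and block size, e.g. (--rw=write and
4M blocks are my stand-in for "large sequential", not necessarily what I ran):

"fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
 --rw=write --name=fiojob --blocksize=4M --iodepth=32"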

> I haven't spent a ton of time with the network gear that was dumped on me,
> but the switches I have now are a Nexus 7000, x4 Force10 S4810 (so I do
> have some stackable 10Gb that I can MC-LAG), x2 Mellanox IS5023 (18 port IB
> switch), what appears to be a giant IB switch (Qlogic 12800-120) and
> another apparently big boy (Qlogic 12800-180). I'm going to pick them up
> from the warehouse tomorrow.
> 
> If I stay away from IB completely, may just use the IB card as a 4x10GB +
> the 2x 10GB on board like I had originally mentioned. But if that IB gear
> is good, I'd hate to see it go to waste. Might be worth getting a second IB
> card for each server.
> 
If you're happy with getting old ones, I hear they can be found quite
cheap.
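If you do go the 2x 10GbE route with MC-LAG capable switches, an LACP bond is
the standard recipe. A rough, non-persistent sketch with iproute2 (interface
names and the address are placeholders):

  ip link add bond0 type bond mode 802.3ad miimon 100 xmit_hash_policy layer3+4
  ip link set eth2 down && ip link set eth2 master bond0
  ip link set eth3 down && ip link set eth3 master bond0
  ip link set bond0 up
  ip addr add 192.0.2.11/24 dev bond0

Make it persistent via your distro's network config, and run the cluster and
public networks as VLANs on top if you want the separation.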

Christian
> 
> 
> Again, thanks a million for the advice. I'd rather learn this the easy way
> than to have to rebuild this 6 times over the next 6 months.
> 
> 
> 
> 
> 
> 
> On Tue, Jun 6, 2017 at 2:05 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> >
> > Hello,
> >
> > lots of similar questions in the past, google is your friend.
> >
> > On Mon, 5 Jun 2017 23:59:07 -0400 Daniel K wrote:
> >  
> > > I've built 'my-first-ceph-cluster' with two of the 4-node, 12 drive
> > > Supermicro servers and dual 10Gb interfaces(one cluster, one public)
> > >
> > > I now have 9x 36-drive supermicro StorageServers made available to me,  
> > each  
> > > with dual 10GB and a single Mellanox IB/40G nic. No 1G interfaces except
> > > IPMI. 2x 6-core 6-thread 1.7ghz xeon processors (12 cores total) for 36
> > > drives. Currently 32GB of ram. 36x 1TB 7.2k drives.
> > >  
> > I love using IB, alas with just one port per host you're likely best off
> > ignoring it, unless you have a converged network/switches that can make
> > use of it (or run it in Ethernet mode).
> >  
> > > Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5 and
> > > 6.0 hosts(migrating from a VMWare environment), later to transition to
> > > qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio and  
> > saw  
> > > much worse performance with the first cluster, so it seems this may be  
> > the  
> > > better way, but I'm open to other suggestions.
> > >  
> > I've never seen any ultimate solution to providing HA iSCSI on top of
> > Ceph, though other people here have made significant efforts.
> >  
> > > Considerations:
> > > Best practice documents indicate .5 cpu per OSD, but I have 36 drives and
> > > 12 CPUs. Would it be better to create 18x 2-drive raid0 on the hardware
> > > raid card to present a fewer number of larger devices to ceph? Or run
> > > multiple drives per OSD?
> > >  
> > You're definitely underpowered in the CPU department and I personally
> > would make RAID1 or 10s for never having to re-balance an OSD.
> > But if space is an issue, RAID0s would do.
> > OTOH, w/o any SSDs in the game your HDD only cluster is going to be less
> > CPU hungry than others.
> >  
> > > There is a single 256gb SSD which i feel would be a bottleneck if I used  
> > it  
> > > as a journal for all 36 drives, so I believe bluestore with a journal on
> > > each drive would be the best option.
> > >  
> > Bluestore doesn't have journals per se and unless you're going to wait for
> > Luminous I wouldn't recommend using Bluestore in production.
> > Hell, I won't be using it any time soon, but anything pre L sounds
> > like outright channeling Murphy to smite you.
> >
> > That said, what SSD is it?
> > Bluestore WAL needs are rather small.
> > OTOH, a single SSD isn't something I'd recommend either, SPOF and all.
> >
> > I'm guessing you have no budget to improve on that gift horse?
> >  
> > > Is 1.7Ghz too slow for what I'm doing?
> > >  
> > If you're going to have a lot of small I/Os it probably will be.
> >  
> > > I like the idea of keeping the public and cluster networks separate.  
> >
> > I don't, at least not on a physical level when you pay for this by losing
> > redundancy.
> > Do you have 2 switches, are they MC-LAG capable (aka stackable)?
> >  
> > >Any
> > > suggestions on which interfaces to use for what? I could theoretically  
> > push  
> > > 36Gb/s, figuring 125MB/s for each drive, but in reality will I ever see
> > > that?  
> > Not by a long shot, even with Bluestore.
> > With the WAL and other bits on SSD and very kind write patterns, maybe
> > 100MB/s per drive, but IIRC there were issues with current Bluestore and
> > performance as well.
> >  
> > >Perhaps bond the two 10GB and use them as the public, and the 40gb as
> > > the cluster network? Or split the 40gb in to 4x10gb and use 3x10GB bonded
> > > for each?
> > >  
> > If you can actually split it up, see above, mc-LAG.
> > That will give you 60Gb/s, half that if a switch fails, and if it makes you
> > feel better, do the cluster and public with VLANs.
> >
> > But that will cost you in not so cheap switch ports, of course.
> >
> > Christian  
> > > If there is a more appropriate venue for my request, please point me in
> > > that direction.
> > >
> > > Thanks,
> > > Dan  
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Rakuten Communications
> >  


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
