Re: design guidance

Christian,

Thank you for the tips -- I certainly googled my eyes out for a good while before asking -- maybe my google-fu wasn't too good last night.

> I love using IB, alas with just one port per host you're likely best off
> ignoring it, unless you have a converged network/switches that can make
> use of it (or run it in Ethernet mode).

I've always heard people speak fondly of IB, but I've honestly never dealt with it. I'm mostly a network guy at heart, so I'm perfectly comfortable aggregating 10Gb/s connections till the cows come home. What are some of the virtues of IB over Ethernet? (not Ethernet over IB)

> Bluestore doesn't have journals per se and unless you're going to wait for
> Luminous I wouldn't recommend using Bluestore in production.
> Hell, I won't be using it any time soon, but anything pre L sounds
> like outright channeling Murphy to smite you.

I do like to play with fire often, but not normally with other people's data. I suppose I will stay away from Bluestore for now, unless Luminous is released within the next few weeks. I am using it on Kraken in my small test cluster so far without a visit from Murphy.

> That said, what SSD is it?
> Bluestore WAL needs are rather small.
> OTOH, a single SSD isn't something I'd recommend either, SPOF and all.
>
> I'm guessing you have no budget to improve on that gift horse?

It's a Micron 1100 256GB, rated for 120TBW, which works out to about 100GB/day over 3 years, so not even 0.5 DWPD. I doubt it has the endurance to journal 36x 1TB drives.
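
Back-of-the-envelope, in case anyone wants to check my math (the 120TBW and 256GB figures are from the spec sheet, the 3-year window is my assumption, and it comes out a touch over my rounded 100GB/day):

# Rough endurance math for the Micron 1100 256GB (120 TBW rating).
# Assumes a 3-year service window and decimal units (1TB = 1000GB).
tbw = 120                       # rated terabytes written
capacity_gb = 256               # drive capacity in GB
days = 3 * 365                  # assumed service life

gb_per_day = tbw * 1000 / days          # ~110 GB/day of rated writes
dwpd = gb_per_day / capacity_gb         # drive writes per day
print(f"{gb_per_day:.0f} GB/day allowed, {dwpd:.2f} DWPD")
# -> roughly 110 GB/day, ~0.43 DWPD -- nowhere near what 36 spinners
#    worth of journal/WAL traffic would throw at it.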

I do have some room in the budget, and NVMe journals have been in the back of my mind. These servers have 6 PCIe x8 slots in them, so there's tons of room. But then I'm going to get asked about a cache tier, which everyone seems to think is the holy grail (and probably would be, if it could 'just work').

But from what I read, they're an utter nightmare to manage, particularly without a well-defined workload, and often hurt more than they help.
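
To illustrate what I mean by "manage": even a minimal writeback cache tier needs a pile of knobs set before it behaves sanely, and every value depends on the workload. A rough sketch of what I understand is involved, driving the stock ceph CLI from Python -- the pool names ("rbd", "rbd-cache") and every number below are made up, not a recommendation:

# Sketch only: the tuning a writeback cache tier needs before it's usable.
# Pool names and all thresholds are placeholders that depend on the workload.
import subprocess

cmds = [
    "ceph osd tier add rbd rbd-cache",                    # attach cache pool to base pool
    "ceph osd tier cache-mode rbd-cache writeback",
    "ceph osd tier set-overlay rbd rbd-cache",            # route client I/O through the cache
    "ceph osd pool set rbd-cache hit_set_type bloom",     # required before the tier works at all
    "ceph osd pool set rbd-cache hit_set_count 1",
    "ceph osd pool set rbd-cache hit_set_period 3600",
    "ceph osd pool set rbd-cache target_max_bytes 200000000000",  # ~200GB cap, workload dependent
    "ceph osd pool set rbd-cache cache_target_dirty_ratio 0.4",   # when flushing starts
    "ceph osd pool set rbd-cache cache_target_full_ratio 0.8",    # when eviction starts
]
for cmd in cmds:
    subprocess.check_call(cmd.split())

Get the flush/evict ratios or the hit_set parameters wrong for the workload and the tier thrashes instead of helping, which seems to be exactly the failure mode people keep describing.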

I haven't spent a ton of time with the network gear that was dumped on me, but the switches I have now are a Nexus 7000, 4x Force10 S4810 (so I do have some stackable 10Gb that I can MC-LAG), 2x Mellanox IS5023 (18-port IB switches), what appears to be a giant IB switch (Qlogic 12800-120), and another apparently big boy (Qlogic 12800-180). I'm going to pick them up from the warehouse tomorrow.

If I stay away from IB completely, I may just use the IB card as 4x 10Gb plus the 2x 10Gb onboard like I had originally mentioned. But if that IB gear is good, I'd hate to see it go to waste. It might be worth getting a second IB card for each server.
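
For my own sanity, the rough per-host numbers I'm weighing (125MB/s per spindle is the same best-case sequential figure from my original post; real Ceph traffic will land well below it):

# Compare theoretical spindle throughput against the per-host NIC options.
drives = 36
mb_per_drive = 125                              # optimistic sequential figure per 7.2k drive
disk_gbit = drives * mb_per_drive * 8 / 1000    # ~36 Gb/s theoretical ceiling

options = {
    "2x 10Gb onboard, bonded": 20,
    "4x 10Gb (split IB card) + 2x 10Gb onboard": 60,
    "single 40Gb IB port": 40,                  # ~32 Gb/s usable if it's QDR (8b/10b encoding)
}
print(f"theoretical disk ceiling: {disk_gbit:.0f} Gb/s")
for name, gbit in options.items():
    print(f"{name}: {gbit} Gb/s ({'covers' if gbit >= disk_gbit else 'short of'} the ceiling)")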



Again, thanks a million for the advice. I'd rather learn this the easy way than to have to rebuild this 6 times over the next 6 months.






On Tue, Jun 6, 2017 at 2:05 AM, Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

lots of similar questions in the past, google is your friend.

On Mon, 5 Jun 2017 23:59:07 -0400 Daniel K wrote:

> I've built 'my-first-ceph-cluster' with two of the 4-node, 12 drive
> Supermicro servers and dual 10Gb interfaces (one cluster, one public)
>
> I now have 9x 36-drive supermicro StorageServers made available to me, each
> with dual 10Gb and a single Mellanox IB/40G NIC. No 1G interfaces except
> IPMI. 2x 6-core 6-thread 1.7GHz Xeon processors (12 cores total) for 36
> drives. Currently 32GB of ram. 36x 1TB 7.2k drives.
>
I love using IB, alas with just one port per host you're likely best off
ignoring it, unless you have a converged network/switches that can make
use of it (or run it in Ethernet mode).

> Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5 and
> 6.0 hosts(migrating from a VMWare environment), later to transition to
> qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio and saw
> much worse performance with the first cluster, so it seems this may be the
> better way, but I'm open to other suggestions.
>
I've never seen any ultimate solution to providing HA iSCSI on top of
Ceph, though other people here have made significant efforts.

> Considerations:
> Best practice documents indicate .5 cpu per OSD, but I have 36 drives and
> 12 CPUs. Would it be better to create 18x 2-drive raid0 on the hardware
> raid card to present a fewer number of larger devices to ceph? Or run
> multiple drives per OSD?
>
You're definitely underpowered in the CPU department and I personally
would make RAID1 or 10s for never having to re-balance an OSD.
But if space is an issue, RAID0s would do.
OTOH, w/o any SSDs in the game your HDD only cluster is going to be less
CPU hungry than others.

> There is a single 256GB SSD which I feel would be a bottleneck if I used it
> as a journal for all 36 drives, so I believe bluestore with a journal on
> each drive would be the best option.
>
Bluestore doesn't have journals per se and unless you're going to wait for
Luminous I wouldn't recommend using Bluestore in production.
Hell, I won't be using it any time soon, but anything pre L sounds
like outright channeling Murphy to smite you.

That said, what SSD is it?
Bluestore WAL needs are rather small.
OTOH, a single SSD isn't something I'd recommend either, SPOF and all.

I'm guessing you have no budget to improve on that gift horse?

> Is 1.7Ghz too slow for what I'm doing?
>
If you're going to have a lot of small I/Os it probably will be.

> I like the idea of keeping the public and cluster networks separate.

I don't, at least not on a physical level when you pay for this by losing
redundancy.
Do you have 2 switches, are they MC-LAG capable (aka stackable)?

> Any
> suggestions on which interfaces to use for what? I could theoretically push
> 36Gb/s, figuring 125MB/s for each drive, but in reality will I ever see
> that?
Not by a long shot, even with Bluestore.
With the WAL and other bits on SSD and very kind write patterns, maybe
100MB/s per drive, but IIRC there were issues with current Bluestore and
performance as well.

> Perhaps bond the two 10Gb and use them as the public, and the 40Gb as
> the cluster network? Or split the 40Gb into 4x 10Gb and use 3x 10Gb bonded
> for each?
>
If you can actually split it up, see above, MC-LAG.
That will give you 60Gb/s, half that if a switch fails, and if it makes you
feel better, do the cluster and public with VLANs.

But that will cost you in not so cheap switch ports, of course.

Christian
> If there is a more appropriate venue for my request, please point me in
> that direction.
>
> Thanks,
> Dan


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications

