Christian,
Thank you for the tips -- I certainly googled my eyes out for a good while before asking -- maybe my google-fu wasn't too good last night.
> I love using IB, alas with just one port per host you're likely best off
> ignoring it, unless you have a converged network/switches that can make
> use of it (or run it in Ethernet mode).
I've always heard people speak fondly of IB, but I've honestly never dealt with it. I'm mostly a network guy at heart, so I'm perfectly comfortable aggregating 10Gb/s connections till the cows come home. What are some of the virtues of IB over Ethernet (not Ethernet over IB)?
> Bluestore doesn't have journals per se and unless you're going to wait for
> Luminous I wouldn't recommend using Bluestore in production.
> Hell, I won't be using it any time soon, but anything pre L sounds
> like outright channeling Murphy to smite you.
I do like to play with fire often, but not normally with other people's data. I suppose I will stay away from Bluestore for now, unless Luminous is released within the next few weeks. I am using it on Kraken in my small test-cluster so far without a visit from Murphy.
> That said, what SSD is it?
> Bluestore WAL needs are rather small.
> OTOH, a single SSD isn't something I'd recommend either, SPOF and all.
> I'm guessing you have no budget to improve on that gift horse?
It's a Micron 1100 256GB, rated for 120 TBW, which works out to about 100GB/day for 3 years, so not even 0.5 DWPD. I doubt it has the endurance to journal 36x 1TB drives.
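Back-of-the-envelope, in case anyone wants to check my math (a rough Python sketch; the three-year window and an even write spread across the drives are my assumptions, not specs):

# Endurance math for the Micron 1100 256GB (rated 120 TBW).
# The 3-year service life and even write spread are assumptions.
tbw_gb = 120 * 1000                    # rated endurance in GB written
gb_per_day = tbw_gb / (3 * 365)        # ~110 GB/day of allowed writes
dwpd = gb_per_day / 256                # ~0.43 drive writes per day
gb_per_osd_per_day = gb_per_day / 36   # ~3 GB/day each if it journaled all 36 drives
print(f"{gb_per_day:.0f} GB/day, {dwpd:.2f} DWPD, {gb_per_osd_per_day:.1f} GB/day per OSD")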
I do have some room in the budget, and NVMe journals have been in the back of my mind. These servers have 6 PCIe x8 slots each, so there's tons of room. But then I'm going to get asked about a cache tier, which everyone seems to think is the holy grail (and probably would be, if they could 'just work').
From what I've read, though, cache tiers are an utter nightmare to manage, particularly without a well-defined workload, and often hurt more than they help.
I haven't spent a ton of time with the network gear that was dumped on me, but the switches I have now are a Nexus 7000, 4x Force10 S4810 (so I do have some stackable 10Gb that I can MC-LAG), 2x Mellanox IS5023 (18-port IB switches), what appears to be a giant IB switch (a QLogic 12800-120), and another apparently big boy (a QLogic 12800-180). I'm going to pick them up from the warehouse tomorrow.
If I stay away from IB completely, I may just use the IB card as 4x 10Gb plus the 2x 10Gb on board, as I originally mentioned. But if that IB gear is good, I'd hate to see it go to waste. It might be worth getting a second IB card for each server.
Again, thanks a million for the advice. I'd rather learn this the easy way than to have to rebuild this 6 times over the next 6 months.
On Tue, Jun 6, 2017 at 2:05 AM, Christian Balzer <chibi@xxxxxxx> wrote:
Hello,
lots of similar questions in the past, google is your friend.
On Mon, 5 Jun 2017 23:59:07 -0400 Daniel K wrote:
> I've built 'my-first-ceph-cluster' with two of the 4-node, 12 drive
> Supermicro servers and dual 10Gb interfaces(one cluster, one public)
>
> I now have 9x 36-drive supermicro StorageServers made available to me, each
> with dual 10GB and a single Mellanox IB/40G nic. No 1G interfaces except
> IPMI. 2x 6-core 6-thread 1.7ghz xeon processors (12 cores total) for 36
> drives. Currently 32GB of ram. 36x 1TB 7.2k drives.
>
I love using IB, alas with just one port per host you're likely best off
ignoring it, unless you have a converged network/switches that can make
use of it (or run it in Ethernet mode).
> Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5 and
> 6.0 hosts(migrating from a VMWare environment), later to transition to
> qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio and saw
> much worse performance with the first cluster, so it seems this may be the
> better way, but I'm open to other suggestions.
>
I've never seen any ultimate solution to providing HA iSCSI on top of
Ceph, though other people here have made significant efforts.
> Considerations:
> Best practice documents indicate .5 cpu per OSD, but I have 36 drives and
> 12 CPUs. Would it be better to create 18x 2-drive raid0 on the hardware
> raid card to present a fewer number of larger devices to ceph? Or run
> multiple drives per OSD?
>
You're definitely underpowered in the CPU department, and I personally
would go with RAID1s or RAID10s so you never have to re-balance an OSD.
But if space is an issue, RAID0s would do.
OTOH, w/o any SSDs in the game your HDD-only cluster is going to be less
CPU hungry than others.
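To put rough numbers on that 0.5 CPU per OSD guideline (just a sketch; the ratio is the usual rule of thumb, not a hard limit):

# ~0.5 core per HDD OSD is the commonly quoted rule of thumb (an assumption).
cores = 12            # 2x 6-core Xeons, no hyper-threading
cpu_per_osd = 0.5

for osds in (36, 18):   # 36 single-drive OSDs vs. 18x 2-drive RAID OSDs
    needed = osds * cpu_per_osd
    print(f"{osds} OSDs want ~{needed:.0f} cores, {cores} available")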
> There is a single 256gb SSD which i feel would be a bottleneck if I used it
> as a journal for all 36 drives, so I believe bluestore with a journal on
> each drive would be the best option.
>
Bluestore doesn't have journals per se and unless you're going to wait for
Luminous I wouldn't recommend using Bluestore in production.
Hell, I won't be using it any time soon, but anything pre L sounds
like outright channeling Murphy to smite you.
That said, what SSD is it?
Bluestore WAL needs are rather small.
OTOH, a single SSD isn't something I'd recommend either, SPOF and all.
I'm guessing you have no budget to improve on that gift horse?
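Back-of-the-envelope on that single SSD (the per-OSD WAL figure below is an assumption based on commonly quoted ballparks, not a spec):

# How far a single 256GB SSD stretches across 36 OSDs (rough sketch).
ssd_gb = 256
osds = 36
wal_gb_per_osd = 1                        # assumed Bluestore WAL ballpark

share_gb = ssd_gb / osds                  # ~7 GB of SSD per OSD if split evenly
print(f"{share_gb:.1f} GB/OSD available vs. ~{wal_gb_per_osd} GB WAL -> space is fine,")
print(f"but one dead SSD still takes all {osds} OSDs on that host down with it.")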
> Is 1.7Ghz too slow for what I'm doing?
>
If you're going to have a lot of small I/Os it probably will be.
> I like the idea of keeping the public and cluster networks separate.
I don't, at least not on a physical level, when you pay for it by losing
redundancy.
Do you have 2 switches, are they MC-LAG capable (aka stackable)?
> Any
> suggestions on which interfaces to use for what? I could theoretically push
> 36Gb/s, figuring 125MB/s for each drive, but in reality will I ever see
> that?
Not by a long shot, even with Bluestore.
With the WAL and other bits on SSD and very kind write patterns, maybe
100MB/s per drive, but IIRC there were issues with current Bluestore and
performance as well.
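To quantify that (a quick sketch using the per-drive figures from this thread, which are rough numbers rather than benchmarks):

# Aggregate throughput for 36x 7.2k HDDs, theoretical vs. kind-workload realistic.
drives = 36
for label, mb_s in (("theoretical", 125), ("realistic-ish", 100)):
    gbit = drives * mb_s * 8 / 1000
    print(f"{label}: {drives} x {mb_s} MB/s ~= {gbit:.0f} Gb/s aggregate")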
> Perhaps bond the two 10GB and use them as the public, and the 40gb as
> the cluster network? Or split the 40gb in to 4x10gb and use 3x10GB bonded
> for each?
>
If you can actually split it up, see above, MC-LAG.
That will give you 60Gb/s, half that if a switch fails, and if it makes you
feel better, do the cluster and public networks with VLANs.
But that will cost you in not-so-cheap switch ports, of course.
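If it helps, a minimal ceph.conf sketch of that VLAN split (the subnets are made up, adjust to whatever you actually number the VLANs):

[global]
    # public (client) VLAN on the bond
    public network = 192.168.10.0/24
    # cluster (replication) VLAN on the same bond
    cluster network = 192.168.20.0/24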
Christian
> If there is a more appropriate venue for my request, please point me in
> that direction.
>
> Thanks,
> Dan
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com