Re: design guidance

Christian,

Thank you for the tips -- I certainly googled my eyes out for a good while before asking -- maybe my google-fu wasn't too good last night.

> I love using IB, alas with just one port per host you're likely best off
> ignoring it, unless you have a converged network/switches that can make
> use of it (or run it in Ethernet mode).

I've always heard people speak fondly of IB, but I've honestly never dealt with it. I'm mostly a network guy at heart, so I'm perfectly comfortable aggregating 10Gb/s connections till the cows come home. What are some of the virtues of IB over Ethernet? (not Ethernet over IB)

> Bluestore doesn't have journals per se and unless you're going to wait for
> Luminous I wouldn't recommend using Bluestore in production.
> Hell, I won't be using it any time soon, but anything pre L sounds
> like outright channeling Murphy to smite you.

I do like to play with fire often, but not normally with other people's data. I suppose I will stay away from Bluestore for now, unless Luminous is released within the next few weeks. I am using it on Kraken in my small test cluster so far without a visit from Murphy.

> That said, what SSD is it?
> Bluestore WAL needs are rather small.
> OTOH, a single SSD isn't something I'd recommend either, SPOF and all.
>
> I'm guessing you have no budget to improve on that gift horse?

It's a Micron 1100 256GB, rated for 120TBW, which works out to about 100GB/day over 3 years, so not even 0.5 DWPD. I doubt it has the endurance to journal 36x 1TB drives.
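
Back-of-the-envelope, in case anyone wants to check my math (the 120TBW and 256GB figures are from the spec sheet, the 3-year window is my assumption, and it comes out a touch over my rounded 100GB/day):

# Rough endurance math for the Micron 1100 256GB (120 TBW rating).
# Assumes a 3-year service window and decimal units (1TB = 1000GB).
tbw = 120                       # rated terabytes written
capacity_gb = 256               # drive capacity in GB
days = 3 * 365                  # assumed service life

gb_per_day = tbw * 1000 / days          # ~110 GB/day of rated writes
dwpd = gb_per_day / capacity_gb         # drive writes per day
print(f"{gb_per_day:.0f} GB/day allowed, {dwpd:.2f} DWPD")
# -> roughly 110 GB/day, ~0.43 DWPD -- nowhere near what 36 spinners
#    worth of journal/WAL traffic would throw at it.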

I do have some room in the budget, and NVMe journals have been in the back of my mind. These servers have 6 PCIe x8 slots in them, so there's tons of room. But then I'm going to get asked about a cache tier, which everyone seems to think is the holy grail (and probably would be, if it could 'just work').

But from what I read, they're an utter nightmare to manage, particularly without a well-defined workload, and often hurt more than they help.
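
To illustrate what I mean by "manage": even a minimal writeback cache tier needs a pile of knobs set before it behaves sanely, and every value depends on the workload. A rough sketch of what I understand is involved, driving the stock ceph CLI from Python -- the pool names ("rbd", "rbd-cache") and every number below are made up, not a recommendation:

# Sketch only: the tuning a writeback cache tier needs before it's usable.
# Pool names and all thresholds are placeholders that depend on the workload.
import subprocess

cmds = [
    "ceph osd tier add rbd rbd-cache",                    # attach cache pool to base pool
    "ceph osd tier cache-mode rbd-cache writeback",
    "ceph osd tier set-overlay rbd rbd-cache",            # route client I/O through the cache
    "ceph osd pool set rbd-cache hit_set_type bloom",     # required before the tier works at all
    "ceph osd pool set rbd-cache hit_set_count 1",
    "ceph osd pool set rbd-cache hit_set_period 3600",
    "ceph osd pool set rbd-cache target_max_bytes 200000000000",  # ~200GB cap, workload dependent
    "ceph osd pool set rbd-cache cache_target_dirty_ratio 0.4",   # when flushing starts
    "ceph osd pool set rbd-cache cache_target_full_ratio 0.8",    # when eviction starts
]
for cmd in cmds:
    subprocess.check_call(cmd.split())

Get the flush/evict ratios or the hit_set parameters wrong for the workload and the tier thrashes instead of helping, which seems to be exactly the failure mode people keep describing.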

I haven't spent a ton of time with the network gear that was dumped on me, but the switches I have now are a Nexus 7000, 4x Force10 S4810 (so I do have some stackable 10Gb that I can MC-LAG), 2x Mellanox IS5023 (18-port IB switches), what appears to be a giant IB switch (Qlogic 12800-120), and another apparently big boy (Qlogic 12800-180). I'm going to pick them up from the warehouse tomorrow.

If I stay away from IB completely, I may just use the IB card as 4x 10Gb plus the 2x 10Gb onboard like I had originally mentioned. But if that IB gear is good, I'd hate to see it go to waste. It might be worth getting a second IB card for each server.
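
For my own sanity, the rough per-host numbers I'm weighing (125MB/s per spindle is the same best-case sequential figure from my original post; real Ceph traffic will land well below it):

# Compare theoretical spindle throughput against the per-host NIC options.
drives = 36
mb_per_drive = 125                              # optimistic sequential figure per 7.2k drive
disk_gbit = drives * mb_per_drive * 8 / 1000    # ~36 Gb/s theoretical ceiling

options = {
    "2x 10Gb onboard, bonded": 20,
    "4x 10Gb (split IB card) + 2x 10Gb onboard": 60,
    "single 40Gb IB port": 40,                  # ~32 Gb/s usable if it's QDR (8b/10b encoding)
}
print(f"theoretical disk ceiling: {disk_gbit:.0f} Gb/s")
for name, gbit in options.items():
    print(f"{name}: {gbit} Gb/s ({'covers' if gbit >= disk_gbit else 'short of'} the ceiling)")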



Again, thanks a million for the advice. I'd rather learn this the easy way than to have to rebuild this 6 times over the next 6 months.






On Tue, Jun 6, 2017 at 2:05 AM, Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

lots of similar questions in the past, google is your friend.

On Mon, 5 Jun 2017 23:59:07 -0400 Daniel K wrote:

> I've built 'my-first-ceph-cluster' with two of the 4-node, 12 drive
> Supermicro servers and dual 10Gb interfaces (one cluster, one public)
>
> I now have 9x 36-drive supermicro StorageServers made available to me, each
> with dual 10Gb and a single Mellanox IB/40G NIC. No 1G interfaces except
> IPMI. 2x 6-core 6-thread 1.7GHz Xeon processors (12 cores total) for 36
> drives. Currently 32GB of ram. 36x 1TB 7.2k drives.
>
I love using IB, alas with just one port per host you're likely best off
ignoring it, unless you have a converged network/switches that can make
use of it (or run it in Ethernet mode).

> Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5 and
> 6.0 hosts(migrating from a VMWare environment), later to transition to
> qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio and saw
> much worse performance with the first cluster, so it seems this may be the
> better way, but I'm open to other suggestions.
>
I've never seen any ultimate solution to providing HA iSCSI on top of
Ceph, though other people here have made significant efforts.

> Considerations:
> Best practice documents indicate .5 cpu per OSD, but I have 36 drives and
> 12 CPUs. Would it be better to create 18x 2-drive raid0 on the hardware
> raid card to present a fewer number of larger devices to ceph? Or run
> multiple drives per OSD?
>
You're definitely underpowered in the CPU department and I personally
would make RAID1 or 10s for never having to re-balance an OSD.
But if space is an issue, RAID0s would do.
OTOH, w/o any SSDs in the game your HDD only cluster is going to be less
CPU hungry than others.

> There is a single 256GB SSD which I feel would be a bottleneck if I used it
> as a journal for all 36 drives, so I believe bluestore with a journal on
> each drive would be the best option.
>
Bluestore doesn't have journals per se and unless you're going to wait for
Luminous I wouldn't recommend using Bluestore in production.
Hell, I won't be using it any time soon, but anything pre L sounds
like outright channeling Murphy to smite you.

That said, what SSD is it?
Bluestore WAL needs are rather small.
OTOH, a single SSD isn't something I'd recommend either, SPOF and all.

I'm guessing you have no budget to improve on that gift horse?

> Is 1.7Ghz too slow for what I'm doing?
>
If you're going to have a lot of small I/Os it probably will be.

> I like the idea of keeping the public and cluster networks separate.

I don't, at least not on a physical level when you pay for this by losing
redundancy.
Do you have 2 switches, are they MC-LAG capable (aka stackable)?

> Any
> suggestions on which interfaces to use for what? I could theoretically push
> 36Gb/s, figuring 125MB/s for each drive, but in reality will I ever see
> that?
Not by a long shot, even with Bluestore.
With the WAL and other bits on SSD and very kind write patterns, maybe
100MB/s per drive, but IIRC there were issues with current Bluestore and
performance as well.

> Perhaps bond the two 10Gb and use them as the public, and the 40Gb as
> the cluster network? Or split the 40Gb into 4x 10Gb and use 3x 10Gb bonded
> for each?
>
If you can actually split it up, see above, MC-LAG.
That will give you 60Gb/s, half that if a switch fails, and if it makes you
feel better, do the cluster and public with VLANs.

But that will cost you in not so cheap switch ports, of course.

Christian
> If there is a more appropriate venue for my request, please point me in
> that direction.
>
> Thanks,
> Dan


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications

