Re: Network redundancy pros and cons, best practice, suggestions?

Hello,

On Mon, 13 Apr 2015 11:03:24 +0200 Götz Reinicke - IT Koordinator wrote:

> Dear ceph users,
> 
> we are planning a ceph storage cluster from scratch. Might be up to 1 PB
> within the next 3 years, multiple buildings, new network infrastructure
> for the cluster etc.
> 
> I had some excellent training on Ceph, so the essential fundamentals
> are familiar to me, and I know our goals/dreams can be reached. :)
> 
> There is just "one tiny piece" in the design I'm currently unsure
> about :)
> 
> Ceph follows some sort of "keep it small and simple" approach, e.g. don't
> use RAID controllers, use more boxes and disks, a fast network etc.
> 
While small and plentiful is definitely the norm, some people actually put
their OSDs on RAID (RAID1, for example) to avoid ever having to deal with a
failed OSD, ending up with an effective 4x replication (a mirror under each
OSD combined with a pool size of 2).
Your needs and budget may of course differ.

> So from our current design we plan 40Gb Storage and Client LAN.
> 
> Would you suggest connecting the OSD nodes redundantly to both networks?
> That would end up with 4 * 40Gb ports in each box, two Switches to
> connect to.
> 
If you can afford it, fabric switches are quite nice, as they allow for
LACP across 2 switches: when everything is working you get twice the speed,
and when it isn't you still have full redundancy. The Brocade VDX line comes
to mind.
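
To make that concrete, here is a minimal sketch of such an LACP bond on a
Debian-style OSD node (ifupdown + ifenslave); the interface names and
addresses are made-up examples, and 802.3ad across two switches of course
requires the switches themselves to support multi-chassis LAG:

    # One 40GbE port to each fabric switch, bonded with LACP (802.3ad)
    auto bond0
    iface bond0 inet static
        address 192.168.1.10
        netmask 255.255.255.0
        bond-slaves eth2 eth3
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4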

However, if you're not tied to an Ethernet network, you might do better and
cheaper with an InfiniBand network on the storage side of things.
This will become even more attractive as RDMA support in Ceph improves.

Separating public (client) and private (storage, OSD interconnect)
networks with Ceph only makes sense if your storage nodes can actually
utilize all that bandwidth.

So at your storage node density of 12 HDDs (16-HDD chassis are not space
efficient), 40GbE is overkill with a single link/network, and insanely so
with 2 networks.
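
For reference, if you do go with the split anyway, the two networks are
just a ceph.conf setting on every node; the subnets below are placeholders:

    [global]
        # client and MON traffic
        public network  = 192.168.0.0/24
        # OSD replication and recovery traffic
        cluster network = 192.168.1.0/24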

> I'd think of OSD nodes with 12 - 16 * 4TB SATA disks for "high" IO
> pools. (+ currently SSD for journal, but maybe, by the time we start,
> LevelDB/RocksDB are ready ... ?)
> 
> Later some less IO-bound pools for data archiving/backup (bigger and
> more disks per node).
> 
> We would also do some Cache tiering for some pools.
> 
> From the HP, Intel, Supermicro etc. reference documentation, they
> usually use a non-redundant network connection. (single 10Gb)
> 
> I know: redundancy keeps some headaches small, but also adds some more
> complexity and increases the budget. (add network adapters, other
> server, more switches, etc)
> 
Complexity not so much, cost yes.

> So what would you suggest, what are your experiences?
> 
It all depends on how small (or rather, how large) you can start.

I only have small clusters with few nodes, so for me redundancy is a big
deal.
Thus those clusters use InfiniBand, 2 switches and dual-port HCAs on the
nodes in an active-standby mode.
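
The active-standby part is again just Linux bonding, this time in plain
active-backup mode over the two IPoIB ports (the balancing/LACP modes
aren't really an option there); names and addresses are again made up:

    # One IPoIB port to each IB switch, simple failover only
    auto bond0
    iface bond0 inet static
        address 10.0.1.10
        netmask 255.255.255.0
        bond-slaves ib0 ib1
        bond-mode active-backup
        bond-miimon 100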

If you however can start with something like 10 racks (ToR switches),
losing one switch would mean losing 10% of your cluster, which is
something it should be able to cope with.
Especially if you configured Ceph to _not_ start re-balancing data
automatically if a rack goes down (so that you have a chance to put a
replacement switch in place, which you of course kept handy on-site for
such a case). ^.-
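
To sketch what that last bit can look like in ceph.conf (this assumes your
CRUSH map actually uses the rack bucket level):

    [global]
        # don't automatically mark OSDs out when a whole rack-sized subtree goes down
        mon osd down out subtree limit = rack
        # seconds a down OSD may stay down before it is marked out and re-balancing starts
        mon osd down out interval = 600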

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/