Re: Have 2 different public networks

Thought I'd share details of my setup, as I am effectively achieving this (i.e. making monitors accessible over multiple interfaces) with IP routing, as follows:

My Ceph hosts each have a /32 IP address on a loopback interface, and that is the IP address that all their Ceph daemons are bound to. In ceph.conf I do that by setting all of the following values to the host's loopback IP: "mon addr" in the [mon.x] sections; "cluster addr" and "public addr" in the [mds.x] sections; and "cluster addr", "public addr", and "osd heartbeat addr" in the [osd.x] sections. I then use IP routing to ensure that the hosts can all reach each other's loopback IPs, and that clients can reach them too, over the relevant networks. This also allows inter-host traffic to fail over to alternate paths if the normal path is down for some reason.
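
For illustration, a minimal ceph.conf sketch for one host might look something like the following - the hostname, daemon IDs and the 10.255.0.1 loopback address are placeholders I've made up here, not my real values:

    # node1's loopback IP, added with something like:
    #   ip addr add 10.255.0.1/32 dev lo

    [mon.node1]
        host     = node1
        mon addr = 10.255.0.1

    [mds.node1]
        host         = node1
        public addr  = 10.255.0.1
        cluster addr = 10.255.0.1

    [osd.0]
        host               = node1
        public addr        = 10.255.0.1
        cluster addr       = 10.255.0.1
        osd heartbeat addr = 10.255.0.1

With every daemon bound to the loopback address like this, which physical interface the traffic actually uses becomes purely a routing decision.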

To put this in context, my cluster is just a small 3-node cluster, and I have a pair of layer 3 switches, with networking arranged like this (a hypothetical addressing sketch for node1 follows the list):

node1:

has an 8 Gbps point-to-point routed link to node2 (and uses this link to communicate with node2 under normal circumstances)
has an 8 Gbps point-to-point routed link to node3 (and uses this link to communicate with node3 under normal circumstances)
has a 2 Gbps routed link to the primary layer 3 switch (clients which are not other nodes in the cluster use this link to communicate with node1 under normal circumstances, and traffic to nodes 2 and 3 will switch to using this if their respective point-to-point links go down)
has a 1 Gbps routed link to the secondary layer 3 switch (this is not used under normal circumstances - only if the 2 Gbps link goes down)

node2:

has an 8 Gbps point-to-point routed link to node3 (and uses this link to communicate with node3 under normal circumstances)
has an 8 Gbps point-to-point routed link to node1 (and uses this link to communicate with node1 under normal circumstances)
has a 2 Gbps routed link to the primary layer 3 switch (clients which are not other nodes in the cluster use this link to communicate with node2 under normal circumstances, and traffic to nodes 3 and 1 will switch to using this if their respective point-to-point links go down)
has a 1 Gbps routed link to the secondary layer 3 switch (this is not used under normal circumstances - only if the 2 Gbps link goes down)

node3:

has an 8 Gbps point-to-point routed link to node1 (and uses this link to communicate with node1 under normal circumstances)
has an 8 Gbps point-to-point routed link to node2 (and uses this link to communicate with node2 under normal circumstances)
has a 2 Gbps routed link to the primary layer 3 switch (clients which are not other nodes in the cluster use this link to communicate with node3 under normal circumstances, and traffic to nodes 1 and 2 will switch to using this if their respective point-to-point links go down)
has a 1 Gbps routed link to the secondary layer 3 switch (this is not used under normal circumstances - only if the 2 Gbps link goes down)
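
To make that concrete, a hypothetical addressing plan for node1 could look like this (interface names and subnets invented purely for illustration):

    lo      10.255.0.1/32      loopback IP all of node1's Ceph daemons bind to
    eth1    192.168.12.1/30    8 Gbps point-to-point link to node2
    eth2    192.168.13.1/30    8 Gbps point-to-point link to node3
    eth3    192.168.101.1/24   2 Gbps uplink to the primary layer 3 switch
    eth4    192.168.102.1/24   1 Gbps uplink to the secondary layer 3 switch

node2 and node3 are laid out the same way, each with their own /32 loopback IP (say 10.255.0.2 and 10.255.0.3).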

I have avoided using a proper routing protocol for this, as the failover still works automatically when links go down, even with static routes (see the sketch below). I do also have scripts running on the hosts that detect when the device at the other end of a link is not pingable even though the link is up, and dynamically remove/insert the routes as necessary in that situation. But adapting this approach to a larger cluster, where point-to-point links between all hosts aren't viable, might well warrant the use of a routing protocol.
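
As a rough sketch of what the static routes on node1 then look like, continuing with the made-up addresses above (192.168.12.2 and 192.168.13.2 being the far ends of the point-to-point links, and 192.168.101.254 the primary switch):

    # preferred paths to node2's and node3's loopback IPs, over the point-to-point links
    ip route add 10.255.0.2/32 via 192.168.12.2 metric 10
    ip route add 10.255.0.3/32 via 192.168.13.2 metric 10
    # fallback paths via the primary layer 3 switch, with a worse metric
    ip route add 10.255.0.2/32 via 192.168.101.254 metric 100
    ip route add 10.255.0.3/32 via 192.168.101.254 metric 100

While a point-to-point link is usable, the lower-metric route wins; when it goes down (or the ping scripts mentioned above remove the route because the far end has stopped responding), traffic falls back to the higher-metric route via the switch.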

The end result is that I have more control over where the different traffic goes, and I can mess around with the networking without any effect on the cluster.

Alex

On 20/12/2014 5:23 AM, Craig Lewis wrote:


On Fri, Dec 19, 2014 at 6:19 PM, Francois Lafont <flafdivers@xxxxxxx> wrote:

So, indeed, I have to use routing *or* maybe create 2 monitors
per server, like this:

[mon.node1-public1]
    host     = ceph-node1
    mon addr = 10.0.1.1

[mon.node1-public2]
    host     = ceph-node1
    mon addr = 10.0.2.1

# etc...

But, in this case, the working directories of mon.node1-public1
and mon.node1-public2 will be on the same disk (I have no
choice). Is that a problem? Are monitors big consumers of disk I/O?


Interesting idea.  While you will have an even number of monitors, you'll still have an odd number of failure domains.  I'm not sure if it'll work though... make sure you test having the leader on both networks.  It might cause problems if the leader is on the 10.0.1.0/24 network?

Monitors can be big consumers of disk IO if there is a lot of cluster activity. Monitors record all of the cluster changes in LevelDB, and send copies to all of the daemons. There have been posts to the ML about people running out of disk IOps on the monitors, and the problems it causes. The bigger the cluster, the more IOps. As long as you monitor and alert on your monitor disk IOps, I don't think it would be a problem.


