Re: Redundant networks in Ceph

Hi Nick,

I know what you mean; no matter how hard you try, something unexpected always happens. That said, OSD timeouts should be higher than HSRP and spanning-tree convergence times, so the cluster should survive most incidents I can think of.
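
As a minimal sketch of the knobs involved (the option names are the standard ones, but the values below are purely illustrative, not recommendations):

    [osd]
        # Peers report an OSD down if they get no heartbeat for this many
        # seconds; keep it above HSRP hold time (~10 s) and spanning-tree
        # convergence so a reconvergence does not flap OSDs.
        osd heartbeat grace = 30

    [mon]
        # How long a down OSD stays "in" before being marked out and
        # rebalancing starts; gives the network time to recover first.
        mon osd down out interval = 600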

So for high-speed networking (40+ GbE) it seems the MLAG/bond solutions are the best option, due to their non-blocking nature and the requirement that the Ceph networks stay connected - i.e. one Layer 2 VLAN for the Ceph public network, another for the Ceph cluster network, and yet another for iSCSI.  I think what we saw was a physically defective port creating hangs in the switch, or something similar, which produced occasional timeouts.  ARP proxying and routing issues have created similar incidents at other sites in the past.
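
For what it's worth, a rough sketch of that layout on a Debian-style host (the interface names, VLAN IDs and addresses are invented for illustration):

    # /etc/network/interfaces (fragment)
    auto bond0
    iface bond0 inet manual
        bond-slaves enp3s0f0 enp3s0f1
        bond-mode 802.3ad          # LACP towards the MLAG pair
        bond-miimon 100
        bond-lacp-rate fast

    auto bond0.100                 # Ceph public VLAN
    iface bond0.100 inet static
        address 192.168.100.11
        netmask 255.255.255.0

    auto bond0.200                 # Ceph cluster VLAN
    iface bond0.200 inet static
        address 192.168.200.11
        netmask 255.255.255.0

with the matching subnets in ceph.conf (public network = 192.168.100.0/24, cluster network = 192.168.200.0/24).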
 

>
> The CRUSH map idea sounds interesting.  But there are still concerns, such as
> massive data relocations East-West (between racks in a leaf-spine
> architecture such as https://community.mellanox.com/docs/DOC-1475),
> should there be an outage in the spine.  Plus such issues are enormously
> hard to troubleshoot.

You can set the maximum CRUSH grouping that will allow OSDs to be marked out. You can use this to stop unwanted data movement from occurring during outages.
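
If the setting in question is mon_osd_down_out_subtree_limit, a minimal ceph.conf sketch (the right value depends on your CRUSH hierarchy) would be:

    [mon]
        # If an entire CRUSH subtree of this type (or larger) goes down,
        # do not automatically mark its OSDs out, so e.g. losing a whole
        # rack to a network outage does not trigger mass rebalancing.
        mon osd down out subtree limit = rack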

Do you have a CRUSH map example by any chance?
 

Ah, yeah, been there with LIO and ESXi and gave up on it. I found any pause longer than around 10 seconds would send both of them into a death spiral. I know you currently only see it due to some networking blip, but you will most likely also see it when disks fail, etc. For me, I couldn't have all my datastores going down every time something blipped or got a little slow. There are ongoing discussions about it on the target mailing list and Mike Christie from Red Hat is looking into the problem, so hopefully it will get sorted at some point. For what it's worth, both SCST and TGT seem to be immune to this.


The odd thing is that I can fail drives, switches and connections in a lab under a 16-stream workload from two VMs and never get a timeout like this.  In our POC cloud environment, though (which does have larger drives and more of them, and 9 VM hosts in 2 clusters vs. 2 VM hosts in 1 cluster for the lab), we do see these "abort-APD-PDL" storms that propagate into hostd hangs and all kinds of unpleasant consequences.

I saw many patches slated for kernel 4.2 on the target-devel list and have provided a lot of diagnostic data there, but I can see RBD hangs at times in osdc (the kernel client's list of in-flight OSD requests), like this:

root@roc-4r-scd214:/sys/kernel/debug/ceph/8d5c925a-f6b9-4064-9ea7-f4770eca7247.client1615259# cat osdc
22143   osd11   11.e4c3492      rbd_data.319f32ae8944a.000000000000000b        read
23771   osd31   21.dda0f5af     rbd_data.b6ce62ae8944a.000000000000e7a8        set-alloc-hint,write
23782   osd31   11.e505a6ea     rbd_data.3dd222ae8944a.00000000000000c9        read
26228   osd2    11.ec37db43     rbd_data.319f32ae8944a.0000000000000006        read
26260   osd31   21.dda0f5af     rbd_data.b6ce62ae8944a.000000000000e7a8        set-alloc-hint,write
26338   osd31   11.e505a6ea     rbd_data.3dd222ae8944a.00000000000000c9        read
root@roc-4r-scd214:/sys/kernel/debug/ceph/8d5c925a-f6b9-4064-9ea7-f4770eca7247.client1615259#

So the ongoing discussion is: "is this Ceph being slow, or LIO not being resilient enough?"  Ref:

http://www.spinics.net/lists/target-devel/msg09311.html

http://www.spinics.net/lists/target-devel/msg09682.html

And especially the discussion about allowing iSCSI login during another hang: http://www.spinics.net/lists/target-devel/msg09687.html and http://www.spinics.net/lists/target-devel/msg09688.html

>
> We'll replace the whole network, but I was thinking, having seen such issues
> at a few other sites, whether a "B-bus" for networking would be a good design
> for OSDs.  This approach is commonly used in traditional SANs, where the "A
> bus" and "B bus" are not connected, so they cannot possibly cross-contaminate
> in any way.

Probably implementing something like Multipath TCP would be the best bet to mirror the traditional dual-fabric SAN design.

Assuming you mean http://www.multipath-tcp.org/ and http://lwn.net/Articles/544399/ - that looks very interesting.
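
Purely as a sketch of the idea (this assumes the out-of-tree MPTCP kernel from multipath-tcp.org - stock kernels of this vintage have no MPTCP support - and uses that project's sysctls with invented interface names and addresses):

    # enable MPTCP and advertise every local address as a potential subflow
    sysctl -w net.mptcp.mptcp_enabled=1
    sysctl -w net.mptcp.mptcp_path_manager=fullmesh

    # "A fabric" and "B fabric" addresses on separate, non-routed networks;
    # an MPTCP-capable peer would keep a subflow open over each
    ip addr add 10.10.1.11/24 dev bond0.110
    ip addr add 10.20.1.11/24 dev bond1.210

The appeal is that this stays transparent to the application: the OSDs would keep using ordinary TCP sockets while the kernel maintains a subflow over each fabric and fails over between them.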
 
>
> >
> > Nick
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >> Of Alex Gorbachev
> >> Sent: 27 June 2015 19:02
> >> To: ceph-users@xxxxxxxxxxxxxx
> >> Subject: Redundant networks in Ceph
> >>
> >> The current network design in Ceph
> >> (http://ceph.com/docs/master/rados/configuration/network-config-ref)
> >> uses nonredundant networks for both cluster and public communication.
> >> Ideally, in a high load environment these will be 10 or 40+ GbE networks.
> >> For cost reasons, most such installations will use the same switch
> >> hardware and separate Ceph traffic using VLANs.
> >>
> >> Networking is complex, and situations are possible when switches and
> >> routers drop traffic.  We ran into one of those at one of our sites -
> >> connections to hosts stay up (so bonding NICs does not help), yet OSD
> >> communication gets disrupted, client IO hangs, and failures cascade to
> >> client applications.
> >>
> >> My understanding is that if OSDs cannot connect for some time over
> >> the cluster network, IO will hang and time out.  The document states:
> >>
> >> "If you specify more than one IP address and subnet mask for either
> >> the public or the cluster network, the subnets within the network
> >> must be capable of routing to each other."
> >>
> >> Which in the real world means a complicated Layer 3 routing setup and
> >> is not practical in many configurations.
> >>
> >> What if there was an option for "cluster 2" and "public 2" networks, to
> >> which OSDs and MONs would go in either active/backup or active/active mode
> >> (cluster 1 and cluster 2 exist separately and do not route to each other)?
> >>
> >> The difference between this setup and bonding is that here the decision
> >> to fail over and try the other network is made at the OSD/MON level, and
> >> it brings resilience to faults within the switch core, which are really
> >> only detectable at the application layer.
> >>
> >> Am I missing an already existing feature?  Please advise.
> >>
> >> Best regards,
> >> Alex Gorbachev
> >> Intelligent Systems Services Inc.
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> >





_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
