> -----Original Message-----
> From: Alex Gorbachev [mailto:ag@xxxxxxxxxxxxxxxxxxx]
> Sent: 28 June 2015 18:57
> To: Nick Fisk
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Redundant networks in Ceph
>
> Hi Nick,
>
> I know what you mean, no matter how hard you try something unexpected
> always happens. That said, I think OSD timeouts should be higher than HSRP
> and spanning tree convergence times, so I think it should survive most
> incidents that I can think of.
>
> So for high speed networking (40+ GbE) it seems the MLAG/bond solutions
> are the best option, due to their nonblocking nature and the requirement
> that the Ceph networks stay connected - i.e. one Layer 2 VLAN for ceph
> public, another Layer 2 VLAN for ceph cluster, and yet another Layer 2
> VLAN for iSCSI. I think what we saw was a physically defective port
> creating hangs in the switch, or something similar, so there are some
> timeouts from time to time. ARP proxies and routing issues had created
> similar incidents in the past at other sites.
>
> > The CRUSH map idea sounds interesting. But there are still concerns,
> > such as massive data relocations East-West (between racks in a
> > leaf-spine architecture such as
> > https://community.mellanox.com/docs/DOC-1475 ), should there be an
> > outage in the spine. Plus such issues are enormously hard to
> > troubleshoot.
>
> You can set the maximum CRUSH grouping that will allow OSDs to be marked
> out. You can use this to stop unwanted data movement from occurring
> during outages.
>
> Do you have a CRUSH map example by any chance?

You just need to set "mon osd downout subtree limit" in your ceph.conf.
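A minimal sketch of what that looks like, assuming the usual host/rack bucket
types in your CRUSH map (I believe "rack" is already the default, so treat the
value as an illustration rather than gospel):

    [mon]
        # If every OSD under a bucket of this type (or anything larger) goes
        # down, do not automatically mark those OSDs out - leave that to a human.
        mon osd downout subtree limit = rack

With that in place, losing an entire rack to a spine outage should leave its
OSDs "down" but not "out", so no mass East-West backfill starts until someone
marks them out deliberately. You can check what a running monitor is actually
using with something like:

    ceph daemon mon.<id> config get mon_osd_downout_subtree_limit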
>
> Ah, yeah, been there with LIO and ESXi and gave up on it. I found any pause
> longer than around 10 seconds would send both of them into a death spiral.
> I know you currently only see it due to some networking blip, but you will
> most likely also see it when disks fail, etc. For me, I couldn't have all
> my datastores going down every time something blipped or got a little
> slow. There are discussions ongoing about it on the Target mailing list,
> and Mike Christie from Red Hat is looking into the problem, so hopefully
> it will get sorted at some point. For what it's worth, both SCST and TGT
> seem to be immune from this.
>
> Odd thing is I can fail drives, switches and connections in a lab under a
> 16-stream workload from two VMs and never get a timeout like this. In our
> POC cloud environment though (which does have larger drives and more of
> them, and 9 VM hosts in 2 clusters vs. 2 VM hosts in 1 cluster for the
> lab), we do see these "abort-APD-PDL" storms that propagate to hostd
> hangs and all kinds of unpleasant consequences.
>
> I saw many patches slated for kernel 4.2 on the target-devel list and I
> have provided a lot of diagnostic data there, but can see RBD hangs at
> times in osdc like this:
>
> root@roc-4r-scd214:/sys/kernel/debug/ceph/8d5c925a-f6b9-4064-9ea7-f4770eca7247.client1615259# cat osdc
> 22143  osd11  11.e4c3492   rbd_data.319f32ae8944a.000000000000000b  read
> 23771  osd31  21.dda0f5af  rbd_data.b6ce62ae8944a.000000000000e7a8  set-alloc-hint,write
> 23782  osd31  11.e505a6ea  rbd_data.3dd222ae8944a.00000000000000c9  read
> 26228  osd2   11.ec37db43  rbd_data.319f32ae8944a.0000000000000006  read
> 26260  osd31  21.dda0f5af  rbd_data.b6ce62ae8944a.000000000000e7a8  set-alloc-hint,write
> 26338  osd31  11.e505a6ea  rbd_data.3dd222ae8944a.00000000000000c9  read
> root@roc-4r-scd214:/sys/kernel/debug/ceph/8d5c925a-f6b9-4064-9ea7-f4770eca7247.client1615259#
>
> So the ongoing discussion is: "is this Ceph being slow, or is LIO not
> resilient enough?" Ref:
>
> http://www.spinics.net/lists/target-devel/msg09311.html
>
> http://www.spinics.net/lists/target-devel/msg09682.html
>
> And especially, for the discussion about allowing iSCSI login during
> another hang: http://www.spinics.net/lists/target-devel/msg09687.html
> and http://www.spinics.net/lists/target-devel/msg09688.html
>
> > We'll replace the whole network, but I was thinking, having seen such
> > issues at a few other sites, if a "B-bus" for networking would be a good
> > design for OSDs. This approach is commonly used in traditional SANs,
> > where the "A bus" and "B bus" are not connected, so they cannot possibly
> > cross-contaminate in any way.
>
> Probably implementing something like multipathTCP would be the best bet
> to mirror the traditional dual-fabric SAN design.
>
> Assuming http://www.multipath-tcp.org/
> and http://lwn.net/Articles/544399/
>
> Looks very interesting.
>
> > > Nick
>
> > >> -----Original Message-----
> > >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > >> Of Alex Gorbachev
> > >> Sent: 27 June 2015 19:02
> > >> To: ceph-users@xxxxxxxxxxxxxx
> > >> Subject: Redundant networks in Ceph
> > >>
> > >> The current network design in Ceph
> > >> (http://ceph.com/docs/master/rados/configuration/network-config-ref)
> > >> uses nonredundant networks for both cluster and public communication.
> > >> Ideally, in a high-load environment these will be 10 or 40+ GbE
> > >> networks. For cost reasons, most such installations will use the same
> > >> switch hardware and separate Ceph traffic using VLANs.
> > >>
> > >> Networking is complex, and situations arise where switches and
> > >> routers drop traffic. We ran into one of those at one of our sites -
> > >> connections to hosts stay up (so bonding NICs does not help), yet OSD
> > >> communication gets disrupted, client IO hangs and failures cascade to
> > >> client applications.
> > >>
> > >> My understanding is that if OSDs cannot connect for some time over
> > >> the cluster network, that IO will hang and time out. The document
> > >> states:
> > >>
> > >> "If you specify more than one IP address and subnet mask for either
> > >> the public or the cluster network, the subnets within the network
> > >> must be capable of routing to each other."
> > >>
> > >> Which in the real world means a complicated Layer 3 routing setup and
> > >> is not practical in many configurations.
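For reference, the multi-subnet form the docs are describing is just a
comma-separated list in ceph.conf - a minimal sketch with made-up addresses,
not a claim that it solves the failover problem:

    [global]
        # two front-side and two back-side subnets; per the docs, the
        # subnets in each pair must still be able to route to each other
        public network  = 192.168.10.0/24, 192.168.20.0/24
        cluster network = 192.168.110.0/24, 192.168.120.0/24

That gets you addresses on more than one network, but as far as I know there
is no OSD/MON-level logic to retry the other subnet when one path fails, and
the routing requirement ties the two "fabrics" back together anyway.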
> > >>
> > >> What if there was an option for "cluster 2" and "public 2" networks,
> > >> to which OSDs and MONs would go either in active/backup or
> > >> active/active mode (cluster 1 and cluster 2 exist separately and do
> > >> not route to each other)?
> > >>
> > >> The difference between this setup and bonding is that here the
> > >> decision to fail over and try the other network is made at the
> > >> OSD/MON level, and it brings resilience to faults within the switch
> > >> core, which are really only detectable at the application layer.
> > >>
> > >> Am I missing an already existing feature? Please advise.
> > >>
> > >> Best regards,
> > >> Alex Gorbachev
> > >> Intelligent Systems Services Inc.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com