> -----Original Message-----
> From: Alex Gorbachev [mailto:ag@xxxxxxxxxxxxxxxxxxx]
> Sent: 28 June 2015 18:57
> To: Nick Fisk
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Redundant networks in Ceph
>
> Hi Nick,
>
> I know what you mean, no matter how hard you try something unexpected
> always happens. That said, I think OSD timeouts should be higher than HSRP
> and spanning tree convergence times, so I think it should survive most
> incidents that I can think of.
>
> So for high speed networking (40+ GbE) it seems the MLAG/bond solutions
> are the best option, due to their nonblocking nature and the requirement
> that the Ceph networks stay connected - i.e. one Layer 2 VLAN for ceph
> public, another Layer 2 VLAN for ceph cluster, and yet another Layer 2
> VLAN for iSCSI. I think what we saw was a physically defective port
> creating hangs in the switch, or something similar, so there are some
> timeouts from time to time. ARP proxies and routing issues had created
> similar incidents in the past at other sites.
>
> > The CRUSH map idea sounds interesting. But there are still concerns,
> > such as massive data relocations East-West (between racks in a
> > leaf-spine architecture such as
> > https://community.mellanox.com/docs/DOC-1475 ), should there be an
> > outage in the spine. Plus such issues are enormously hard to
> > troubleshoot.
>
> You can set the maximum CRUSH grouping that will allow OSDs to be marked
> out. You can use this to stop unwanted data movement from occurring
> during outages.
>
> Do you have a CRUSH map example by any chance?

You just need to set "mon osd downout subtree limit" in your ceph.conf.
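A minimal sketch of what that looks like, assuming the usual host/rack bucket
types in your CRUSH map (I believe "rack" is already the default, so treat the
value as an illustration rather than gospel):

    [mon]
        # If every OSD under a bucket of this type (or anything larger) goes
        # down, do not automatically mark those OSDs out - leave that to a human.
        mon osd downout subtree limit = rack

With that in place, losing an entire rack to a spine outage should leave its
OSDs "down" but not "out", so no mass East-West backfill starts until someone
marks them out deliberately. You can check what a running monitor is actually
using with something like:

    ceph daemon mon.<id> config get mon_osd_downout_subtree_limit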
>
> Ah, yeah, been there with LIO and ESXi and gave up on it. I found any pause
> longer than around 10 seconds would send both of them into a death spiral.
> I know you currently only see it due to some networking blip, but you will
> most likely also see it when disks fail, etc. For me, I couldn't have all
> my datastores going down every time something blipped or got a little
> slow. There are discussions ongoing about it on the Target mailing list,
> and Mike Christie from Red Hat is looking into the problem, so hopefully
> it will get sorted at some point. For what it's worth, both SCST and TGT
> seem to be immune from this.
>
> Odd thing is I can fail drives, switches and connections in a lab under a
> 16-stream workload from two VMs and never get a timeout like this. In our
> POC cloud environment though (which does have larger drives and more of
> them, and 9 VM hosts in 2 clusters vs. 2 VM hosts in 1 cluster for the
> lab), we do see these "abort-APD-PDL" storms that propagate to hostd
> hangs and all kinds of unpleasant consequences.
>
> I saw many patches slated for kernel 4.2 on the target-devel list and I
> have provided a lot of diagnostic data there, but can see RBD hangs at
> times in osdc like this:
>
> root@roc-4r-scd214:/sys/kernel/debug/ceph/8d5c925a-f6b9-4064-9ea7-f4770eca7247.client1615259# cat osdc
> 22143  osd11  11.e4c3492   rbd_data.319f32ae8944a.000000000000000b  read
> 23771  osd31  21.dda0f5af  rbd_data.b6ce62ae8944a.000000000000e7a8  set-alloc-hint,write
> 23782  osd31  11.e505a6ea  rbd_data.3dd222ae8944a.00000000000000c9  read
> 26228  osd2   11.ec37db43  rbd_data.319f32ae8944a.0000000000000006  read
> 26260  osd31  21.dda0f5af  rbd_data.b6ce62ae8944a.000000000000e7a8  set-alloc-hint,write
> 26338  osd31  11.e505a6ea  rbd_data.3dd222ae8944a.00000000000000c9  read
> root@roc-4r-scd214:/sys/kernel/debug/ceph/8d5c925a-f6b9-4064-9ea7-f4770eca7247.client1615259#
>
> So the ongoing discussion is: "is this Ceph being slow, or is LIO not
> resilient enough?" Ref:
>
> http://www.spinics.net/lists/target-devel/msg09311.html
>
> http://www.spinics.net/lists/target-devel/msg09682.html
>
> And especially, for the discussion about allowing iSCSI login during
> another hang: http://www.spinics.net/lists/target-devel/msg09687.html
> and http://www.spinics.net/lists/target-devel/msg09688.html
>
> > We'll replace the whole network, but I was thinking, having seen such
> > issues at a few other sites, if a "B-bus" for networking would be a good
> > design for OSDs. This approach is commonly used in traditional SANs,
> > where the "A bus" and "B bus" are not connected, so they cannot possibly
> > cross-contaminate in any way.
>
> Probably implementing something like multipathTCP would be the best bet
> to mirror the traditional dual-fabric SAN design.
>
> Assuming http://www.multipath-tcp.org/
> and http://lwn.net/Articles/544399/
>
> Looks very interesting.
>
> > > Nick
>
> > >> -----Original Message-----
> > >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > >> Of Alex Gorbachev
> > >> Sent: 27 June 2015 19:02
> > >> To: ceph-users@xxxxxxxxxxxxxx
> > >> Subject: Redundant networks in Ceph
> > >>
> > >> The current network design in Ceph
> > >> (http://ceph.com/docs/master/rados/configuration/network-config-ref)
> > >> uses nonredundant networks for both cluster and public communication.
> > >> Ideally, in a high-load environment these will be 10 or 40+ GbE
> > >> networks. For cost reasons, most such installations will use the same
> > >> switch hardware and separate Ceph traffic using VLANs.
> > >>
> > >> Networking is complex, and situations arise where switches and
> > >> routers drop traffic. We ran into one of those at one of our sites -
> > >> connections to hosts stay up (so bonding NICs does not help), yet OSD
> > >> communication gets disrupted, client IO hangs and failures cascade to
> > >> client applications.
> > >>
> > >> My understanding is that if OSDs cannot connect for some time over
> > >> the cluster network, that IO will hang and time out. The document
> > >> states:
> > >>
> > >> "If you specify more than one IP address and subnet mask for either
> > >> the public or the cluster network, the subnets within the network
> > >> must be capable of routing to each other."
> > >>
> > >> Which in the real world means a complicated Layer 3 routing setup and
> > >> is not practical in many configurations.
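For reference, the multi-subnet form the docs are describing is just a
comma-separated list in ceph.conf - a minimal sketch with made-up addresses,
not a claim that it solves the failover problem:

    [global]
        # two front-side and two back-side subnets; per the docs, the
        # subnets in each pair must still be able to route to each other
        public network  = 192.168.10.0/24, 192.168.20.0/24
        cluster network = 192.168.110.0/24, 192.168.120.0/24

That gets you addresses on more than one network, but as far as I know there
is no OSD/MON-level logic to retry the other subnet when one path fails, and
the routing requirement ties the two "fabrics" back together anyway.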
> > >>
> > >> What if there was an option for "cluster 2" and "public 2" networks,
> > >> to which OSDs and MONs would go either in active/backup or
> > >> active/active mode (cluster 1 and cluster 2 exist separately and do
> > >> not route to each other)?
> > >>
> > >> The difference between this setup and bonding is that here the
> > >> decision to fail over and try the other network is made at the
> > >> OSD/MON level, and it brings resilience to faults within the switch
> > >> core, which are really only detectable at the application layer.
> > >>
> > >> Am I missing an already existing feature? Please advise.
> > >>
> > >> Best regards,
> > >> Alex Gorbachev
> > >> Intelligent Systems Services Inc.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com