Re: Redundant networks in Ceph

> -----Original Message-----
> From: Alex Gorbachev [mailto:ag@xxxxxxxxxxxxxxxxxxx]
> Sent: 27 June 2015 21:55
> To: Nick Fisk
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Redundant networks in Ceph
> 
> Hi Nick,
> 
> Thank you for writing back:
> 
> > I think the answer is you do 1 of 2 things. You either design your
> > network so that it is fault tolerant in every way, so that network
> > interruption is not possible, or you go with non-redundant networking
> > and design your CRUSH map around the failure domains of the network.
> 
> We'll redesign the network shortly - the general problem is that I am
> finding it possible, even in well-designed redundant networks, for packet
> loss to occur for various reasons (maintenance, cables, protocol issues,
> etc.).  So while there is no interruption (defined as 100% service loss),
> there can be occasional packet loss and high-latency situations, even
> when the backbone is very fast.

I know what you mean; no matter how hard you try, something unexpected always happens. That said, OSD timeouts should be higher than HSRP and spanning-tree convergence times, so Ceph should survive most incidents I can think of.
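
For reference, these are roughly the timers in play (the values below are the stock defaults as I understand them - check them against your release before tuning); raising the heartbeat grace above your worst-case convergence time is the obvious lever:

    [osd]
    # how often an OSD pings its peers (seconds)
    osd heartbeat interval = 6
    # how long without a heartbeat before a peer is reported down
    osd heartbeat grace = 20

    [mon]
    # how long a down OSD stays "in" before being marked out
    mon osd down out interval = 300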

> 
> The CRUSH map idea sounds interesting.  But there are still concerns, such as
> massive East-West data relocations (between racks in a leaf-spine
> architecture such as https://community.mellanox.com/docs/DOC-1475),
> should there be an outage in the spine.  Plus such issues are enormously
> hard to troubleshoot.

You can set the maximum CRUSH grouping at which OSDs will be automatically marked out. You can use this to stop unwanted data movement from occurring during outages.
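
Something like this in ceph.conf is what I mean (a sketch - check the option against your release; I believe rack is already the default):

    [mon]
    # don't automatically mark OSDs out when an entire CRUSH subtree of
    # this type (or larger) goes down, e.g. a whole rack behind a dead
    # spine uplink
    mon osd down out subtree limit = rack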

> 
> > I'm interested in your example of where OSDs were unable to
> > communicate.  What happened? Would it be possible to redesign the
> > network to stop this happening?
> 
> Our SuperCore design uses Ceph OSD nodes to provide storage to LIO Target
> iSCSI nodes, which then deliver it to ESXi hosts.  LIO is sensitive to hangs,
> and often we see an RBD hang translate into an iSCSI timeout, which causes
> ESXi to abort connections, hang, and crash applications.  This only happens
> at one site, where there is likely a switch issue somewhere.  These issues
> are sporadic and come and go in storms - so far all Ceph analysis has
> pointed to network disruptions, from which the RBD client is unable to
> recover.  The network vendor still cannot find anything wrong.

Ah, yeah, I've been there with LIO and ESXi and gave up on it. I found any pause longer than around 10 seconds would send both of them into a death spiral. I know you currently only see it due to networking blips, but you will most likely also see it when disks fail, etc. For me, I couldn't have all my datastores going down every time something blipped or got a little slow. There are ongoing discussions about it on the Target mailing list, and Mike Christie from Red Hat is looking into the problem, so hopefully it will get sorted at some point. For what it's worth, both SCST and TGT seem to be immune to this.

> 
> We'll replace the whole network, but having seen such issues at a few
> other sites, I was wondering whether a "B bus" for networking would be a
> good design for OSDs.  This approach is commonly used in traditional SANs,
> where the "A bus" and "B bus" are not connected, so they cannot possibly
> cross-contaminate in any way.

Probably implementing something like Multipath TCP would be the best bet to mirror the traditional dual-fabric SAN design.
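
To illustrate the idea (purely hypothetical - none of the "2" options below exist in Ceph today; only public network and cluster network are real), the B-bus proposal would amount to something like this in ceph.conf:

    [global]
    # real options: fabric A
    public network  = 10.1.0.0/24
    cluster network = 10.2.0.0/24
    # hypothetical "B bus" options (do not exist in Ceph) - an unrouted
    # second fabric that OSDs and MONs would fail over to at the
    # application layer
    public network 2  = 10.3.0.0/24
    cluster network 2 = 10.4.0.0/24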

> 
> Another reference is multipathing, where IO can be sent via redundant
> paths - most storage vendors recommend application (higher) level
> multipathing (aka MPIO) over network redundancy (such as bonding).  We
> find this to be a valid recommendation, as clients run into fewer issues.
> Somewhat related:
> http://serverfault.com/questions/510882/why-mpio-instead-of-802-3ad-team-for-iscsi
> To quote - "MPIO detects and handles path failures, whereas 802.3ad can
> only compensate for a link failure".
> 
> I see OSD connections as paths, rather than links, as these are higher level
> object storage exchanges.
> 
> Thank you,
> Alex
> 
> >
> > Nick
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >> Of Alex Gorbachev
> >> Sent: 27 June 2015 19:02
> >> To: ceph-users@xxxxxxxxxxxxxx
> >> Subject:  Redundant networks in Ceph
> >>
> >> The current network design in Ceph
> >> (http://ceph.com/docs/master/rados/configuration/network-config-ref)
> >> uses nonredundant networks for both cluster and public communication.
> >> Ideally, in a high-load environment these will be 10 or 40+ GbE networks.
> >> For cost reasons, most such installations will use the same switch
> >> hardware and separate Ceph traffic using VLANs.
> >>
> >> Networking is complex, and situations are possible when switches and
> >> routers drop traffic.  We ran into one of those at one of our sites -
> >> connections to hosts stay up (so bonding NICs does not help), yet OSD
> >> communication gets disrupted, client IO hangs, and failures cascade to
> >> client applications.
> >>
> >> My understanding is that if OSDs cannot connect for some time over
> >> the cluster network, IO will hang and time out.  The document states:
> >>
> >> "If you specify more than one IP address and subnet mask for either
> >> the public or the cluster network, the subnets within the network
> >> must be capable of routing to each other."
> >>
> >> Which in the real world means a complicated Layer 3 routing setup,
> >> and is not practical in many configurations.
> >>
> >> What if there was an option for "cluster 2" and "public 2" networks, to
> >> which OSDs and MONs would go in either active/backup or active/active
> >> mode (cluster 1 and cluster 2 exist separately and do not route to each
> >> other)?
> >>
> >> The difference between this setup and bonding is that here the decision
> >> to fail over and try the other network is at the OSD/MON level, and it
> >> brings resilience to faults within the switch core, which are really
> >> only detectable at the application layer.
> >>
> >> Am I missing an already existing feature?  Please advise.
> >>
> >> Best regards,
> >> Alex Gorbachev
> >> Intelligent Systems Services Inc.
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> >




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


