Re: Redundant networks in Ceph

Hi Nick,

Thank you for writing back:

> I think the answer is you do 1 of 2 things. You either design your network
> so that it is fault tolerant in every way so that network interruption is
> not possible. Or go with non-redundant networking, but design your crush map
> around the failure domains of the network.

We'll redesign the network shortly.  The general problem I am finding
is that even well designed redundant networks can suffer packet loss
for various reasons (maintenance, cabling, protocol issues, etc.).  So
while there is no outright interruption (defined as 100% service
loss), there can still be occasional packet loss and high latency
episodes, even when the backbone is very fast.

The CRUSH map idea sounds interesting.  But there are still concerns,
such as massive East-West data relocation (between racks in a
leaf-spine architecture such as
https://community.mellanox.com/docs/DOC-1475) should there be an
outage in the spine.  Plus such issues are enormously hard to
troubleshoot.
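
If I understand the suggestion correctly, it would look roughly like
the sketch below - rack buckets mirroring the leaf switches, and a
rule that spreads replicas across racks (bucket and host names are
made up for illustration):

  # one CRUSH rack bucket per leaf/rack, with OSD hosts moved underneath
  ceph osd crush add-bucket rack1 rack
  ceph osd crush add-bucket rack2 rack
  ceph osd crush move rack1 root=default
  ceph osd crush move rack2 root=default
  ceph osd crush move osd-node-a rack=rack1
  ceph osd crush move osd-node-b rack=rack2

  # rule in decompiled crushmap syntax: place each replica in a different
  # rack, so losing one leaf/rack still leaves other copies reachable
  rule replicated_racks {
          ruleset 1
          type replicated
          min_size 1
          max_size 10
          step take default
          step chooseleaf firstn 0 type rack
          step emit
  }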

> I'm interested in your example of where OSDs were unable to communicate.
> What happened? Would it be possible to redesign the network to stop this
> happening?

Our SuperCore design uses Ceph OSD nodes to provide storage to LIO
Target iSCSI nodes, which then deliver it to ESXi hosts.  LIO is
sensitive to hangs, and we often see an RBD hang translate into an
iSCSI timeout, which causes ESXi to abort connections and hang or
crash applications.  This only happens at one site, where there is
likely a switch issue somewhere.  The issues are sporadic and come and
go in storms - so far all Ceph analysis has pointed to network
disruptions from which the RBD client is unable to recover.  The
network vendor still cannot find anything wrong.
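
For reference, the knobs that control how quickly the cluster notices
and reports such disruptions are the OSD heartbeat settings.  A rough
ceph.conf sketch (values are purely illustrative, not a
recommendation):

  [osd]
      # how often an OSD pings its peers
      osd heartbeat interval = 6
      # how long a peer may stay silent before it is reported down to the mons
      osd heartbeat grace = 20
  [mon]
      # how long a down OSD stays "in" before recovery/backfill starts
      mon osd down out interval = 300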

We'll replace the whole network, but having seen such issues at a few
other sites, I was wondering whether a "B bus" for networking would be
a good design for OSDs.  This approach is commonly used in traditional
SANs, where the "A bus" and "B bus" are not connected, so they cannot
possibly cross-contaminate in any way.
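
For context, today's ceph.conf only lets us declare a single public
and a single cluster network, roughly like this (subnets are made up):

  [global]
      # client-facing fabric - the "A bus" in SAN terms
      public network  = 10.10.1.0/24
      # replication fabric - still one bus, with no independent "B bus"
      cluster network = 10.10.2.0/24

What I have in mind is a second, fully disjoint pair of these that the
daemons could fail over to on their own.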

Another reference is multipathing, where IO can be sent via redundant
paths - most storage vendors recommend application (higher) level
multipathing (aka MPIO) over network redundancy (such as bonding).  We
find this to be a valid recommendation, as clients run into fewer
issues.  Somewhat related:
http://serverfault.com/questions/510882/why-mpio-instead-of-802-3ad-team-for-iscsi
- to quote, "MPIO detects and handles path failures, whereas 802.3ad
can only compensate for a link failure".
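
On the initiator side this just means logging into the same target
once per fabric and letting dm-multipath arbitrate; roughly (the IQN
and portal addresses are made up):

  # log in to the same LIO target over the A and B fabrics
  iscsiadm -m node -T iqn.2015-06.com.example:store1 -p 10.0.1.10 --login
  iscsiadm -m node -T iqn.2015-06.com.example:store1 -p 10.0.2.10 --login

  # dm-multipath then sees one device with two paths and handles path failure
  multipath -ll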

I see OSD connections as paths, rather than links, as these are higher
level object storage exchanges.

Thank you,
Alex

>
> Nick
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Alex Gorbachev
>> Sent: 27 June 2015 19:02
>> To: ceph-users@xxxxxxxxxxxxxx
>> Subject:  Redundant networks in Ceph
>>
>> The current network design in Ceph
>> (http://ceph.com/docs/master/rados/configuration/network-config-ref)
>> uses nonredundant networks for both cluster and public communication.
>> Ideally, in a high load environment these will be 10 or 40+ GbE networks.
>> For cost reasons, most such installations will use the same switch hardware
>> and separate Ceph traffic using VLANs.
>>
>> Networking is complex, and situations are possible where switches and
>> routers drop traffic.  We ran into one of those at one of our sites -
>> connections to hosts stay up (so bonding NICs does not help), yet OSD
>> communication gets disrupted, client IO hangs, and failures cascade to
>> client applications.
>>
>> My understanding is that if OSDs cannot connect for some time over the
>> cluster network, then IO will hang and time out.  The document states:
>>
>> "If you specify more than one IP address and subnet mask for either the
>> public or the cluster network, the subnets within the network must be
>> capable of routing to each other."
>>
>> Which in the real world means a complicated Layer 3 routing setup and is
>> not practical in many configurations.
>>
>> What if there was an option for "cluster 2" and "public 2" networks, to
>> which OSDs and MONs would go, either in active/backup or active/active mode
>> (cluster 1 and cluster 2 exist separately and do not route to each other)?
>>
>> The difference between this setup and bonding is that here the decision to
>> fail over and try the other network is made at the OSD/MON level, and it
>> brings resilience to faults within the switch core, which are really only
>> detectable at the application layer.
>>
>> Am I missing an already existing feature?  Please advise.
>>
>> Best regards,
>> Alex Gorbachev
>> Intelligent Systems Services Inc.
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


