Re: OSDs going down when we bring down some OSD nodes or cut off the cluster network link between OSD nodes

Hi Christian,

Thank you very much for the reply. Please find my comments in-line.

Thanks & Regards,
Manoj

On Sun, Aug 7, 2016 at 3:26 PM, Christian Balzer <chibi@xxxxxxx> wrote:

[Reduced to ceph-users, this isn't community related]

Hello,

On Sat, 6 Aug 2016 20:23:41 +0530 Venkata Manojawa Paritala wrote:

> Hi,
>
> We have configured a single Ceph cluster in a lab with the below
> specification.
>
> 1. Divided the cluster into 3 logical sites (SiteA, SiteB & SiteC). This is
> to simulate nodes that are part of different data centers, with network
> connectivity between them for DR.

You might want to search the ML archives; this has been discussed plenty
of times.
While DR and multi-site replication certainly is desirable, it is also
going to introduce painful latencies with Ceph, especially if your sites
aren't relatively close to each other (Metro, less than 10km fiber runs).

Manoj :- We have configured the delays on the ethernet ports. Between sites A & B we have a 0.2 ms delay (configured on SiteB). Between sites B & C we have a 5 ms delay (configured on SiteC).
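
Such a delay can be simulated with Linux traffic control (netem); a minimal sketch, assuming eth1 is the inter-site interface on the SiteC nodes (the interface name and placement are assumptions, not details from the setup above):

    # add 5 ms of one-way delay on the assumed inter-site interface
    tc qdisc add dev eth1 root netem delay 5ms
    # inspect or remove it again
    tc qdisc show dev eth1
    tc qdisc del dev eth1 root netem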

The new rbd-mirror feature may or may not help in this kind of scenario,
see the posts about this just in the last few days.

Since you didn't explicitly mention it: do you have custom CRUSH rules
to distribute your data accordingly?

Manoj :- You guessed it right. We have configured rulesets in such a way that OSDs from all 3 sites are picked up for replication.
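
A rule of that shape would look roughly like the sketch below (the rule name and the bucket type "site" are assumptions; they are not taken from the cluster described here):

    rule replicated_across_sites {
            ruleset 1
            type replicated
            min_size 3
            max_size 3
            step take default
            # pick one host (and one OSD under it) in each of the 3 sites
            step choose firstn 0 type site
            step chooseleaf firstn 1 type host
            step emit
    }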

> 2. Each site operates in a different subnet and each subnet is part of one
> VLAN. We have configured routing so that OSD nodes in one site can
> communicate with OSD nodes in the other 2 sites.
> 3. Each site will have one monitor  node, 2  OSD nodes (to which we have
> disks attached) and IO generating clients.

You will want more monitors in a production environment and depending on
the actual topology more "sites" to break ties.

For example, if you have a triangle setup, give your primary site 3 MONs
and the other sites 2 MONs each.

Of course this means that if you lose all network links between your sites,
you still won't be able to reach quorum.

Manoj :- Ok. 
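
To spell out the quorum arithmetic for that layout: with 3 + 2 + 2 = 7 monitors, quorum needs floor(7/2) + 1 = 4 of them. The primary site alone (3 MONs) cannot form quorum, but the primary site plus either other site (3 + 2 = 5) can, so losing any one site, or the links to it, still leaves a working majority as long as the two remaining sites can reach each other.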

> 4. We have configured 2 networks.
> 4.1. Public network - To which all the clients, monitors and OSD nodes are
> connected
> 4.2. Cluster network - To which only the OSD nodes are connected for -
> Replication/recovery/heartbeat traffic.
>
Unless actually needed, I (and others) tend to avoid split networks, since
they can introduce "wonderful" failure scenarios, as you just found out.

The only reason for such a split network setup in my book is if your
storage nodes can write FASTER than the aggregate bandwidth of your
network links to those nodes.
 
Manoj :- We did not want the replication/recovery/heartbeat traffic on the public network, so we configured a separate network for it.

> 5. We have 2 issues here.
> 5.1. We are unable to sustain IO for clients from individual sites when we
> isolate the OSD nodes by bringing down ONLY the cluster network between
> sites. Logically this puts the individual sites in isolation with respect
> to the cluster network. Please note that the public network is
> still connected between the sites.
>
See above, that's expected.
Though in a real world setup I'd expect both networks to fail (common fiber
trunk being severed) at the same time.

Again, instead of 2 networks you'll be better off with a single, but
fully redundant, network.

Manoj :- You mean to say only one network (public) with 2 NICs on each of the Monitor & OSD nodes?
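
For illustration, a single, fully redundant network in that sense would drop the cluster network entirely and handle redundancy at the link level; a minimal sketch (the interface names and bonding mode are assumptions, only the 10.10.0.0/16 range comes from the configuration quoted further down):

    # ceph.conf with only the public network defined (no cluster_network);
    # redundancy then comes from the network layer, not from Ceph.
    [global]
    public_network = 10.10.0.0/16

    # On each MON/OSD node the two NICs would be bonded, e.g.
    # bond0 = eth0 + eth1 in 802.3ad (LACP) or active-backup mode.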

> 5.2. In a fully functional cluster, when we bring down 2 sites (shut down
> the OSD services of 2 sites, say SiteA OSDs and SiteB OSDs), then the OSDs
> in the third site (SiteC) go down (OSD flapping).
>

This is a bit unclear. If you only shut down the OSDs while the MONs are still
running and have connectivity, the cluster should still have a working quorum
(the thing you're thinking about below).

OTOH, losing 2/3rds of your OSDs with normal (min_size=2) replication
settings will lock your cluster up anyway.

Manoj :- This was what we were guessing. We also observed the same issue when we tried with a single replica.
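
For reference, the replication settings in question are per pool and can be inspected and changed like this (the pool name "rbd" is only an example):

    ceph osd pool get rbd size
    ceph osd pool get rbd min_size
    # allow IO to continue with a single surviving replica (test setups only)
    ceph osd pool set rbd min_size 1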

Regards,

Christian

> We need workarounds/solutions to  fix the above 2 issues.
>
> Below are some of the parameters we have already set in ceph.conf
> to sustain the cluster for a longer time when we cut off the links between
> sites, but they were not successful.
>
> --------------
> [global]
> public_network = 10.10.0.0/16
> cluster_network = 192.168.100.0/16,192.168.150.0/16,192.168.200.0/16
> osd heartbeat address = 172.16.0.0/16
>
> [monitor]
> mon osd report timeout = 1800
>
> [OSD]
> osd heartbeat interval = 12
> osd heartbeat grace = 60
> osd mon heartbeat interval = 60
> osd mon report interval max = 300
> osd mon report interval min = 10
> osd mon ack timeout = 60
> .
> .
> ----------------
>
> We also configured the parameter "osd_heartbeat_addr" and tried it with two
> values: 1) the Ceph public network (assuming that when we bring down the
> cluster network, heartbeats should happen via the public network); 2) a
> different network range altogether, with physical connections in place.
> But neither option worked.
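
Whether such options actually took effect on a running OSD can be verified through the daemon's admin socket, for example:

    ceph daemon osd.0 config get osd_heartbeat_grace
    ceph daemon osd.0 config get osd_heartbeat_addr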
>
> We have a total of 49 OSDs (14 in SiteA, 14 in SiteB, 21 in SiteC) in the
> cluster. One monitor in each site.
>
> We need to try the below two options.
>
> A) Increase the "mon osd min down reporters" value. The question is by how
> much. Say, if I set this value to 49, will client IO be sustained when we
> cut off the cluster network links between sites? In that case one issue
> would be that if an OSD is really down, we wouldn't know. (A sketch of this
> option follows below.)
>
> B) Add 2 monitors to each site. This would give each site 3 monitors and
> the overall cluster 9 monitors. The reason we want to try this is that we
> think the OSDs are going down because the quorum cannot find the minimum
> number of nodes (maybe monitors) to sustain itself.
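
A sketch of option A, for illustration only (the value 15 is an assumption, chosen to exceed the 14 OSDs of the smallest site, not a recommendation from this thread):

    # in ceph.conf under [mon]:
    #   mon osd min down reporters = 15
    # or injected at runtime:
    ceph tell mon.* injectargs '--mon-osd-min-down-reporters 15'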
>
> Thanks & Regards,
> Manoj


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
