On Sun, Aug 7, 2016 at 6:56 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> [Reduced to ceph-users, this isn't community related]
>
> Hello,
>
> On Sat, 6 Aug 2016 20:23:41 +0530 Venkata Manojawa Paritala wrote:
>
>> Hi,
>>
>> We have configured a single Ceph cluster in a lab with the below
>> specification.
>>
>> 1. Divided the cluster into 3 logical sites (SiteA, SiteB & SiteC). This is
>> to simulate that nodes are part of different data centers and have
>> network connectivity between them for DR.
>
> You might want to search the ML archives, this has been discussed plenty
> of times.
> While DR and multi-site replication certainly are desirable, they are also
> going to introduce painful latencies with Ceph, especially if your sites
> aren't relatively close to each other (metro, less than 10km fiber runs).
>
> The new rbd-mirror feature may or may not help in this kind of scenario,
> see the posts about this just in the last few days.
>
> Since you didn't explicitly mention it: do you have custom CRUSH rules
> to distribute your data accordingly?
>
>> 2. Each site operates in a different subnet and each subnet is part of one
>> VLAN. We have configured routing so that OSD nodes in one site can
>> communicate with OSD nodes in the other 2 sites.
>> 3. Each site will have one monitor node, 2 OSD nodes (to which we have
>> disks attached) and IO-generating clients.
>
> You will want more monitors in a production environment and, depending on
> the actual topology, more "sites" to break ties.
>
> For example, if you have a triangle setup, give your primary site 3 MONs
> and the other sites 2 MONs each.
>
> Of course this means that if you lose all network links between your sites,
> you still won't be able to reach quorum.
>
>> 4. We have configured 2 networks.
>> 4.1. Public network - to which all the clients, monitors and OSD nodes are
>> connected.
>> 4.2. Cluster network - to which only the OSD nodes are connected, for
>> replication/recovery/heartbeat traffic.
>>
> Unless actually needed, I (and others) tend to avoid split networks, since
> they can introduce "wonderful" failure scenarios, as you just found out.
>
> The only reason for such a split network setup in my book is if your
> storage nodes can write FASTER than the aggregate bandwidth of your
> network links to those nodes.
>
>> 5. We have 2 issues here.
>> 5.1. We are unable to sustain IO for clients from individual sites when we
>> isolate the OSD nodes by bringing down ONLY the cluster network between
>> sites. Logically this puts the individual sites in isolation with respect
>> to the cluster network. Please note that the public network is still
>> connected between the sites.
>>
> See above, that's expected.
> Though in a real-world setup I'd expect both networks to fail (common fiber
> trunk being severed) at the same time.
>
> Again, instead of 2 networks you'll be better off with a single, but
> fully redundant, network.
>
>> 5.2. In a fully functional cluster, when we bring down 2 sites (shut down
>> the OSD services of 2 sites - say Site A OSDs and Site B OSDs), then OSDs
>> in the third site (Site C) go down (OSD flapping).
>>
>
> This is a bit unclear: if you only shut down the OSDs and the MONs are still
> running and have connectivity, the cluster should still have a working quorum
> (the thing you're thinking about below).
>
> OTOH, losing 2/3rds of your OSDs with normal (min_size=2) replication
> settings will lock your cluster up anyway.
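
(For reference, a minimal sketch of the site-aware CRUSH layout Christian asks
about above. The bucket names siteA/siteB/siteC and the host name osd-a1 are
hypothetical; repeat the "move" for every OSD host, and use a pool size of 3 so
each site holds one copy.)

----------------
# one CRUSH bucket of type "datacenter" per site, placed under the default root
ceph osd crush add-bucket siteA datacenter
ceph osd crush add-bucket siteB datacenter
ceph osd crush add-bucket siteC datacenter
ceph osd crush move siteA root=default
ceph osd crush move siteB root=default
ceph osd crush move siteC root=default

# put each OSD host under its site (repeat per host)
ceph osd crush move osd-a1 datacenter=siteA

# rule that spreads replicas across datacenters, then assign it to the pool
ceph osd crush rule create-simple replicate-per-site default datacenter
ceph osd pool set <pool> crush_ruleset <ruleset-id>
----------------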
>
> Regards,
>
> Christian
>
>> We need workarounds/solutions to fix the above 2 issues.
>>
>> Below are some of the parameters we have already set in ceph.conf
>> to sustain the cluster for a longer time when we cut off the links between
>> sites. But they were not successful.
>>
>> --------------
>> [global]
>> public_network = 10.10.0.0/16
>> cluster_network = 192.168.100.0/16,192.168.150.0/16,192.168.200.0/16
>> osd hearbeat address = 172.16.0.0/16
>>
>> [monitor]
>> mon osd report timeout = 1800
>>
>> [OSD}

Typo?

>> osd heartbeat interval = 12
>> osd hearbeat grace = 60
>> osd mon heartbeat interval = 60
>> osd mon report interval max = 300
>> osd mon report interval min = 10
>> osd mon act timeout = 60
>> .
>> .
>> ----------------
>>
>> We also configured the parameter "osd_heartbeat_addr" and tried it with the
>> values - 1) the Ceph public network (assuming that when we bring down the
>> cluster network, heartbeats should happen via the public network); 2) a
>> different network range altogether, with physical connections. But neither
>> option worked.
>>
>> We have a total of 49 OSDs (14 in Site A, 14 in Site B, 21 in Site C) in the
>> cluster. One monitor in each site.
>>
>> We need to try the below two options.
>>
>> A) Increase the "mon osd min down reporters" value. The question is how much.
>> Say, if I set this value to 49, will client IO be sustained when we cut off
>> the cluster network links between sites? In this case one issue would be
>> that if an OSD is really down we wouldn't know.
>>
>> B) Add 2 monitors to each site. This would give each site 3 monitors and the
>> overall cluster 9 monitors. The reason we want to try this is that we think
>> the OSDs are going down because the quorum is unable to find the minimum
>> number of nodes (maybe monitors) to sustain.
>>
>> Thanks & Regards,
>> Manoj
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

--
Email:
shinobu@xxxxxxxxx
shinobu@xxxxxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
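
(For completeness, a minimal ceph.conf sketch of the reporter/heartbeat tuning
discussed in the thread above. The values are illustrative only, not
recommendations. Note that section and option names must be spelled exactly,
e.g. "[osd]" and "osd heartbeat grace", otherwise Ceph ignores them and they
take no effect.)

----------------
[osd]
osd heartbeat interval = 12
osd heartbeat grace = 60

[mon]
# require down reports from more OSDs before an OSD is marked down;
# 7 is purely an illustrative value for a 49-OSD cluster
mon osd min down reporters = 7
----------------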