On Sun, Aug 7, 2016 at 6:56 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> [Reduced to ceph-users, this isn't community related]
>
> Hello,
>
> On Sat, 6 Aug 2016 20:23:41 +0530 Venkata Manojawa Paritala wrote:
>
>> Hi,
>>
>> We have configured a single Ceph cluster in a lab with the below
>> specification.
>>
>> 1. Divided the cluster into 3 logical sites (SiteA, SiteB & SiteC). This is
>> to simulate that nodes are part of different data centers and have
>> network connectivity between them for DR.
>
> You might want to search the ML archives, this has been discussed plenty
> of times.
> While DR and multi-site replication certainly are desirable, they are also
> going to introduce painful latencies with Ceph, especially if your sites
> aren't relatively close to each other (metro, less than 10km fiber runs).
>
> The new rbd-mirror feature may or may not help in this kind of scenario,
> see the posts about this just in the last few days.
>
> Since you didn't explicitly mention it: do you have custom CRUSH rules
> to distribute your data accordingly?
>
>> 2. Each site operates in a different subnet and each subnet is part of one
>> VLAN. We have configured routing so that OSD nodes in one site can
>> communicate with OSD nodes in the other 2 sites.
>> 3. Each site will have one monitor node, 2 OSD nodes (to which we have
>> disks attached) and IO-generating clients.
>
> You will want more monitors in a production environment and, depending on
> the actual topology, more "sites" to break ties.
>
> For example, if you have a triangle setup, give your primary site 3 MONs
> and the other sites 2 MONs each.
>
> Of course this means that if you lose all network links between your sites,
> you still won't be able to reach quorum.
>
>> 4. We have configured 2 networks.
>> 4.1. Public network - to which all the clients, monitors and OSD nodes are
>> connected.
>> 4.2. Cluster network - to which only the OSD nodes are connected, for
>> replication/recovery/heartbeat traffic.
>>
> Unless actually needed, I (and others) tend to avoid split networks, since
> they can introduce "wonderful" failure scenarios, as you just found out.
>
> The only reason for such a split network setup in my book is if your
> storage nodes can write FASTER than the aggregate bandwidth of your
> network links to those nodes.
>
>> 5. We have 2 issues here.
>> 5.1. We are unable to sustain IO for clients from individual sites when we
>> isolate the OSD nodes by bringing down ONLY the cluster network between
>> sites. Logically this puts the individual sites in isolation with respect
>> to the cluster network. Please note that the public network is still
>> connected between the sites.
>>
> See above, that's expected.
> Though in a real-world setup I'd expect both networks to fail (common fiber
> trunk being severed) at the same time.
>
> Again, instead of 2 networks you'll be better off with a single, but
> fully redundant, network.
>
>> 5.2. In a fully functional cluster, when we bring down 2 sites (shut down
>> the OSD services of 2 sites - say Site A OSDs and Site B OSDs), then OSDs
>> in the third site (Site C) go down (OSD flapping).
>>
>
> This is a bit unclear: if you only shut down the OSDs and the MONs are still
> running and have connectivity, the cluster should still have a working quorum
> (the thing you're thinking about below).
>
> OTOH, losing 2/3rds of your OSDs with normal (min_size=2) replication
> settings will lock your cluster up anyway.
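
(For reference, a minimal sketch of the site-aware CRUSH layout Christian asks
about above. The bucket names siteA/siteB/siteC and the host name osd-a1 are
hypothetical; repeat the "move" for every OSD host, and use a pool size of 3 so
each site holds one copy.)

----------------
# one CRUSH bucket of type "datacenter" per site, placed under the default root
ceph osd crush add-bucket siteA datacenter
ceph osd crush add-bucket siteB datacenter
ceph osd crush add-bucket siteC datacenter
ceph osd crush move siteA root=default
ceph osd crush move siteB root=default
ceph osd crush move siteC root=default

# put each OSD host under its site (repeat per host)
ceph osd crush move osd-a1 datacenter=siteA

# rule that spreads replicas across datacenters, then assign it to the pool
ceph osd crush rule create-simple replicate-per-site default datacenter
ceph osd pool set <pool> crush_ruleset <ruleset-id>
----------------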
>
> Regards,
>
> Christian
>
>> We need workarounds/solutions to fix the above 2 issues.
>>
>> Below are some of the parameters we have already set in ceph.conf
>> to sustain the cluster for a longer time when we cut off the links between
>> sites. But they were not successful.
>>
>> --------------
>> [global]
>> public_network = 10.10.0.0/16
>> cluster_network = 192.168.100.0/16,192.168.150.0/16,192.168.200.0/16
>> osd hearbeat address = 172.16.0.0/16
>>
>> [monitor]
>> mon osd report timeout = 1800
>>
>> [OSD}

Typo?

>> osd heartbeat interval = 12
>> osd hearbeat grace = 60
>> osd mon heartbeat interval = 60
>> osd mon report interval max = 300
>> osd mon report interval min = 10
>> osd mon act timeout = 60
>> .
>> .
>> ----------------
>>
>> We also configured the parameter "osd_heartbeat_addr" and tried it with the
>> values - 1) the Ceph public network (assuming that when we bring down the
>> cluster network, heartbeats should happen via the public network); 2) a
>> different network range altogether, with physical connections. But neither
>> option worked.
>>
>> We have a total of 49 OSDs (14 in Site A, 14 in Site B, 21 in Site C) in the
>> cluster. One monitor in each site.
>>
>> We need to try the below two options.
>>
>> A) Increase the "mon osd min down reporters" value. The question is how much.
>> Say, if I set this value to 49, will client IO be sustained when we cut off
>> the cluster network links between sites? In this case one issue would be
>> that if an OSD is really down we wouldn't know.
>>
>> B) Add 2 monitors to each site. This would give each site 3 monitors and the
>> overall cluster 9 monitors. The reason we want to try this is that we think
>> the OSDs are going down because the quorum is unable to find the minimum
>> number of nodes (maybe monitors) to sustain.
>>
>> Thanks & Regards,
>> Manoj
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

--
Email:
shinobu@xxxxxxxxx
shinobu@xxxxxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
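
(For completeness, a minimal ceph.conf sketch of the reporter/heartbeat tuning
discussed in the thread above. The values are illustrative only, not
recommendations. Note that section and option names must be spelled exactly,
e.g. "[osd]" and "osd heartbeat grace", otherwise Ceph ignores them and they
take no effect.)

----------------
[osd]
osd heartbeat interval = 12
osd heartbeat grace = 60

[mon]
# require down reports from more OSDs before an OSD is marked down;
# 7 is purely an illustrative value for a 49-OSD cluster
mon osd min down reporters = 7
----------------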