Dear list,

In a Ceph blog post about the new Luminous release there is a paragraph on the need for Ceph tuning [1]:

"If you are a Ceph power user and believe there is some setting that you need to change for your environment to get the best performance, please tell us; we'd like to either adjust our defaults so that your change isn't necessary or have a go at convincing you that you shouldn't be tuning that option."

We have been tuning several ceph.conf parameters to allow for "fast failure" when an entire datacenter goes offline. We now have continued operation (no pending IO) after ~7 seconds. We have changed the following parameters (a sketch of how to check and inject them on a running cluster is in the P.S. below):

[global]
# http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
osd heartbeat grace = 4                 # default 6
# Do _NOT_ scale based on laggy estimations
mon osd adjust heartbeat grace = false

^^ Without this setting it could take up to two minutes before Ceph flagged a whole datacenter down (after we cut connectivity to the DC). We are not sure how the laggy estimation is done, but it is not good enough for us.

[mon]
# http://docs.ceph.com/docs/master/rados/configuration/mon-config-ref/
# TUNING #
mon lease = 1.0                         # default 5
mon election timeout = 2                # default 5
mon lease renew interval factor = 0.4   # default 0.6
mon lease ack timeout factor = 1.5      # default 2.0
mon timecheck interval = 60             # default 300

The settings above are there to make the whole process faster. After a DC failure the monitors need a re-election (depending on which DC failed and which monitor was the leader and which were peons). While going through mon debug logging we observed that this whole process is really fast (things are done within milliseconds). We have a fairly low-latency network, so I guess we can cut some slack here. Ceph won't make any decisions while there is no consensus, so the sooner that consensus is reached, the better.

# http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/#monitor-settings
mon osd reporter subtree level = datacenter

^^ We want to make sure at least two datacenters report a datacenter as down, not individual hosts (this assumes the CRUSH map has datacenter buckets; see the P.S. below).

[osd]
# http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
osd crush update on start = false
osd heartbeat interval = 1              # default 6
osd mon heartbeat interval = 10         # default 30
osd mon report interval min = 1         # default 5
osd mon report interval max = 15        # default 120

The OSDs almost immediately notice that they are cut off from their partner OSDs in their placement groups, but by default they wait 6 seconds before sending their report to the monitors. During our analysis this was exactly the time the monitors spent holding their election. By tuning the settings above the OSDs send their reports faster, so by the time the election is finished the monitors can handle the reports from the OSDs, conclude that a DC is down, flag it down, and allow normal client IO again (the P.S. below sketches how to watch this during a test).

Of course, stability and data safety are most important to us, so if any of these settings make you worry, please let us know.

Gr. Stefan

[1]: http://ceph.com/community/new-luminous-rados-improvements/

--
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx
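P.S. For anyone who wants to experiment with these values before putting them in ceph.conf: roughly something like the below should work to check and change them on a running cluster. This is a minimal sketch; "osd.0" and "mon.a" are just placeholder daemon names, "ceph daemon" has to be run on the node hosting that daemon, and Ceph will tell you if an option cannot be changed at runtime and needs a restart.

# What a running daemon is actually using (via its admin socket):
ceph daemon osd.0 config get osd_heartbeat_grace
ceph daemon mon.a config get mon_lease

# Inject new values into all running daemons of a type:
ceph tell osd.* injectargs '--osd_heartbeat_grace 4 --osd_heartbeat_interval 1'
ceph tell mon.* injectargs '--mon_lease 1.0 --mon_election_timeout 2'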
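P.P.S. "mon osd reporter subtree level = datacenter" only does what we want because our CRUSH map has datacenter buckets between the root and the hosts. For anyone building something similar, a rough sketch (the bucket and host names here are made up, not our actual layout):

# Create a datacenter bucket, hang it under the root, move a host into it:
ceph osd crush add-bucket dc1 datacenter
ceph osd crush move dc1 root=default
ceph osd crush move host1 datacenter=dc1
# ... repeat for the other datacenters and hosts, then verify:
ceph osd tree

We also set "osd crush update on start = false" so the OSDs do not move themselves around in the CRUSH map when they (re)start; with that setting the placement created above stays put.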
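P.P.P.S. To see how long the failover actually takes during a test, plain ceph CLI is enough (nothing specific to our setup): follow the cluster log in one terminal and check quorum in another while you cut the DC connectivity.

# Follow cluster status/log while the DC link is cut:
ceph -w
# In another terminal: which monitors are in quorum, and the overall state:
ceph quorum_status
ceph -s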