Hi,

Let's start with a disclaimer: I am not an expert on any of these Ceph tuning settings :)

However, in general with cluster intervals/timings, quicker failover detection comes at a cost:

1) Processing power: you might starve yourself of resources as the cluster grows. If you multiply out all of these changes, you can end up with a lot of extra network chatter and a need for a lot of extra processing power. With this in mind I would be very careful changing any setting by an order of magnitude unless you know exactly what the impact is.

2) You might create a very "nervous" cluster. E.g. a short network hiccup gets OSDs marked out, which triggers a large amount of data shuffling, which can flood a network link, which causes more short network hiccups, which leads to flapping OSDs, and so on.

IMHO I would change the mindset on tuning from:
- what is the fastest possible failure detection time for a broken datacenter?
to:
- in case of a catastrophic DC failure (which happens maybe once every few years), are the defaults really so bad that you must deviate from the widely-tested deployment?

Of course, it might be that the defaults are just a random number a dev put in, and this is exactly what should be done in every deployment ;)
I am sure some other people have better insights into these specific settings.

Cheers,
Robert van Leeuwen

On 12/6/17, 7:01 AM, "ceph-users on behalf of Stefan Kooman" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of stefan@xxxxxx> wrote:

Dear list,

In a Ceph blog post about the new Luminous release there is a paragraph on the need for Ceph tuning [1]:

"If you are a Ceph power user and believe there is some setting that you need to change for your environment to get the best performance, please tell us - we'd like to either adjust our defaults so that your change isn't necessary, or have a go at convincing you that you shouldn't be tuning that option."

We have been tuning several ceph.conf parameters in order to allow for "fast failure" when an entire datacenter goes offline. We now have continued operation (no pending IO) after ~ 7 seconds. We have changed the following parameters:

[global]
# http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
osd heartbeat grace = 4                 # default 20
# Do _NOT_ scale based on laggy estimations
mon osd adjust heartbeat grace = false

^^ Without this setting it could take up to two minutes before Ceph flagged a whole datacenter down (after we cut connectivity to the DC). Not sure how the estimation is done, but it is not good enough for us.

[mon]
# http://docs.ceph.com/docs/master/rados/configuration/mon-config-ref/
# TUNING #
mon lease = 1.0                         # default 5
mon election timeout = 2                # default 5
mon lease renew interval factor = 0.4   # default 0.6
mon lease ack timeout factor = 1.5      # default 2.0
mon timecheck interval = 60             # default 300

The settings above are there to make the whole process faster. After a DC failure the monitors need a re-election (depending on which DC failed, and on who was the leader and who were peons).
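(For a rough, purely illustrative feel for what the [mon] changes above buy, the sketch below just sums "lease expiry + election timeout" for the default and the tuned values. The actual monitor lease/election state machine is more involved, so these are back-of-envelope numbers only, not anything Ceph guarantees.)

    # Back-of-envelope only: assume the worst case for regaining a usable
    # monitor quorum is roughly "a peon's lease expires" plus "the new
    # election converges within the election timeout". The real monitor
    # state machine is more involved than this naive sum.
    defaults = {"mon_lease": 5.0, "mon_election_timeout": 5.0}
    tuned    = {"mon_lease": 1.0, "mon_election_timeout": 2.0}

    def rough_quorum_recovery(cfg):
        return cfg["mon_lease"] + cfg["mon_election_timeout"]

    print("defaults: ~%.1fs, tuned: ~%.1fs" % (
        rough_quorum_recovery(defaults), rough_quorum_recovery(tuned)))
    # defaults: ~10.0s, tuned: ~3.0s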
While going through mon debug logging we observed that this whole process is really fast (things get done in milliseconds). We have a quite low-latency network, so I guess we can cut some slack here. Ceph won't make any decisions while there is no consensus, so better to get that consensus as soon as possible.

# http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/#monitor-settings
mon osd reporter subtree level = datacenter

^^ We do want to make sure at least two datacenters see a datacenter go down, not just individual hosts.

[osd]
# http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
osd crush update on start = false
osd heartbeat interval = 1        # default 6
osd mon heartbeat interval = 10   # default 30
osd mon report interval min = 1   # default 5
osd mon report interval max = 15  # default 120

The OSDs would almost immediately see a "cut off" to their partner OSDs in the placement groups. By default they wait 6 seconds before sending their report to the monitors. During our analysis this was exactly the time the monitors were holding an election. By tuning all of the above we could get them to send their reports faster, and by the time the election process was finished the monitors would handle the reports from the OSDs, conclude that a DC is down, flag it down and allow normal client IO again.

Of course, stability and data safety are most important to us. So if any of these settings make you worry, please let us know.

Gr. Stefan

[1]: http://ceph.com/community/new-luminous-rados-improvements/

--
| BIT BV  http://www.bit.nl/            Kamer van Koophandel 09090351
| GPG: 0xD14839C6                       +31 318 648 688 / info@xxxxxx

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
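(As a closing, purely illustrative note on the ~7 seconds mentioned above: the sketch below naively adds up the tuned osd heartbeat grace, the minimum OSD-to-mon report interval and the monitor election timeout. In reality the election and the OSD failure reports overlap, as described in the mail, so this is only a rough upper-bound estimate, not Ceph's actual failure-detection logic.)

    # Back-of-envelope only: serially sum the tuned values quoted above.
    # In practice the monitor election and the OSD failure reports overlap,
    # so this is a pessimistic illustration, not how Ceph computes it.
    osd_heartbeat_grace         = 4.0  # peer OSD deemed unresponsive after 4s
    osd_mon_report_interval_min = 1.0  # OSD may report the failure after ~1s
    mon_election_timeout        = 2.0  # new monitor quorum after leader loss

    worst_case = (osd_heartbeat_grace
                  + osd_mon_report_interval_min
                  + mon_election_timeout)
    print("rough worst-case DC-down detection: ~%.0fs" % worst_case)  # ~7s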