Re: Spreading deep-scrubbing load

Hello,

On Wed, 15 Jun 2016 00:01:42 +0000 Jared Curtis wrote:

> I’ve just started looking into one of our ceph clusters because a weekly
> deep scrub had a major IO impact on the cluster which caused multiple
> VMs to grind to a halt.
> 
A story you will find aplenty in the ML archives.

> So far I’ve discovered that this particular cluster is configured
> incorrectly for the number of PGS per OSD. Currently that setting is 6
> but should be closer to ~4096 based on the calc tool.
> 
You're comparing apples and oranges here.
PGs (and PGPs, don't forget them!) are configured per pool; the number of
PGs per OSD is the result of all PGs in all pools combined.

Output of "ceph osd pool ls detail" would be helpful for us.

> If I change the number of PGs to the suggested values, what should I
> expect, especially around deep scrub performance but also just in
> general, as I’m very new to ceph.

We're not psychic.
The number of PGs will have an impact, but how much depends very much on
your existing setup.
So the usual: all versions (Ceph/OS), a detailed cluster description (all
HW details down to the SSD model if you have any, network, etc.).
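If you're unsure what to collect, something along these lines (a minimal
sketch, substitute your distro's equivalents where needed) covers the
basics:

    ceph -v                    # Ceph version
    uname -r; lsb_release -a   # kernel / OS release
    ceph -s                    # overall cluster health and activity
    ceph osd tree              # OSD layout / CRUSH hierarchy
    ceph df                    # pools and usage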

Generally speaking, deep-scrub is a very expensive operation of
questionable value; see the current "Disk failures" thread for example.

That said, your cluster should be able to cope with it, as the
deep-scrub impact is a lot like what you'd get from recovery and/or
backfilling operations. 
Think of deep-scrub causing pain as an early warning sign that your
cluster is underpowered and/or badly configured.

> What I’m hoping will happen is that
> instead of a single weekly deep scrub that runs for 24+ hours we would
> have lots of smaller deep scrubs that can hopefully finish in a
> reasonable time with minimal cluster impact.
> 
Google and the (albeit often lagging behind) documentation are your
friends.

These are the scrub-related configuration parameters; this sample is from
my Hammer test cluster, with comments below the relevant ones:

    "osd_scrub_thread_timeout": "60",
    "osd_scrub_thread_suicide_timeout": "300",
    "osd_scrub_finalize_thread_timeout": "600",
    "osd_scrub_invalid_stats": "true",
    "osd_max_scrubs": "1",
Default AFAIK: no more than one scrub per OSD. Alas, deep scrubs started
by other OSDs may of course still need data from this one as well.

    "osd_scrub_begin_hour": "0",
    "osd_scrub_end_hour": "6",
These two are perfect if your cluster can finish its deep scrubs within
off-peak hours.
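For example, to confine scrubbing to a 01:00-06:00 window (illustrative
values, pick your own off-peak hours), the ceph.conf equivalent would be:

    [osd]
    osd scrub begin hour = 1
    osd scrub end hour = 6

Values in ceph.conf only take effect after an OSD restart; injectargs
(sketched further down) can apply them to running OSDs.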

    "osd_scrub_load_threshold": "0.5",
Adjust to not starve your I/O.

    "osd_scrub_min_interval": "86400",
    "osd_scrub_max_interval": "604800",
    "osd_scrub_interval_randomize_ratio": "0.5",
The latest Hammer and everything after it can randomize things (spreading
the load out), but if you want scrubs to happen within a certain time
frame this might not be helpful.

    "osd_scrub_chunk_min": "5",
    "osd_scrub_chunk_max": "25",
    "osd_scrub_sleep": "0.1",
This allows client I/O to get a foot in the door and tends to be the
biggest help in Hammer and before. In Jewel the combined I/O queue should
help a lot as well.
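Like most of the options here (including the load threshold above), these
can be changed on running OSDs without a restart; a sketch with an
illustrative value:

    # Apply to all OSDs at runtime, then put the same value into ceph.conf
    # so it survives restarts.
    ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'
    # Verify on one OSD (run on the node hosting osd.0):
    ceph daemon osd.0 config show | grep osd_scrub_sleep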

    "osd_deep_scrub_interval": "604800",
Once that's exceeded, Ceph will deep-scrub, come hell or high water,
ignoring at the very least the load setting above (a cron-driven
workaround is sketched after the list below).

    "osd_deep_scrub_stride": "524288",
    "osd_deep_scrub_update_digest_min_age": "7200",
    "osd_debug_scrub_chance_rewrite_digest": "0",



Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/