Re: High load during recovery (after disk placement)

We are seeing some of these issues as well; here are some things we have learned.

We found in our testing that enabling the discard mount option on the
SSD OSDs did not measurably affect performance (though be sure to test
on your own SSDs). When our SSDs get full, performance falls off, so
we try to keep them under 70% full.
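
If you want to test this yourself, here is roughly what we compare;
the device path, mount point, and OSD id below are just examples for
illustration:

    # option 1: online TRIM via the discard mount option in /etc/fstab (XFS shown)
    /dev/sdb1  /var/lib/ceph/osd/ceph-12  xfs  noatime,discard  0 0

    # option 2: leave discard off and trim periodically instead, e.g. weekly from cron
    fstrim -v /var/lib/ceph/osd/ceph-12

    # keep an eye on per-OSD fullness so the SSDs stay under ~70%
    ceph osd df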

Your osd_backfill_scan_{min,max} settings are very aggressive for
recovery. They mean that if the cluster is idle, it will schedule
recovery/backfill every 4 seconds, and if the cluster is busy, it will
force recovery/backfill ops every 8 seconds. In our cluster, with 10
drives weighted out and 6 weighted in, I found that recovery would
start out pretty even across all OSDs, then over time it would move
closer and closer to the times in the config. Instead of slow and
steady every second, it was fast and furious for about 10-15 seconds,
then no recovery for 45-50 seconds. With your settings, our cluster
would never have a moment when it wasn't trying to do recovery,
effectively starving client ops.
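
If you want to experiment without restarting OSDs, you can push less
aggressive values at runtime and then make them permanent in
ceph.conf. The 64/512 below are what I recall the Hammer defaults
being, so double-check them on your version:

    # inject on all OSDs at runtime (takes effect immediately, not persistent)
    ceph tell osd.* injectargs '--osd_backfill_scan_min 64 --osd_backfill_scan_max 512'

    # and persist under [osd] in ceph.conf
    osd_backfill_scan_min = 64
    osd_backfill_scan_max = 512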

I've also found that with our SSD OSDs, we can get a lot of blocked
I/O when one starts up. This is most likely due to the higher number
of I/Os happening on those OSDs compared to our spindle OSDs. These
SSD OSDs take a lot longer to boot up, and client I/O hits the OSD as
soon as it is up, but before it has peered all its PGs and is ready to
service requests. I'm going to look deeper to see if there is a way to
adjust this behavior, but it doesn't look easy at all.
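
One thing we are considering for the next time we restart an SSD OSD
(just a sketch of the idea, we haven't validated that it fully solves
the blocked I/O, and osd.12 is only an example id): pause backfill and
recovery while the OSD boots and peers, then unset the flags once it
has settled:

    ceph osd set nobackfill
    ceph osd set norecover
    # start the OSD, then watch it come up and peer
    ceph daemon osd.12 status    # run on the OSD host; shows booting/active state
    ceph -s                      # wait until peering PGs clear
    ceph osd unset norecover
    ceph osd unset nobackfill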
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Nov 20, 2015 at 10:33 AM, Simon Engelsman <simon@xxxxxxxxxxxx> wrote:
> Hi,
>
> We've experienced a very weird problem last week with our Ceph
> cluster. We would like to ask your opinion(s) and advice
>
> Our dedicated Ceph OSD nodes run with:
>
> Total platform
> - I/O average: ~2500 write ops/s, ~600 read ops/s
> - Replicas: 3x
>
> 2 pools:
> - SSD (~50 x 1TB)
> - Spinner (~36 x 2TB)
> - 1024 PGs per pool (so 2048 in total)
>
> Each node has:
> - 24 GB of RAM
> - Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6-core)
> - 2x 10Gbit connections (1x backend, 1x frontend access)
> - ~7 SSD OSDs and ~5 HDD OSDs
> - Ceph Hammer
> - Ubuntu 14.04.3
> - Stock kernel: 3.13.0-61-generic
> (8 hosts in total, monitors are dedicated machines)
>
>
> The SSD pool is our primary service, delivering RBD images to VMs.
> The HDD pool is for massive/slow data. The problems we had were
> concentrated on the SSD pool.
> We understand that each terabyte of OSD capacity should be paired
> with roughly 1 GB of RAM. With 7 SSDs & 5 HDDs we cover about 17 TB,
> so with our 24 GB we are getting close to the limit, but not yet over it.
>
>
> The problem:
>
> We added a new OSD (SSD) to the cluster. Normally this starts with a
> bit of a higher load but doesn't make the cluster unstable to use. At
> some point during the recovery, all client I/O stalled or became
> extremely slow. While investigating, we discovered that the culprit
> was not the newly added SSD but another OSD that had been in the
> cluster for a longer period. We saw slow/blocked requests, up to a few hundred of them.
>
> Chronology of events:
> - Added a new OSD
> - Cluster started to recover/move PGs; this went OK for a few hours
> - Cluster became unstable at some point; we saw that an OSD was
> freaking out (not the added OSD)
> - We stopped that OSD (84% full), which solved the I/O problems
> - We started the OSD again, resulting in the same problems; the disk
> ended up in backfill_toofull at some point (>85%)
> - We reweighted the OSD to 0.8; after that the recovery went smoothly
>
> What could have caused this entire freakout? Was it the bad-apple OSD
> that was 85% full? Was it a combination of things? What is the
> experience of other people on this list? We have put some questions
> below and would love to hear from other Ceph administrators.
>
>
> Questions we have:
> - Should we increase RAM in the nodes?
> - Should we enable trimming on the SSDs?
> - How much headroom should we keep in terms of storage?
> (currently we are about 65% full; the least full disk is ~29%, the
> most full disk ~80%)
> - Would it be good to have separate clusters for spinners & SSDs?
> - If only one disk is the culprit, why does this affect I/O? Shouldn't
> Ceph discard/delay writes to that disk?
> - What CPU capacity is advisable? We can add a second E5-2620 v2 @
> 2.10GHz (going to 16 cores); is that advisable?
>
>
> What we already do to slow down recovery:
>
> - We only add disks one-by-one
> - The CFQ I/O scheduler is enabled
>
> settings:
> osd_backfill_scan_min = 4
> osd_backfill_scan_max = 8
> osd_max_backfills = 2
> osd_recovery_max_active = 1
>
> osd client op priority = 63
> osd recovery op priority = 1
>
> osd disk thread ioprio class = idle
> osd disk thread ioprio priority = 3 #0-7 within idle class
>
>
> Kind regards,
> Simon
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


