Re: High load during recovery (after disk placement)

On Fri, Nov 20, 2015 at 11:33 AM, Simon Engelsman <simon@xxxxxxxxxxxx> wrote:
> Hi,
>
> We experienced a very strange problem with our Ceph cluster last
> week. We would like to ask your opinion(s) and advice.
>
> Our dedicated Ceph OSD nodes run with:
>
> Total platform:
> - IO average: ~2500 writes/s, ~600 reads/s
> - Replicas: 3x
>
> 2 pools:
> - SSD (~50x 1TB)
> - Spinner (~36x 2TB)
> - 1024 PGs per pool (2048 in total)
>
> Each node has:
> - 24 GB of RAM
> - Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6-core)
> - 2x 10Gbit connections (1x backend, 1x frontend access)
> - ~7 SSD OSDs and ~5 HDD OSDs
> - Ceph Hammer
> - Ubuntu 14.04.3
> - Stock kernel: 3.13.0-61-generic
> (8 hosts in total; monitors are dedicated machines)
>
>
> The SSD pool is our primary service tier, delivering RBD images to
> VMs. The HDD pool is for bulk/slow data. The problems we had were
> concentrated on the SSD pool.
> We understand that each terabyte of OSD storage should be backed by
> about 1 GB of RAM. With 7 SSD and 5 HDD OSDs we cover about 17 TB, so
> with our 24 GB we are getting close to the limit, but not yet over it.
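
The rule of thumb cited above works out as follows; a quick sketch using
the per-node disk counts from the hardware list (the 1 GB per TB guideline
is the one from the post, not an exact requirement):

```python
# Sanity check of the ~1 GB RAM per TB of OSD storage rule of thumb,
# using the per-node disk counts from the post (7x 1 TB SSD, 5x 2 TB HDD).
ssd_tb = 7 * 1          # ~7 SSD OSDs of 1 TB each
hdd_tb = 5 * 2          # ~5 HDD OSDs of 2 TB each
total_tb = ssd_tb + hdd_tb

ram_needed_gb = total_tb * 1   # rule of thumb: ~1 GB RAM per TB
ram_installed_gb = 24

print(total_tb)                          # 17 TB per node
print(ram_installed_gb - ram_needed_gb)  # 7 GB of headroom -- tight under recovery
```

Note that OSD memory use spikes during recovery/backfill, so 7 GB of
nominal headroom can disappear quickly at exactly the wrong moment.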
>
>
> The problem:
>
> We added a new OSD (SSD) to the cluster. Normally this starts with a
> somewhat higher load but does not make the cluster unstable. At some
> point during the recovery, however, all client IO stalled or became
> extremely slow. While investigating, we discovered that the culprit
> was not the newly added SSD, but another OSD that had been in the
> cluster for a longer period. We saw slow/blocked requests, up to a
> few hundred of them.
>
> Chronology of events:
> - Added a new OSD
> - Cluster started to recover/move PGs; this went fine for a few hours
> - At some point the cluster became unstable, and we saw that one OSD
> (not the newly added one) was misbehaving
> - We stopped that OSD (84% full), which solved the IO problems
> - We started the OSD again, which caused the same problems; the disk
> ended up in backfill_toofull at some point (>85% full)
> - We reweighted the OSD to 0.8, after which the recovery went smoothly
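
The backfill_toofull state in that chronology follows from the backfill
full ratio, which defaults to 0.85 in this era of Ceph and matches the
">85%" observed above. A minimal sketch of the check (an illustration
of the logic, not the actual OSD code):

```python
# Sketch of the backfill-full check that puts a PG into backfill_toofull.
# The 0.85 default matches the ">85%" threshold seen in the post; this is
# an illustration of the logic, not the real OSD implementation.
OSD_BACKFILL_FULL_RATIO = 0.85

def backfill_allowed(used_bytes: float, total_bytes: float) -> bool:
    """Backfill to an OSD is refused once its utilization hits the ratio."""
    return used_bytes / total_bytes < OSD_BACKFILL_FULL_RATIO

print(backfill_allowed(84, 100))  # True  -- at 84% the OSD still accepts backfill
print(backfill_allowed(86, 100))  # False -- >85% full, PGs go backfill_toofull
```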
>
> What could have caused this whole episode? Was it the bad-apple OSD
> that was 85% full? Was it a combination of things? We have put some
> questions below and would love to hear the experiences of other Ceph
> administrators on this list.
>
>
> Questions we have:
> - Should we increase the RAM in the nodes?
> - Should we enable trimming on the SSDs?
> - How much headroom should we keep in terms of storage? (Currently we
> are about ~65% full; the least full disk is ~29%, the most full ~80%.)
> - Would it be good to have separate clusters for spinners and SSDs?
> - If only one disk is the culprit, why does this affect all I/O?
> Shouldn't Ceph discard/delay writes to this disk?
> - What CPU capacity is advisable? We can add a second E5-2620 v2 @
> 2.10GHz for 12 cores in total. Is that advisable?
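
On the headroom question: a common sizing heuristic is to keep enough
free space that losing one node can be re-replicated across the
survivors without any OSD crossing the backfill-full ratio. A rough
sketch under that assumption (8 nodes, 0.85 full ratio, and an even
data distribution, which the 29%..80% spread above shows is optimistic
here):

```python
# Rough headroom estimate: after losing 1 of n nodes, the surviving
# nodes absorb the lost node's data, so average utilization grows by a
# factor of n/(n-1). To stay under the backfill-full ratio afterwards,
# steady-state utilization should be at most ratio * (n-1)/n.
nodes = 8
full_ratio = 0.85

max_safe_utilization = full_ratio * (nodes - 1) / nodes
print(round(max_safe_utilization, 3))  # ~0.744, i.e. roughly 74% full
```

By that rough measure a cluster averaging ~65% full is fine, but
individual OSDs at ~80% are already past the safe point, which is
consistent with what happened during this recovery.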
>
>
> What we already do to slow down recovery:
>
> - We only add disks one by one
> - CFQ queueing is enabled
>
> Settings:
> osd_backfill_scan_min = 4
> osd_backfill_scan_max = 8
> osd_max_backfills = 2
> osd_recovery_max_active = 1
>
> osd_client_op_priority = 63
> osd_recovery_op_priority = 1
>
> osd_disk_thread_ioprio_class = idle
> osd_disk_thread_ioprio_priority = 3  # 0-7 within the idle class
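
For what it's worth, settings like these can also be applied to a
running cluster with injectargs, without restarting daemons (the values
shown are the ones from the list above):

```shell
# Push the recovery-throttling settings to all running OSDs at once.
# injectargs changes do not survive an OSD restart, so keep the same
# values in ceph.conf as well.
ceph tell osd.* injectargs '--osd_max_backfills 2 --osd_recovery_max_active 1'
```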

In addition to what Robert said, it sounds like you've done something
strange with your CRUSH map. Do you have separate trees for the SSDs
and hard drives, or are they both under the same host buckets?
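
For reference, a separated layout in the decompiled CRUSH map would look
roughly like the fragment below: a distinct root per media type, and a
rule per pool that takes that root. All names, IDs, and weights here are
illustrative, not taken from your cluster:

```
# Hypothetical CRUSH fragment with a dedicated tree for the SSDs
root ssd {
        id -10                          # illustrative bucket id
        alg straw
        hash 0                          # rjenkins1
        item node1-ssd weight 7.000     # one such item per host
}

rule ssd-pool {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
}
```

If instead both media types share host buckets under one root, a single
slow or full device can drag down PGs that clients on the "fast" pool
depend on.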

You'll also want to dig into more general configuration, such as your PG counts.
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


