Hi,

We experienced a very strange problem last week with our Ceph cluster and would like to ask for your opinions and advice.

Our dedicated Ceph OSD nodes run with:

Total platform:
- IO average: ~2500 writes/s, ~600 reads/s
- Replicas: 3x
- 2 pools:
  - SSD (~50 x 1TB)
  - Spinner (~36 x 2TB)
- 1024 PGs per pool (so 2048 in total)

Each node has:
- 24GB of RAM
- Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6-core)
- 2x 10Gbit connections (1x backend, 1x frontend access)
- ~7 SSD OSDs and ~5 HDD OSDs
- Ceph Hammer
- Ubuntu 14.04.3
- Stock kernel: 3.13.0-61-generic

(8 hosts in total; the monitors are dedicated machines.)

The SSD pool is our primary service, delivering RBD images to VMs. The HDD pool is for bulk/slow data. The problems we had were concentrated on the SSD pool.

We understand the rule of thumb that each OSD needs about 1GB of RAM per 1TB of storage. With 7 SSDs and 5 HDDs per node we cover about 17TB, so with our 24GB of RAM we are approaching that limit, but not yet over it.

The problem: we added a new SSD OSD to the cluster. Normally this causes somewhat higher load for a while but does not make the cluster unstable. At some point during the recovery, however, all client IO stalled or became extremely slow. While investigating, we discovered that the culprit was not the newly added SSD but another OSD that had been in the cluster for a long time. We saw slow/blocked requests, up to a few hundred of them.

Chronology of events:
- Added a new OSD.
- The cluster started to recover/move PGs; this went fine for a few hours.
- At some point the cluster became unstable, and we saw that one OSD was misbehaving (not the newly added one).
- We stopped that OSD (84% full), which solved the IO problems.
- We started the OSD again, which brought the same problems back; at some point the disk ended up in backfill_toofull (>85%).
- We reweighted the OSD to 0.8, after which the recovery went smoothly.

What could have caused this meltdown? Was it the bad-apple OSD at ~85% full? Was it a combination of things? What is the experience of other people on this list? We have put our questions below and would love to hear from other Ceph administrators.

Questions we have:
- Should we increase the RAM in the nodes?
- Should we enable trimming on the SSDs?
- How much headroom should we keep in terms of storage? (Currently the cluster is about 65% full; the least full disk is at ~29%, the most full at ~80%.)
- Would it be better to run separate clusters for the spinners and the SSDs?
- If only one disk is the culprit, why does it affect client I/O cluster-wide? Shouldn't Ceph discard/delay writes to this disk?
- What CPU capacity is advisable? We can add a second E5-2620 v2 @ 2.10GHz, going from 6 to 12 cores per node; is that worthwhile?

What we already do to slow down recovery:
- We only add disks one by one.
- CFQ queueing is enabled (required for the disk-thread ioprio settings below).
- We use the following settings (see also the P.S. for the CLI side of things):

  osd_backfill_scan_min = 4
  osd_backfill_scan_max = 8
  osd_max_backfills = 2
  osd_recovery_max_active = 1
  osd client op priority = 63
  osd recovery op priority = 1
  osd disk thread ioprio class = idle
  osd disk thread ioprio priority = 3    # 0-7 within the idle class

Kind regards,
Simon
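
P.S. For anyone who wants the concrete CLI side of the chronology above, this is roughly what it looks like; the OSD id is a placeholder, not our actual OSD, and `ceph osd df` assumes Hammer or later:

  # show slow/blocked requests and which OSDs they are stuck on
  ceph health detail

  # per-OSD utilisation, to spot the nearly full OSD
  ceph osd df

  # lower the override weight of the overfull OSD so data moves off it
  ceph osd reweight <osd-id> 0.8

  # throttle recovery further at runtime, without restarting OSDs
  ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'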
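
P.P.S. The backfill_toofull state at >85% lines up with the default osd_backfill_full_ratio of 0.85 (mon_osd_nearfull_ratio has the same default). If we read the docs correctly, it can be raised temporarily at runtime to let backfill drain an overfull OSD, e.g.:

  # stop-gap only: gives backfill some room while data is moved off the
  # overfull OSD; it does not fix the underlying imbalance
  ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.90'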