Hi,

We experienced a very strange problem last week with our Ceph cluster and would like to ask for your opinions and advice.

Our dedicated Ceph OSD nodes run with:

Total platform:
- IO average: ~2500 writes/s, ~600 reads/s
- Replicas: 3x
- 2 pools:
  - SSD (~50 x 1TB)
  - Spinner (~36 x 2TB)
- 1024 PGs per pool (so 2048 in total)

Each node has:
- 24GB of RAM
- Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6-core)
- 2x 10Gbit connections (1x backend, 1x frontend access)
- ~7 SSD OSDs and ~5 HDD OSDs
- Ceph Hammer
- Ubuntu 14.04.3
- Stock kernel: 3.13.0-61-generic

(8 hosts in total; the monitors are dedicated machines.)

The SSD pool is our primary service, delivering RBD images to VMs. The HDD pool is for bulk/slow data. The problems we had were concentrated on the SSD pool.

We understand the rule of thumb that each OSD needs about 1GB of RAM per 1TB of storage. With 7 SSDs and 5 HDDs per node we cover about 17TB, so with our 24GB of RAM we are approaching that limit, but not yet over it.

The problem: we added a new SSD OSD to the cluster. Normally this causes somewhat higher load for a while but does not make the cluster unstable. At some point during the recovery, however, all client IO stalled or became extremely slow. While investigating, we discovered that the culprit was not the newly added SSD but another OSD that had been in the cluster for a long time. We saw slow/blocked requests, up to a few hundred of them.

Chronology of events:
- Added a new OSD.
- The cluster started to recover/move PGs; this went fine for a few hours.
- At some point the cluster became unstable, and we saw that one OSD was misbehaving (not the newly added one).
- We stopped that OSD (84% full), which solved the IO problems.
- We started the OSD again, which brought the same problems back; at some point the disk ended up in backfill_toofull (>85%).
- We reweighted the OSD to 0.8, after which the recovery went smoothly.

What could have caused this meltdown? Was it the bad-apple OSD at ~85% full? Was it a combination of things? What is the experience of other people on this list? We have put our questions below and would love to hear from other Ceph administrators.

Questions we have:
- Should we increase the RAM in the nodes?
- Should we enable trimming on the SSDs?
- How much headroom should we keep in terms of storage? (Currently the cluster is about 65% full; the least full disk is at ~29%, the most full at ~80%.)
- Would it be better to run separate clusters for the spinners and the SSDs?
- If only one disk is the culprit, why does it affect client I/O cluster-wide? Shouldn't Ceph discard/delay writes to this disk?
- What CPU capacity is advisable? We can add a second E5-2620 v2 @ 2.10GHz, going from 6 to 12 cores per node; is that worthwhile?

What we already do to slow down recovery:
- We only add disks one by one.
- CFQ queueing is enabled (required for the disk-thread ioprio settings below).
- We use the following settings (see also the P.S. for the CLI side of things):

  osd_backfill_scan_min = 4
  osd_backfill_scan_max = 8
  osd_max_backfills = 2
  osd_recovery_max_active = 1
  osd client op priority = 63
  osd recovery op priority = 1
  osd disk thread ioprio class = idle
  osd disk thread ioprio priority = 3    # 0-7 within the idle class

Kind regards,
Simon
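
P.S. For anyone who wants the concrete CLI side of the chronology above, this is roughly what it looks like; the OSD id is a placeholder, not our actual OSD, and `ceph osd df` assumes Hammer or later:

  # show slow/blocked requests and which OSDs they are stuck on
  ceph health detail

  # per-OSD utilisation, to spot the nearly full OSD
  ceph osd df

  # lower the override weight of the overfull OSD so data moves off it
  ceph osd reweight <osd-id> 0.8

  # throttle recovery further at runtime, without restarting OSDs
  ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'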
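
P.P.S. The backfill_toofull state at >85% lines up with the default osd_backfill_full_ratio of 0.85 (mon_osd_nearfull_ratio has the same default). If we read the docs correctly, it can be raised temporarily at runtime to let backfill drain an overfull OSD, e.g.:

  # stop-gap only: gives backfill some room while data is moved off the
  # overfull OSD; it does not fix the underlying imbalance
  ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.90'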