Hey,

I've been running a Ceph cluster of arm64 SoCs on Luminous for the past year or so with no major problems. I recently upgraded to 14.2.7, and the stability of the cluster immediately suffered: any mon activity was subject to long pauses, and the cluster would hang frequently. Watching ceph -s, the mons appeared to be calling elections constantly - a leader rarely lasted longer than a minute or two.

Looking further at the mons, two of the three of which run on the relatively slow SD card storage of these SoCs, I saw them completely saturating the root device with writes, and the logs show rocksdb constantly running compactions. I temporarily moved these mons to devices with better-performing IO (not a good permanent home for them, since those hosts are also CephFS clients) and measured a sustained write rate of ~50 MB/s. That seems excessive - at least an order of magnitude more than anything I saw under Luminous - and it isn't kind to SSD lifespan either.

Since downgrading is not an option here, is there anything I can look at to figure out what exactly the mons are writing, and how to prevent such heavy load? I seem to remember some bug related to telemetry, but I can't find it in the list archives.
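
For reference, here's roughly how I've been watching the election churn (jq is just for readability; quorum_status reports the current leader and election epoch, and a steadily climbing epoch means the mons keep re-electing):

    # Poll quorum status; a rapidly climbing election_epoch means
    # the mons keep calling new elections.
    watch -n 5 'ceph quorum_status -f json-pretty | jq "{leader: .quorum_leader_name, epoch: .election_epoch}"'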
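
To try to quantify the store churn, my plan is to track the on-disk store size and the rocksdb perf counters, something like the following - the path assumes the default mon data layout, and mon.a is a placeholder for one of my mon ids:

    # On-disk size of the mon's rocksdb store
    du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db

    # rocksdb perf counters via the mon's admin socket
    ceph daemon mon.$(hostname -s) perf dump rocksdb

    # One-off manual compaction of a mon's store
    ceph tell mon.a compact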
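
And to see what is actually being written, I figured I'd temporarily crank the mon/paxos debug levels for a few minutes and double-check which mgr modules are enabled, since I half-suspect telemetry - if anyone knows better knobs to turn, I'm all ears:

    # Temporarily raise mon + paxos logging (noisy; revert afterwards)
    ceph config set mon debug_mon 10/10
    ceph config set mon debug_paxos 10/10

    # Revert once done
    ceph config rm mon debug_mon
    ceph config rm mon debug_paxos

    # List enabled mgr modules, and switch telemetry off if it's on
    ceph mgr module ls
    ceph telemetry off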