Hi all,

We've been having trouble with our Ceph cluster for over a week now. A short summary of our situation:

- The original cluster had 10 OSD nodes, each with 16 OSDs
- An expansion was necessary, so another 6 nodes were added
- Version: 14.2.11

Last week we saw heavily loaded OSD servers; with the help received here we identified the disk load as being caused by compaction of the RocksDB. As mentioned, taking a disk offline and running

    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/xxxx compact

does take the load away temporarily (the exact sequence we use is in the P.S. below).

Most of the new disks still have a weight of 0, as we want to get the system stable first, but there is something I simply don't understand. Even when setting the flags noout, norecover, nobackfill and norebalance prior to taking a disk offline for compaction, we still see a rise in degraded PGs after the OSD is marked "up" again.

Last night a flapping OSD was also temporarily marked "down", I assume because it was heavily loaded, again causing a rise in degraded PGs. I know there is a "nodown" flag, but I've never used it. Reading the docs, they state these flags are "temporary" and that the blocked action will be performed afterwards anyway...

So I have a few questions:

1. Why is the cluster marking PGs as "degraded" and reporting degraded data redundancy, when this was not the case before? The count keeps rising (2196398/10339524249 objects degraded (0.021%)) and I simply cannot understand why.
2. The "nodown" flag: can I use it to prevent the flapping? I don't want to make the mess any deeper. As far as I understand, it should help in our case, since the OSDs are heavily loaded.
3. Is it a good idea to start adding the other disks as well (slowly increasing their weight)?

Thanks,
Kristof
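
P.S. For reference, the full sequence we run for each offline compaction looks roughly like this (the OSD id and data path are placeholders):

    # quiesce recovery/rebalancing before taking the OSD down
    ceph osd set noout
    ceph osd set norecover
    ceph osd set nobackfill
    ceph osd set norebalance

    # stop the OSD daemon and compact its RocksDB offline
    systemctl stop ceph-osd@xxxx
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/xxxx compact

    # bring the OSD back up, then clear the flags again
    systemctl start ceph-osd@xxxx
    ceph osd unset norebalance
    ceph osd unset nobackfill
    ceph osd unset norecover
    ceph osd unset noout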
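
P.P.S. For question 3, by "slowly increasing their weight" I mean stepping the CRUSH weight up in small increments and letting the cluster settle in between, along the lines of (osd id and weight values illustrative only):

    ceph osd crush reweight osd.xxxx 0.2
    # wait for backfill to finish (check with: ceph -s), then step up again
    ceph osd crush reweight osd.xxxx 0.5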