Hi all,

We have a Ceph cluster which has been expanded from 10 to 16 nodes. Each node has between 14 and 16 OSDs, of which 2 are NVMe drives. Most disks (except the NVMes) are 16 TB. The expansion to 16 nodes went OK, but we configured the system to prevent automatic rebalancing onto the new disks (their weight was set to 0) so we could control the expansion.

We started adding 6 disks last week (1 disk on each new node), which didn't cause many issues. When the Ceph status indicated the degraded-PG recovery was almost finished, we added 2 more disks on each new node. All seemed to go fine, until yesterday morning: I/O towards the system was slowing down. Logging into the nodes, we could see the OSD daemons consuming the CPU, with load averages approaching 10 (!). Neither the RGWs, the monitors, nor the other involved servers have CPU issues (except for the management server, which is fighting with Prometheus), so the latency seems to be related to the OSD hosts. All of the hosts are interconnected with 25 Gbit links; no bottlenecks are being reached on the network either.

An important piece of information: we are using erasure coding (6+3), and we do have a lot of small files.

The current health detail indicates degraded data redundancy, with 1192911/103387889228 objects degraded (1 pg degraded, 1 pg undersized). Diving into the historic ops of an OSD, we can see that the main latency sits between the events "queued_for_pg" and "reached_pg" (averaging roughly 3 seconds).

As the system load is quite high, I assume the systems are busy recalculating the erasure-code chunks for the newly added disks (though I'm not sure), but I was wondering how I can better fine-tune the system or pinpoint the exact bottleneck. Latency towards the disks doesn't seem to be an issue at first sight.

We are running Ceph 14.2.11.

Who can give me some thoughts on how I can better pinpoint the bottleneck?
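For what it's worth, here is a small sketch of how we pulled the queued_for_pg → reached_pg numbers out of `ceph daemon osd.N dump_historic_ops` JSON, rather than eyeballing individual ops. The field layout (`ops` → `type_data` → `events`) is what our Nautilus OSDs emit; the sample op at the bottom is fabricated purely to show the shape:

```python
import json
from datetime import datetime

def queue_latencies(historic_ops_json, start="queued_for_pg", end="reached_pg"):
    """Return per-op latency in seconds between two events, parsed from
    the JSON printed by `ceph daemon osd.N dump_historic_ops`."""
    data = json.loads(historic_ops_json)
    fmt = "%Y-%m-%d %H:%M:%S.%f"  # timestamp format in the events list
    latencies = []
    for op in data.get("ops", []):
        # Map event name -> timestamp for this op.
        events = {e["event"]: datetime.strptime(e["time"], fmt)
                  for e in op.get("type_data", {}).get("events", [])}
        if start in events and end in events:
            latencies.append((events[end] - events[start]).total_seconds())
    return latencies

# Fabricated single-op sample, times illustrative only:
sample = json.dumps({"ops": [{"type_data": {"events": [
    {"time": "2020-11-10 08:00:00.000000", "event": "queued_for_pg"},
    {"time": "2020-11-10 08:00:03.100000", "event": "reached_pg"},
]}}]})
print(queue_latencies(sample))  # → [3.1]
```

With the full list you can compute an average or a histogram per OSD and compare the freshly weighted-in disks against the old ones.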
Thanks,
Kristof
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx