Yes, some disks are spiking near 100%... The delay I see with iostat
(r_await) seems to be synchronised with the delays between the
queued_for_pg and reached_pg events. The NVMe disks are not spiking, just
the spinner disks. I know RocksDB is only partially on the NVMe. The
read-ahead is also 128 KB (OS level) for the spinner disks. As we are
dealing with smaller files, this might also degrade performance.
I'm still investigating, but I'm wondering if the system is also reading
from disk to find the KV pairs.

On Tue, 6 Oct 2020 at 11:23, Igor Fedotov <ifedotov@xxxxxxx> wrote:

> Hi Kristof,
>
> are you seeing high (around 100%) utilization of the OSDs' disks (main or
> DB ones) along with slow ops?
>
> Thanks,
>
> Igor
>
> On 10/6/2020 11:09 AM, Kristof Coucke wrote:
> > Hi all,
> >
> > We have a Ceph cluster which has been expanded from 10 to 16 nodes.
> > Each node has between 14 and 16 OSDs, of which 2 are NVMe disks.
> > Most disks (except the NVMes) are 16 TB.
> >
> > The expansion to 16 nodes went OK, but we had configured the system to
> > prevent automatic rebalancing towards the new disks (their weight was
> > set to 0) so we could control the expansion.
> >
> > We started adding 6 disks last week (1 disk on each new node), which
> > didn't cause many issues.
> > When the Ceph status indicated the PG degradation was almost resolved,
> > we added 2 more disks on each node.
> >
> > All seemed to go fine, till yesterday morning... IOs towards the system
> > were slowing down.
> >
> > Diving into the nodes, we could see that the OSD daemons are consuming
> > the CPU power, resulting in average CPU loads going near 10 (!).
> >
> > Neither the RGWs nor the monitors nor the other involved servers are
> > having CPU issues (except for the management server, which is fighting
> > with Prometheus), so the latency seems to be related to the OSD hosts.
> > All of the hosts are interconnected with 25 Gbit links, and no
> > bottlenecks are being reached on the network either.
> >
> > An important piece of information: we are using erasure coding (6/3),
> > and we do have a lot of small files...
> > The current health detail indicates degraded data redundancy, with
> > 1192911/103387889228 objects degraded (1 pg degraded, 1 pg undersized).
> >
> > Diving into the historic ops of an OSD, we can see that the main
> > latency is found between the events "queued_for_pg" and "reached_pg"
> > (averaging +/- 3 secs).
> >
> > As the system load is quite high, I assume the systems are busy
> > recalculating the code chunks to make use of the new disks we've added
> > (though I'm not sure), but I was wondering how I can better fine-tune
> > the system or pinpoint the exact bottleneck.
> > Latency towards the disks doesn't seem to be an issue at first sight...
> >
> > We are running Ceph 14.2.11.
> >
> > Who can give me some thoughts on how I can better pinpoint the
> > bottleneck?
> >
> > Thanks
> >
> > Kristof
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
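
P.S. For reference, this is roughly how I'm checking whether the KV lookups
are actually hitting the spinners rather than the NVMe DB devices, and what
the current read-ahead is. This is only a sketch: osd.12 and /dev/sdb are
placeholders, substitute your own OSD ids and block devices.

  # Is BlueFS reporting DB data spilled over onto the slow (spinner) device?
  # A non-zero slow_used_bytes means some RocksDB reads will hit the HDD.
  ceph daemon osd.12 perf dump | jq '.bluefs | {db_used_bytes, slow_used_bytes}'

  # Current OS-level read-ahead on the data disk, in KiB (128 in our case).
  cat /sys/block/sdb/queue/read_ahead_kb

  # Correlate spinner latency/utilisation with the slow ops on that OSD:
  iostat -xmt 5                         # watch r_await and %util on the HDDs
  ceph daemon osd.12 dump_historic_ops  # check queued_for_pg -> reached_pg gaps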