Re: Slow ops on OSDs

I presume this might be caused by massive KV data removal initiated after (or during) the data rebalance. We've seen multiple complaints about RocksDB's performance being negatively affected by pool/PG removal, and I expect data rebalancing might suffer from the same...

You might want to run a manual DB compaction using ceph-kvstore-tool for every affected OSD to try to work around the issue. This will likely only help temporarily if data removal is still ongoing, though.
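For reference, a minimal sketch of what that offline compaction could look like on a BlueStore OSD (assuming the default data path and a placeholder OSD id; the OSD has to be stopped while ceph-kvstore-tool runs):

    # stop the OSD so ceph-kvstore-tool gets exclusive access to the store
    systemctl stop ceph-osd@<id>
    # compact the RocksDB living on BlueStore
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
    # bring the OSD back up
    systemctl start ceph-osd@<id>

You'd want to do this one OSD (or one failure domain) at a time so the cluster doesn't lose too much redundancy while OSDs are down.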


On 10/6/2020 12:41 PM, Kristof Coucke wrote:
> Yes, some disks are spiking near 100%... The delay I see in iostat
> (r_await) seems to be synchronised with the delays between the
> queued_for_pg and reached_pg events.
> The NVMe disks are not spiking, just the spinner disks.
>
> I know RocksDB is only partially on the NVMe. The read-ahead (at the OS
> level) is also 128 KB for the spinner disks. As we are dealing with
> smaller files, this might also lead to a decrease in performance.

Can you share the amount of DB data spilled over to the spinners? You can learn this from the "bluefs" section in the performance counters dump...
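In case it helps, a quick way to pull those numbers (OSD id is a placeholder, and this assumes you run it on the node hosting that OSD):

    # dump the OSD's performance counters via the admin socket and
    # look at the "bluefs" section of the JSON output
    ceph daemon osd.<id> perf dump | grep -A 30 '"bluefs"'

In that section, "slow_used_bytes" (compared with "db_used_bytes") should show how much of the DB has spilled over from the NVMe to the spinner.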


> I'm still investigating, but I'm wondering if the system is also
> reading from disk to find the KV pairs.



On Tue, 6 Oct 2020 at 11:23, Igor Fedotov <ifedotov@xxxxxxx> wrote:

    Hi Kristof,

    are you seeing high (around 100%) utilization of the OSDs' disks
    (main or DB ones) along with the slow ops?


    Thanks,

    Igor

    On 10/6/2020 11:09 AM, Kristof Coucke wrote:
    > Hi all,
    >
    > We have a Ceph cluster which has been expanded from 10 to 16 nodes.
    > Each node has between 14 and 16 OSDs, of which 2 are NVMe disks.
    > Most disks (except the NVMe's) are 16 TB large.
    >
    > The expansion to 16 nodes went ok, but we had configured the system to
    > prevent automatic rebalancing towards the new disks (weight was set to
    > 0) so we could control the expansion.
    >
    > We started adding 6 disks last week (1 disk on each new node), which
    > didn't give a lot of issues.
    > When the Ceph status indicated the PG degradation was almost finished,
    > we added 2 disks on each node again.
    >
    > All seemed to go fine, till yesterday morning... IOs towards the system
    > were slowing down.
    >
    > Diving onto the nodes, we could see that the OSD daemons are consuming
    > the CPU power, resulting in average CPU loads going near 10 (!).
    >
    > Neither the RGWs nor the monitors nor the other involved servers are
    > having CPU issues (except for the management server, which is fighting
    > with Prometheus), so the latency seems to be related to the OSD hosts.
    > All of the hosts are interconnected with 25 Gbit connections; no
    > bottlenecks are reached on the network either.
    >
    > Important piece of information: we are using erasure coding (6/3), and
    > we do have a lot of small files...
    > The current health detail indicates degraded data redundancy, where
    > 1192911/103387889228 objects are degraded (1 pg degraded, 1 pg
    > undersized).
    >
    > Diving into the historic ops of an OSD, we can see that the main
    > latency is found between the events "queued_for_pg" and "reached_pg"
    > (averaging +/- 3 secs).
    >
    > As the system load is quite high, I assume the systems are busy
    > recalculating the code chunks to use the new disks we've added (though
    > I'm not sure), but I was wondering how I can better fine-tune the
    > system or pinpoint the exact bottleneck.
    > Latency towards the disks doesn't seem to be an issue at first sight...
    >
    > We are running Ceph 14.2.11.
    >
    > Who can give me some thoughts on how I can better pinpoint the
    > bottleneck?
    >
    > Thanks
    >
    > Kristof
    > _______________________________________________
    > ceph-users mailing list -- ceph-users@xxxxxxx
    > To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



