Re: Slow ops on OSDs

Igor Fedotov <ifedotov@xxxxxxx> · Tue, 6 Oct 2020 14:16:24 +0300

Unfortunately, currently available Ceph releases lack any means to
monitor KV data removal. The only way is to set debug_bluestore to 20
(for a short period of time, e.g. 1 min) and inspect the OSD log for
_remove/_do_remove/_omap_clear calls. Plenty of them within the
inspected period means removals are ongoing.
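
As a rough illustration of the log inspection described above, here is a
minimal Python sketch that counts those three calls in a captured OSD log.
The helper name count_removal_calls is made up for this example, and the
exact layout of a debug-20 log line is not assumed: the sketch only looks
for the function names as whole tokens.

```python
import re
from collections import Counter

# Function names to look for in an OSD log captured at debug_bluestore 20.
REMOVAL_CALLS = ("_do_remove", "_omap_clear", "_remove")

# Match a name only as a whole token, so "_remove" is not counted inside
# "_do_remove" or inside a longer name such as "_remove_collection".
_PATTERN = re.compile(
    r"(?<![A-Za-z0-9_])("
    + "|".join(re.escape(name) for name in REMOVAL_CALLS)
    + r")(?![A-Za-z0-9_])"
)


def count_removal_calls(lines):
    """Return a Counter of removal-related call names seen in `lines`.

    `lines` is any iterable of strings, e.g. an open log file.
    """
    counts = Counter()
    for line in lines:
        for match in _PATTERN.finditer(line):
            counts[match.group(1)] += 1
    return counts
```

It could be fed an open log file, e.g. count_removal_calls(open(path,
errors="replace")), where path points at the log of the OSD under
inspection. High counts over a one-minute capture would point towards
ongoing removals; what counts as "high" is a judgment call.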

Weak evidence for the hypothesis would be a non-zero "numpg_removing"
performance counter...
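
A small Python sketch of that counter check, assuming the perf counters
have been saved as JSON text (for example from "ceph daemon osd.<id> perf
dump" on the OSD host). The helper names find_counter and removing_pgs are
made up for this example, and the sketch searches the whole dump rather
than assuming which section the counter lives in.

```python
import json


def find_counter(node, name, path=()):
    """Yield (path, value) for every key called `name` in nested dicts."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == name:
                yield path + (key,), value
            else:
                yield from find_counter(value, name, path + (key,))


def removing_pgs(perf_dump_text):
    """Return the summed "numpg_removing" value, or None if it is absent."""
    dump = json.loads(perf_dump_text)
    values = [
        value
        for _, value in find_counter(dump, "numpg_removing")
        if isinstance(value, (int, float))
    ]
    return sum(values) if values else None
```

A result above zero would be the weak hint mentioned above; None means
the counter was not found in the dump at all, which is different from it
being zero.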

On 10/6/2020 2:06 PM, Kristof Coucke wrote:
Is there a way I can check whether this process is causing the
performance issues?

On Tue, 6 Oct 2020 at 13:05, Igor Fedotov <ifedotov@xxxxxxx> wrote:

    On 10/6/2020 1:04 PM, Kristof Coucke wrote:
    Another strange thing is going on:

    No client software is using the system any longer, so we would
    expect that all IOs are related to the recovery (fixing of the
    degraded PG).
    However, the disks that are reaching high IO are not members of
    the PGs that are being fixed.

    So, something is heavily using the disks, but I can't immediately
    find the process. I've read that there can be old client
    processes that keep connecting to an OSD to retrieve data for a
    specific PG while that PG is no longer available on that disk.

    I bet it's rather PG removal happening in the background...

    On Tue, 6 Oct 2020 at 11:41, Kristof Coucke
    <kristof.coucke@xxxxxxxxx> wrote:

        Yes, some disks are spiking near 100%... The delay I see in
        iostat (r_await) seems to be synchronised with the delays
        between the queued_for_pg and reached_pg events.
        The NVMe disks are not spiking, just the spinner disks.

        I know RocksDB is only partially on the NVMe. The read-ahead
        is also 128 KB (OS level) for the spinner disks. As we are
        dealing with smaller files, this might also lead to a
        decrease in performance.

        I'm still investigating, but I'm wondering if the system is
        also reading from disk to find the KV pairs.

        On Tue, 6 Oct 2020 at 11:23, Igor Fedotov <ifedotov@xxxxxxx> wrote:

            Hi Kristof,

            are you seeing high (around 100%) utilization of the OSDs'
            disks (main or DB ones) along with the slow ops?

            Thanks,

            Igor

            On 10/6/2020 11:09 AM, Kristof Coucke wrote:
            > Hi all,
            >
            > We have a Ceph cluster which has been expanded from 10 to 16
            > nodes. Each node has between 14 and 16 OSDs, of which 2 are
            > NVMe disks. Most disks (except the NVMe ones) are 16 TB.
            >
            > The expansion to 16 nodes went OK, but we've configured the
            > system to prevent auto-balancing towards the new disks (weight
            > was set to 0) so we could control the expansion.
            >
            > We started adding 6 disks last week (1 disk on each new node),
            > which didn't give a lot of issues.
            > When the Ceph status indicated that recovery of the degraded
            > PGs was almost finished, we added 2 more disks on each node.
            >
            > All seemed to go fine until yesterday morning... IOs towards
            > the system were slowing down.
            >
            > Diving into the nodes, we could see that the OSD daemons are
            > consuming the CPU power, resulting in average CPU loads going
            > near 10 (!).
            >
            > Neither the RGWs nor the monitors nor the other involved
            > servers are having CPU issues (except for the management
            > server, which is fighting with Prometheus), so the latency
            > seems to be related to the OSD hosts.
            > All of the hosts are interconnected with 25Gbit connections;
            > no bottlenecks are reached on the network either.
            >
            > Important piece of information: We are using erasure coding
            > (6/3), and we do have a lot of small files...
            > The current health detail indicates degraded data redundancy
            > where 1192911/103387889228 objects are degraded (1 pg
            > degraded, 1 pg undersized).
            >
            > Diving into the historic ops of an OSD, we can see that the
            > main latency is found between the events "queued_for_pg" and
            > "reached_pg" (averaging +/- 3 secs).
            >
            > As the system load is quite high, I assume the systems are
            > busy recalculating the code chunks for using the new disks
            > we've added (though I'm not sure), but I was wondering how I
            > can better fine-tune the system or pinpoint the exact
            > bottleneck.
            > Latency towards the disks doesn't seem to be an issue at first
            > sight...
            >
            > We are running Ceph 14.2.11
            >
            > Who can give me some thoughts on how I can better pinpoint the
            > bottleneck?
            >
            > Thanks
            >
            > Kristof
            > _______________________________________________
            > ceph-users mailing list -- ceph-users@xxxxxxx
            > To unsubscribe send an email to ceph-users-leave@xxxxxxx
