Re: Slow ops on OSDs

Is there a way I can check whether this process is actually what is causing
the performance issues?
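
The only thing I've come up with so far is polling the admin socket of one of
the busy OSDs and watching the PG counters. A rough sketch of what I mean,
assuming the numpg_removing counter is exposed under the "osd" section of
"perf dump" in 14.2.11 (run locally on the OSD host):

#!/usr/bin/env python3
# Sketch: check whether an OSD is still deleting PGs in the background by
# reading its admin socket perf counters.
# Assumes "numpg_removing" exists under the "osd" section (Nautilus).
import json
import subprocess
import sys

def perf_dump(osd_id):
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.{}".format(osd_id), "perf", "dump"])
    return json.loads(out)

if __name__ == "__main__":
    osd_id = sys.argv[1]
    osd = perf_dump(osd_id).get("osd", {})
    print("numpg          :", osd.get("numpg"))
    # A value > 0 here means PG deletion is still running on this OSD.
    print("numpg_removing :", osd.get("numpg_removing"))

If numpg_removing stays above zero on exactly the disks that show the high
r_await, that would point at the deletion rather than at the recovery.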


On Tue, 6 Oct 2020 at 13:05, Igor Fedotov <ifedotov@xxxxxxx> wrote:

>
> On 10/6/2020 1:04 PM, Kristof Coucke wrote:
>
> Another strange thing is going on:
>
> No client software is using the system any longer, so we would expect that
> all IOs are related to the recovery (fixing of the degraded PG).
> However, the disks that are showing high IO are not members of the PGs
> that are being fixed.
>
> So, something is heavily using the disks, but I can't immediately find the
> process. I've read that there can be old client processes that keep
> connecting to an OSD to retrieve data for a specific PG even though that PG
> is no longer available on that disk.
>
>
> I bet it's rather PG removal happening in the background....
>
>
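
If that's what it is, my idea would be to slow the deletion down rather than
stop it. A rough sketch of what I have in mind, assuming the
osd_delete_sleep_hdd option is available in this 14.2.11 build (the option
name is my assumption):

#!/usr/bin/env python3
# Sketch: throttle background PG deletion cluster-wide and read the value back
# from one OSD's admin socket to confirm it took effect.
# Assumes osd_delete_sleep_hdd exists in this Nautilus build.
import json
import subprocess

def set_delete_sleep(seconds):
    # Centralized config; applies to all OSDs.
    subprocess.check_call(
        ["ceph", "config", "set", "osd", "osd_delete_sleep_hdd", str(seconds)])

def get_delete_sleep(osd_id):
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.{}".format(osd_id),
         "config", "get", "osd_delete_sleep_hdd"])
    return json.loads(out)

if __name__ == "__main__":
    set_delete_sleep(1)          # pause between deletion transactions
    print(get_delete_sleep(0))   # osd.0 is just an example; run on its host
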
> On Tue, 6 Oct 2020 at 11:41, Kristof Coucke
> <kristof.coucke@xxxxxxxxx> wrote:
>
>> Yes, some disks are spiking near 100%... The delay I see with the iostat
>> (r_await) seems to be synchronised with the delays between queued_for_pg
>> and reached_pg events.
>> The NVMe disks are not spiking, just the spinner disks.
>>
>> I know the RocksDB data is only partially on the NVMe. The read-ahead is
>> also 128 KB (OS level) for the spinner disks. As we are dealing with smaller
>> files, this might also contribute to the drop in performance.
>>
>> I'm still investigating, but I'm wondering if the system is also reading
>> from the spinner disks to look up the KV pairs.
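
One thing I want to verify is whether RocksDB is actually spilling over from
the NVMe onto the spinner. A sketch of the check I had in mind, assuming the
db_used_bytes / slow_used_bytes counters under the "bluefs" section of
"perf dump" (Nautilus):

#!/usr/bin/env python3
# Sketch: show how much of RocksDB lives on the fast DB device vs. the slow
# (spinner) device for one OSD, via its admin socket.
# Assumes "perf dump" exposes db_used_bytes / slow_used_bytes under "bluefs".
import json
import subprocess
import sys

GiB = 1024 ** 3

def bluefs_stats(osd_id):
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.{}".format(osd_id), "perf", "dump"])
    return json.loads(out)["bluefs"]

if __name__ == "__main__":
    b = bluefs_stats(sys.argv[1])
    print("DB device used  : {:.1f} GiB of {:.1f} GiB".format(
        b["db_used_bytes"] / GiB, b["db_total_bytes"] / GiB))
    # Any non-zero slow_used_bytes means part of RocksDB sits on the spinner,
    # so KV lookups can indeed hit the slow disk.
    print("slow device used: {:.1f} GiB".format(b["slow_used_bytes"] / GiB))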
>>
>>
>>
>> On Tue, 6 Oct 2020 at 11:23, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>
>>> Hi Kristof,
>>>
>>> are you seeing high (around 100%) utilization on the OSDs' disks (main or
>>> DB ones) along with the slow ops?
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>> On 10/6/2020 11:09 AM, Kristof Coucke wrote:
>>> > Hi all,
>>> >
>>> > We have a Ceph cluster which has been expanded from 10 to 16 nodes.
>>> > Each node has between 14 and 16 OSDs of which 2 are NVMe disks.
>>> > Most disks (except the NVMes) are 16 TB.
>>> >
>>> > The expansion to 16 nodes went OK, but we configured the system to
>>> > prevent automatic rebalancing onto the new disks (weight was set to 0) so
>>> > we could control the expansion.
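
For completeness, the controlled ramp-up we had in mind is roughly the sketch
below; the OSD id, target weight and step size are made up for illustration,
and it simply waits for HEALTH_OK between weight increases:

#!/usr/bin/env python3
# Sketch: bring a freshly added OSD (crush weight 0) into the cluster in small
# weight steps, waiting for the cluster to become healthy again in between.
# The OSD id, target weight and step size are illustrative only.
import subprocess
import time

def health_ok():
    out = subprocess.check_output(["ceph", "health"]).decode()
    return out.startswith("HEALTH_OK")

def ramp_up(osd_id, target=14.5, step=1.0):
    weight = 0.0
    while weight < target:
        weight = min(weight + step, target)
        subprocess.check_call(
            ["ceph", "osd", "crush", "reweight",
             "osd.{}".format(osd_id), str(weight)])
        while not health_ok():   # wait for backfill/recovery to settle
            time.sleep(60)

if __name__ == "__main__":
    ramp_up(42)   # hypothetical new OSD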
>>> >
>>> > We started adding 6 disks last week (1 disk on each new node), which
>>> > didn't give a lot of issues.
>>> > When the Ceph status indicated that the degraded PG recovery was almost
>>> > finished, we added another 2 disks on each node.
>>> >
>>> > All seemed to go fine until yesterday morning... IOs towards the system
>>> > were slowing down.
>>> >
>>> > Diving into the nodes, we could see that the OSD daemons are consuming the
>>> > CPU power, resulting in average CPU loads going near 10 (!).
>>> >
>>> > Neither the RGWs nor the monitors nor the other involved servers are
>>> > having CPU issues (except for the management server, which is fighting
>>> > with Prometheus), so the latency seems to be related to the OSD hosts.
>>> > All of the hosts are interconnected with 25 Gbit links; no bottlenecks are
>>> > being hit on the network either.
>>> >
>>> > Important piece of information: we are using erasure coding (6/3), and we
>>> > do have a lot of small files...
>>> > The current health detail indicates degraded data redundancy, with
>>> > 1192911/103387889228 objects degraded (1 pg degraded, 1 pg undersized).
>>> >
>>> > Diving into the historic ops of an OSD, we can see that the main latency
>>> > sits between the "queued_for_pg" and "reached_pg" events (averaging around
>>> > 3 seconds).
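
The way I pulled that number is roughly the sketch below: it parses
dump_historic_ops and prints the gap between the two events. It assumes the
Nautilus JSON layout (type_data.events) and space-separated timestamps:

#!/usr/bin/env python3
# Sketch: measure the queued_for_pg -> reached_pg gap in an OSD's historic ops.
# Assumes each op carries type_data.events entries with "time" and "event";
# the timestamp format may differ between releases.
import json
import subprocess
import sys
from datetime import datetime

def parse_ts(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")

def main(osd_id):
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.{}".format(osd_id), "dump_historic_ops"])
    for op in json.loads(out).get("ops", []):
        events = {e["event"]: parse_ts(e["time"])
                  for e in op.get("type_data", {}).get("events", [])}
        if "queued_for_pg" in events and "reached_pg" in events:
            gap = (events["reached_pg"] - events["queued_for_pg"]).total_seconds()
            if gap > 1.0:   # only show the slow ones
                print("{:8.3f}s  {}".format(gap, op.get("description", "")[:80]))

if __name__ == "__main__":
    main(sys.argv[1])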
>>> >
>>> > As the system load is quite high, I assume the systems are busy
>>> > recalculating the erasure code chunks for the new disks we've added
>>> > (though I'm not sure), but I was wondering how I can better fine-tune the
>>> > system or pinpoint the exact bottleneck.
>>> > Latency towards the disks doesn't seem to be an issue at first sight...
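
What I'm checking first is which throttles the OSDs are actually running
with. A quick sketch over the admin socket, using the recovery/backfill
option names I believe exist in 14.2.11:

#!/usr/bin/env python3
# Sketch: print the recovery/backfill throttles one OSD is running with,
# via its admin socket. Option names are the usual Nautilus ones; adjust if
# your build differs.
import json
import subprocess
import sys

OPTIONS = [
    "osd_max_backfills",
    "osd_recovery_max_active",
    "osd_recovery_sleep_hdd",
    "osd_recovery_sleep_hybrid",
]

def config_get(osd_id, option):
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.{}".format(osd_id), "config", "get", option])
    return json.loads(out)[option]

if __name__ == "__main__":
    osd_id = sys.argv[1]
    for opt in OPTIONS:
        print("{:28s} {}".format(opt, config_get(osd_id, opt)))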
>>> >
>>> > We are running Ceph 14.2.11
>>> >
>>> > Who can give me some thoughts on how I can better pinpoint the
>>> > bottleneck?
>>> >
>>> > Thanks
>>> >
>>> > Kristof
>>>
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


