Re: Slow ops during index pool recovery causes cluster performance drop to 1%

"Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx> · Sun, 3 Nov 2024 05:08:10 +0000

I found another thread very similar to this one that suggested setting osd_async_recovery_min_cost=0; however, that still didn't help.

This time I have an index pool OSD (363) that generates slow ops from the beginning of the recovery until the end of it (read latency on this OSD spikes sky-high, to 150 ms).

What seems weird is the pg acting set:
PG_STAT  STATE                                              UP             UP_PRIMARY  ACTING     ACTING_PRIMARY
26.509   active+recovery_wait+undersized+degraded+remapped  [363,762,744]  363         [363,744]  363
26.4dd   active+recovery_wait+undersized+degraded+remapped  [763,522,363]  763         [363,522]  363
26.120   active+undersized+degraded+remapped+backfill_wait  [363,109,274]  363         [363,109]  363
26.6c    active+recovering+undersized+degraded+remapped     [363,273,772]  363         [363,772]  363
26.222   active+recovery_wait+undersized+degraded+remapped  [597,363,152]  597         [597,363]  597
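
To make the UP/ACTING mismatch explicit, the rows above can be diffed programmatically. This is only a minimal Python sketch: the rows are pasted in by hand, and the helper names `parse_set` and `missing_from_acting` are made up for illustration.

```python
# Rows copied from the pg listing above: PG, UP set, ACTING set.
ROWS = """
26.509 [363,762,744] [363,744]
26.4dd [763,522,363] [363,522]
26.120 [363,109,274] [363,109]
26.6c  [363,273,772] [363,772]
26.222 [597,363,152] [597,363]
"""


def parse_set(text):
    """Turn '[363,762,744]' into a list of OSD ids."""
    return [int(x) for x in text.strip("[]").split(",")]


def missing_from_acting(rows):
    """Map each PG to the OSDs that are in UP but not in ACTING."""
    result = {}
    for line in rows.strip().splitlines():
        pg, up, acting = line.split()
        acting_ids = set(parse_set(acting))
        result[pg] = [osd for osd in parse_set(up) if osd not in acting_ids]
    return result


if __name__ == "__main__":
    for pg, osds in missing_from_acting(ROWS).items():
        print(pg, "is missing", osds)
```

For these five rows, each PG is missing exactly one OSD from its acting set (762, 763, 274, 273 and 152 respectively), and osd.363 appears in every acting set.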

It doesn't look good that the acting set is completely missing the OSDs that have just been upgraded from Octopus to Quincy. But with size 3 and min_size 2, I think writes to those PGs should still be possible and should work properly.

BTW: osd_recovery_max_active, osd_recovery_op_priority and osd_max_backfills are all set to 1.
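
For reference, the throttles mentioned in this message can be written as `ceph config set` calls. The Python sketch below only builds the command lines and does not run them. Applying the values to all OSDs through the `osd` section is an assumption, since the message does not say how they were set, and `config_commands` is an illustrative helper name.

```python
import shlex

# Values as stated in this thread; the "osd" target section is an assumption.
RECOVERY_THROTTLES = {
    "osd_recovery_max_active": "1",
    "osd_recovery_op_priority": "1",
    "osd_max_backfills": "1",
    "osd_async_recovery_min_cost": "0",
}


def config_commands(settings, who="osd"):
    """Build one `ceph config set <who> <option> <value>` argv per setting."""
    return [["ceph", "config", "set", who, opt, val] for opt, val in settings.items()]


if __name__ == "__main__":
    for argv in config_commands(RECOVERY_THROTTLES):
        # Printed only; hand argv to subprocess.run(argv, check=True) to apply.
        print(shlex.join(argv))
```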

________________________________
From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Sent: Saturday, November 2, 2024 6:45 AM
To: Ceph Users <ceph-users@xxxxxxx>
Subject: Slow ops during index pool recovery causes cluster performance drop to 1%

Hi,

I'm upgrading from Octopus to Quincy, and in our cluster, as soon as index pool recovery kicks off, cluster operation drops to 1% and slow ops come non-stop.
The recovery takes 1-2 hours per node.

What I can see is that iowait on the NVMe drives belonging to the index pool is pretty high, even though throughput is below 500 MB/s and IOPS is below 5000/sec.

The index pool is a replicated pool (size 3, min_size 2) with 2048 PGs on 156 OSDs (each NVMe drive hosts 4 OSDs, because we experienced latency issues with 1 or 2 OSDs per NVMe).
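
As a quick sanity check on that layout, the numbers above work out as follows. This is plain arithmetic in Python, assuming all 156 OSDs sit four to a drive and PG copies are spread evenly across them.

```python
PG_NUM = 2048        # PGs in the index pool
REPLICA_SIZE = 3     # pool size (min_size is 2)
OSD_COUNT = 156      # OSDs backing the pool
OSDS_PER_NVME = 4    # OSDs carved out of each NVMe drive

pg_replicas = PG_NUM * REPLICA_SIZE        # total PG copies to place
pgs_per_osd = pg_replicas / OSD_COUNT      # average PG copies per OSD
nvme_drives = OSD_COUNT // OSDS_PER_NVME   # drives behind the pool
pgs_per_nvme = pg_replicas / nvme_drives   # average PG copies per drive

print(f"{pg_replicas} PG copies, {pgs_per_osd:.1f} per OSD, "
      f"{nvme_drives} drives, {pgs_per_nvme:.1f} per drive")
# prints: 6144 PG copies, 39.4 per OSD, 39 drives, 157.5 per drive
```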

Even if we assume that the NVMe drives are simply slow under this really small load, how can we ease or get rid of this cluster performance drop?
Would increasing the replica count to 4-5 help? Maybe the pool could then tolerate more PG slowness?

FYI, we have many objects in our cluster, more than 4 billion: objects: 4.06G objects, 616 TiB

However, I think the cluster should still tolerate recovery without such a penalty.

So far, all I can see in the slow OSD's log at the default debug level is the "get_health_metrics" messages:

2024-11-02T12:38:40.762+0700 7f241bc25640  0 log_channel(cluster) log [WRN] : 6 slow requests (by type [ 'delayed' : 6 ] most affected pool [ 'hkg.rgw.buckets.index' : 6 ])
2024-11-02T12:38:41.802+0700 7f241bc25640 -1 osd.110 626281 get_health_metrics reporting 7 slow ops, oldest is osd_op(client.3641786447.0:2194661324 26.588 26:11aa561a:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.503457179.1.10:head [call rgw.bucket_list in=47b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e626262)
2024-11-02T12:38:41.802+0700 7f241bc25640  0 log_channel(cluster) log [WRN] : 7 slow requests (by type [ 'delayed' : 7 ] most affected pool [ 'hkg.rgw.buckets.index' : 7 ])
2024-11-02T12:38:42.782+0700 7f241bc25640 -1 osd.110 626282 get_health_metrics reporting 7 slow ops, oldest is osd_op(client.3641786447.0:2194661324 26.588 26:11aa561a:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.503457179.1.10:head [call rgw.bucket_list in=47b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e626262)
2024-11-02T12:38:42.782+0700 7f241bc25640  0 log_channel(cluster) log [WRN] : 7 slow requests (by type [ 'delayed' : 7 ] most affected pool [ 'hkg.rgw.buckets.index' : 7 ])
2024-11-02T12:38:43.802+0700 7f241bc25640 -1 osd.110 626282 get_health_metrics reporting 7 slow ops, oldest is osd_op(client.3641786447.0:2194661324 26.588 26:11aa561a:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.503457179.1.10:head [call rgw.bucket_list in=47b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e626262)
2024-11-02T12:38:43.802+0700 7f241bc25640  0 log_channel(cluster) log [WRN] : 7 slow requests (by type [ 'delayed' : 7 ] most affected pool [ 'hkg.rgw.buckets.index' : 7 ])
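
The repeated lines above can be summarised with a small parser. This is a rough Python sketch that assumes the `get_health_metrics` line format shown in this log excerpt; the regular expression and the function name `parse_slow_ops` are illustrative, not part of Ceph.

```python
import re

# Matches the "get_health_metrics reporting N slow ops, oldest is osd_op(...)"
# lines as they appear in the excerpt above.
SLOW_OPS_RE = re.compile(
    r"osd\.(?P<osd>\d+) \d+ get_health_metrics reporting (?P<count>\d+) slow ops, "
    r"oldest is osd_op\(\S+ (?P<pg>[0-9a-f]+\.[0-9a-f]+) \S*:::(?P<obj>\S+):head "
    r"\[(?P<op>[^\]]+)\]"
)

# One line copied from the log above, split only for readability.
SAMPLE = (
    "2024-11-02T12:38:41.802+0700 7f241bc25640 -1 osd.110 626281 get_health_metrics "
    "reporting 7 slow ops, oldest is osd_op(client.3641786447.0:2194661324 26.588 "
    "26:11aa561a:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.503457179.1.10:head "
    "[call rgw.bucket_list in=47b] snapc 0=[] "
    "ondisk+read+known_if_redirected+supports_pool_eio e626262)"
)


def parse_slow_ops(lines):
    """Yield (osd, slow_op_count, pg, object, op) for each matching log line."""
    for line in lines:
        m = SLOW_OPS_RE.search(line)
        if m:
            yield int(m["osd"]), int(m["count"]), m["pg"], m["obj"], m["op"]


if __name__ == "__main__":
    for record in parse_slow_ops([SAMPLE]):
        print(record)
```

Run over the excerpt, every `get_health_metrics` line reports the same oldest op: an `rgw.bucket_list` call on one `.dir.` object in PG 26.588 on osd.110.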

We also try to make it smoother as follows: after the update and machine reboot, compaction kicks off and generates 30-40% iowait on the node, so we set the "noup" flag to keep these OSDs out of the cluster until compaction has finished. Once iowait is back to 0 after compaction, I unset noup so that recovery can start, which causes the issue above. If I didn't set noup, the issue would be even bigger.
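
The gating step described above could be scripted roughly like this. This is a minimal Python sketch, not our actual tooling: it assumes `ceph osd set noup` was issued before the OSDs were restarted, `read_iowait` is a hypothetical callable supplied by the caller that returns the node's current iowait, and the threshold and polling interval are arbitrary assumptions.

```python
import subprocess
import time


def run_ceph(*args):
    """Run a ceph CLI command and raise if it fails."""
    subprocess.run(["ceph", *args], check=True)


def unset_noup_when_idle(read_iowait, threshold=1.0, poll_seconds=30,
                         run=run_ceph, sleep=time.sleep):
    """Wait until iowait has settled, then clear the noup flag.

    read_iowait: hypothetical callable returning the current iowait value.
    threshold:   assumed cut-off below which compaction counts as finished.
    Returns the number of polling rounds that were spent waiting.
    """
    polls = 0
    while read_iowait() > threshold:
        sleep(poll_seconds)          # compaction still busy; keep waiting
        polls += 1
    run("osd", "unset", "noup")      # OSDs can be marked up; recovery starts
    return polls
```

The `run` and `sleep` parameters exist only so the loop can be exercised without a cluster, for example by passing a function that records the command instead of executing it.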

Thank you for your help.
