Hi Frédéric,

Thank you for the ideas. The cluster is only half updated, but on the OSDs that have been updated the settings are:

"osd_op_queue": "mclock_scheduler",
"osd_op_queue_cut_off": "high",

I'd say the value that Ceph calculates when it runs its benchmark is too high. We have 4 OSDs on one NVMe, and the value it sets on the last of those 4 OSDs is the highest: 36490.280637. However, on another, fully upgraded cluster I already changed this value, dividing it by 4, and it didn't help.

Buffered IO has been turned on since Octopus; I didn't change it.

As a quick check, that specific OSD looks like what you describe:

1 : device size 0x6fc7c00000 : own 0x[40000~4e00000,12f70000~2252d0000,23b060000~21a230000,4583e0000~20f890000,6b1630000~200000000,35a78f0000~478a20000] = 0xccc5b0000 : using 0xa60ed0000(42 GiB) : bluestore has 0x62e79f0000(396 GiB) available
wal_total:0, db_total:456087987814, slow_total:0

Istvan
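PS: For reference, this is roughly how the measured capacities can be checked and overridden. The OSD id (osd.12) and the value below are only placeholders, and I'm assuming the flash variant of the option (osd_mclock_max_capacity_iops_ssd), since these are NVMe devices:

  # values the OSDs stored in the config db at startup
  ceph config dump | grep osd_mclock_max_capacity_iops

  # what a single OSD is currently running with
  ceph tell osd.12 config show | grep osd_mclock_max_capacity_iops

  # override the measured value for that OSD, e.g. the benchmark result divided by 4
  ceph config set osd.12 osd_mclock_max_capacity_iops_ssd 9122

  # or remove the stored value again
  ceph config rm osd.12 osd_mclock_max_capacity_iops_ssd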
________________________________
From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
Sent: Monday, November 4, 2024 4:14 PM
To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Cc: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx>
Subject: Re: Re: Slow ops during index pool recovery causes cluster performance drop to 1%

Hi Istvan,

Is your upgraded cluster using the wpq or the mclock scheduler? (ceph tell osd.X config show | grep osd_op_queue)

Maybe your OSDs set their osd_mclock_max_capacity_iops_* capacity too low on start (ceph config dump | grep osd_mclock_max_capacity_iops), limiting their performance. You might want to raise these figures if they are set, or go back to wpq to give yourself enough time to understand how mclock works.

Also, check bluefs_buffered_io, as its default value has changed over time. Better to run with 'true' nowadays (ceph tell osd.X config show | grep bluefs_buffered_io).

Also, check for any overspilling, as there has been a bug in the past where overspilling was not reported in ceph status (ceph tell osd.X bluefs stats; the SLOW line should show 0 Bytes and 0 FILES).

Regards,
Frédéric.

----- Le 4 Nov 24, à 5:24, Istvan Szabo, Agoda Istvan.Szabo@xxxxxxxxx a écrit :

> Hi Tyler,
>
> To be honest we don't have anything set by ourselves regarding compaction and
> rocksdb. When I check the socket with ceph daemon, both the NVMe and the SSD
> OSDs have the default (false) for compaction on start:
>
> "mon_compact_on_start": "false",
> "osd_compact_on_start": "false",
>
> The RocksDB options are also the default:
> "bluestore_rocksdb_options":
> "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,max_total_wal_size=1073741824"
>
> This is 1 event during the slow ops, out of the 20:
> https://gist.githubusercontent.com/Badb0yBadb0y/30de736f5d2bd6ec48aa7acf0a3caa14/raw/1070acbf82cc8d69efc04e4e0583e7f83bd33b3f/gistfile1.txt
>
> All of them belong to a bucket doing streaming operations, which means
> continuous deletes and uploads 24/7.
>
> I can see throttled options in there, but I still don't understand why the
> latency is so high.
>
> ty
>
> ________________________________
> From: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>
> Sent: Sunday, November 3, 2024 4:07 PM
> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxx>
> Subject: Re: Re: Slow ops during index pool recovery causes cluster performance drop to 1%
>
> On Sun, Nov 3, 2024 at 1:28 AM Szabo, Istvan (Agoda)
> <Istvan.Szabo@xxxxxxxxx> wrote:
>> Hi,
>>
>> I'm updating from Octopus to Quincy and, across the whole cluster, when
>> index pool recovery kicks off, cluster operation drops to 1% and slow ops
>> come in non-stop. The recovery takes 1-2 hours per node.
>>
>> What I can see is that the iowait on the NVMe drives that belong to the
>> index pool is pretty high, even though the throughput is less than 500 MB/s
>> and the IOPS are less than 5000/sec.
> ...
>> After the update and machine reboot, compaction kicks off, which generates
>> 30-40 iowait on the node. We use the "noup" flag to keep these OSDs out of
>> the cluster until compaction has finished; once iowait is back to 0 after
>> compaction, I unset noup so recovery can start, which causes the issue
>> above. If I didn't set noup it would cause an even bigger issue.
>
> By any chance, are you specifying a value for
> bluestore_rocksdb_options in your ceph.conf? The compaction
> observation at reboot in particular is odd.
>
> Tyler
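PS2: For completeness, the upgrade workaround described in my original report above is roughly the following sequence. The OSD id is a placeholder, the manual compaction step is optional (we normally just wait for the automatic one), and iostat is just one way to watch the iowait:

  # keep freshly restarted OSDs from being marked up while they compact
  ceph osd set noup

  # ...upgrade and restart the OSDs on the node...

  # (optional) trigger the compaction by hand instead of waiting for it
  ceph tell osd.12 compact

  # watch iowait on the node until compaction is done (back to ~0)
  iostat -x 5

  # then let the OSDs in so recovery can start
  ceph osd unset noup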