How many RGW gateways? With 300 update requests per second, I would start by increasing the number of shards.

Frédéric.

----- On 14 Nov 24, at 13:33, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:

> This bucket receives 300 post/put/delete a sec.
> I'll take a look at that, thank you.
> 37x4/nvme, however yes, I think we need to increase for now.
> Thank you.
>
> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
> Sent: Thursday, November 14, 2024 5:50 PM
> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxx>
> Subject: Re: Re: Slow ops during index pool recovery causes cluster performance drop to 1%
>
> I don't know how many pools you have in your cluster, but ~37 PGs per OSD seems quite low, especially with NVMes. You could try increasing the number of PGs on this pool, and maybe on the data pool as well.
> I don't know how many iops this bucket receives, but the fact that the index is spread over only 11 rados objects could be a bottleneck with very intensive PUT/DELETE workloads. Maybe someone could confirm that.
> Also check for 'tombstones' and this topic [1] in particular, especially if the bucket receives a lot of PUT/DELETE operations in real time.
>
> Regards,
> Frédéric.
>
> [1] https://www.spinics.net/lists/ceph-users/msg81519.html
>
> ----- On 14 Nov 24, at 10:55, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:
>
>> 156x NVMe OSDs.
>> Sharding I do at around 100000 objects per shard. The default is 11, but they don't have 1.1M objects.
>> This is the tree: https://gist.github.com/Badb0yBadb0y/835a45f8e82ddfcbbd82cf28126da728
>>
>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
>> Sent: Thursday, November 14, 2024 4:28 PM
>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>> Cc: Ceph Users <ceph-users@xxxxxxx>
>> Subject: Re: Re: Slow ops during index pool recovery causes cluster performance drop to 1%
>>
>> Hi Istvan,
>>
>>> Only thing I have in my mind is to increase the replica size from 3 to 5 so it could tolerate more OSD slowness with size 5 min_size 2.
>>
>> I wouldn't do that, it will only get worse as every write IO will have to wait for 2 more OSDs to ACK, and the slow ops you've seen refer to write IOs (looping over "waiting for rw locks").
>> How many NVMe OSDs does this 2048-PG RGW index pool have?
>> Have you checked the num_shards of this bucket that is receiving continuous deletes and uploads 24/7?
>>
>> Regards,
>> Frédéric.
>>
>> ----- On 14 Nov 24, at 7:16, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:
>>
>>> Hi,
>>> This issue was there for us before the update as well, unluckily it's not gone with the update 😕
>>> We don't use HDD, only SSD and NVMe, and the index pool is specifically on NVMe.
>>> Yes, I tried setting the value divided by 4, no luck 🙁
>>> Based on the metadata it seems okay. When I created the OSDs I defined the device class as nvme (ceph-volume lvm batch --bluestore --yes --osds-per-device 4 --crush-device-class nvme /dev/sdo) and in the osd tree it is nvme, but I guess the metadata just shows what the class would have been by default if I hadn't defined anything, which is ssd:
>>>
>>> "bluestore_bdev_type": "ssd",
>>> "default_device_class": "ssd",
>>> "osd_objectstore": "bluestore",
>>> "rotational": "0"
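For what it's worth, the 'default_device_class' in the OSD metadata is only the class the OSD would have registered on its own; the class that CRUSH rules actually act on is the one recorded in the CRUSH map (the CLASS column of 'ceph osd tree'). A quick way to cross-check it, with osd.12 as a purely hypothetical id:

    ceph osd tree                            # CLASS column shows the CRUSH-assigned class per OSD
    ceph osd crush class ls-osd nvme         # list the OSD ids carrying the 'nvme' device class
    # if an OSD ever ended up with the wrong class, it could be reassigned:
    # ceph osd crush rm-device-class osd.12 && ceph osd crush set-device-class nvme osd.12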
>>> "bluestore_bdev_type": "ssd", >>> "default_device_class": "ssd", >>> "osd_objectstore": "bluestore", >>> "rotational": "0" >>> Only thing what I have in my mind to increase the replica size from 3 to 5 so it >>> could tollerate more osd slowness with size 5 min_size 2. >>> Again, thank you again for your ideas. >>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> >>> Sent: Wednesday, November 13, 2024 4:32 PM >>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx> >>> Cc: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx> >>> Subject: Re: Re: Slow ops during index pool recovery causes cluster >>> performance drop to 1% >>> Email received from the internet. If in doubt, don't click any link nor open any >>> attachment ! >>> Hi Istvan, >>> Changing the scheduler to 'wpq' could help you to quickly identify if the issue >>> you're facing is related to 'mclock' or not. >>> If you stick with mclock, depending on the rotational status of each OSD (ceph >>> osd metadata N | jq -r .rotational), you should set each OSD's spec >>> (osd_mclock_max_capacity_iops_hdd if rotational=1 or >>> osd_mclock_max_capacity_iops_ssd if rotational=0) to the value you calculated, >>> instead of letting the OSD trying to figure out and set a value that may not be >>> accurate, especially with multiple OSDs sharing the same underlying device. >>> Have you tried setting each OSD's max capacity (ceph config set osd.N >>> osd_mclock_max_capacity_iops_[hdd, ssd])? >>> Also, make sure the rotational status reported for each OSDs by ceph osd >>> metadata osd.N actually matches the underlying hardware type. This is not >>> always the case depending on how the disks are connected. >>> If it's not, you might have to force it on boot with a udev rule. >>> Regards, >>> Frédéric. >>> ----- Le 13 Nov 24, à 9:43, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> a écrit >>> : >>>> Hi Frédéric, >>>> Thank you the ideas. >>>> Cluster is half updated but on the osds which updated are: >>>> "osd_op_queue": "mclock_scheduler", >>>> "osd_op_queue_cut_off": "high", >>>> I'd say the value when I do the benchmark how ceph calculates it, it is too >>>> high. We have 4 osd on 1 nvme and it sets the value on the last osd from the 4 >>>> on nvme which is the highest: >>>> 36490.280637 >>>> However I changed this value already on some other fully upgraded cluster >>>> divided by 4 and didn't help. >>>> Buffered io turned on since octopus, didn't change it. >>>> For a quick check that specific osd seems like what you tell: >>>> 1 : device size 0x6fc7c00000 : own >>>> 0x[40000~4e00000,12f70000~2252d0000,23b060000~21a230000,4583e0000~20f890000,6b1630000~200000000,35a78f0000~478a20000] >>>> = 0xccc5b0000 : using 0xa60ed0000(42 GiB) : bluestore has 0x62e79f0000(396 GiB) >>>> available >>>> wal_total:0, db_total:456087987814, slow_total:0 >>>> Istvan >>>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> >>>> Sent: Monday, November 4, 2024 4:14 PM >>>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx> >>>> Cc: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx> >>>> Subject: Re: Re: Slow ops during index pool recovery causes cluster >>>> performance drop to 1% >>>> Email received from the internet. If in doubt, don't click any link nor open any >>>> attachment ! >>>> ________________________________ >>>> Hi Istvan, >>>> Is you upgraded cluster using wpq or mclock scheduler? 
>>>>
>>>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
>>>> Sent: Monday, November 4, 2024 4:14 PM
>>>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>>>> Cc: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx>
>>>> Subject: Re: Re: Slow ops during index pool recovery causes cluster performance drop to 1%
>>>>
>>>> Hi Istvan,
>>>>
>>>> Is your upgraded cluster using the wpq or the mclock scheduler? (ceph tell osd.X config show | grep osd_op_queue)
>>>> Maybe your OSDs set their osd_mclock_max_capacity_iops_* capacity too low on start (ceph config dump | grep osd_mclock_max_capacity_iops), limiting their performance.
>>>> You might want to raise these figures if they are set, or go back to wpq to give yourself enough time to understand how mclock works.
>>>> Also, check bluefs_buffered_io as its default value has changed over time. It is better to run with 'true' now (ceph tell osd.X config show | grep bluefs_buffered_io).
>>>> Also, check for any spillover, as there has been a bug in the past with spillover not being reported in ceph status (ceph tell osd.X bluefs stats; the SLOW line should show 0 Bytes and 0 FILES).
>>>>
>>>> Regards,
>>>> Frédéric.
>>>>
>>>> ----- On 4 Nov 24, at 5:24, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:
>>>>
>>>> > Hi Tyler,
>>>> > To be honest we don't have anything set by ourselves regarding compaction and rocksdb.
>>>> > When I check the socket with ceph daemon, both on NVMe and on SSD the compaction-on-start settings have the default false:
>>>> > "mon_compact_on_start": "false"
>>>> > "osd_compact_on_start": "false",
>>>> > Rocksdb is also default:
>>>> > "bluestore_rocksdb_options": "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,max_total_wal_size=1073741824"
>>>> > This is 1 event out of the 20 during the slow ops:
>>>> > https://gist.githubusercontent.com/Badb0yBadb0y/30de736f5d2bd6ec48aa7acf0a3caa14/raw/1070acbf82cc8d69efc04e4e0583e7f83bd33b3f/gistfile1.txt
>>>> > It all belongs to a bucket doing streaming operations, which means continuous deletes and uploads 24/7.
>>>> > I can see throttled options but still don't understand why the latency is so high.
>>>> >
>>>> > ty
>>>> >
>>>> > From: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>
>>>> > Sent: Sunday, November 3, 2024 4:07 PM
>>>> > To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>>>> > Cc: Ceph Users <ceph-users@xxxxxxx>
>>>> > Subject: Re: Re: Slow ops during index pool recovery causes cluster performance drop to 1%
>>>> >
>>>> > On Sun, Nov 3, 2024 at 1:28 AM Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx> wrote:
>>>> >> Hi,
>>>> >> I'm updating from Octopus to Quincy, and in our cluster whenever index pool recovery kicks off, cluster operation drops to 1% and slow ops come non-stop.
>>>> >> The recovery takes 1-2 hours per node.
>>>> >> What I can see is that the iowait on the NVMe drives which belong to the index pool is pretty high, however the throughput is less than 500MB/s and the iops are less than 5000/sec.
>>>> > ...
>>>> >> After the update and machine reboot, compaction kicks off, which generates 30-40 iowait on the node. We use the "noup" flag to keep these OSDs out of the cluster until compaction has finished; once we have 0 iowait after compaction, I unset noup so recovery can start, which causes the above issue. If I didn't set noup it would cause an even bigger issue.
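As an aside, a rough sketch of that keep-out-and-compact step as described, with a hypothetical osd.12 standing in for a freshly rebooted OSD:

    ceph osd set noup              # booted OSDs stay marked 'down' and take no PGs yet
    ceph daemon osd.12 compact     # on the OSD's host: trigger a RocksDB compaction via the admin socket
    # once iowait has settled:
    ceph osd unset noup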
>>>> >
>>>> > By any chance, are you specifying a value for bluestore_rocksdb_options in your ceph.conf? The compaction observation at reboot in particular is odd.
>>>> >
>>>> > Tyler
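For reference, a quick way to see whether a non-default bluestore_rocksdb_options is actually in effect on a given OSD (osd.12 again being a hypothetical id):

    ceph tell osd.12 config show | grep bluestore_rocksdb_options   # value the running OSD is using
    ceph config dump | grep bluestore_rocksdb_options               # anything set centrally in the config database
    grep bluestore_rocksdb_options /etc/ceph/ceph.conf              # anything set locally in ceph.conf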