Re: Slow ops during index pool recovery causes cluster performance drop to 1%


 



Nice use case! I don't know whether all of those IOPS relate to the same bucket which you said has only 11 shards, but if they do, you'd better increase the number of shards ASAP, as that is probably the bottleneck in your situation (not the PGs per OSD, though increasing those would still help). 

IIRC, each metadata shard is a single RADOS object that lists part of the bucket's data. Every time an object from that list needs an update, this RADOS object needs to be updated. Limiting the number of objects a single shard holds (~100k) certainly reduces the chances that an S3 operation will 'touch' this object. Still, with that many write IOPS per second, one would expect these shards to be touched far more often, perhaps too often to keep up with the workload. 

Hence, increasing the number of shards could help with that many IOPS, I think. 
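
For instance, a minimal sketch (bucket name and target shard count are placeholders; recent releases report num_shards in bucket stats): 

radosgw-admin bucket stats --bucket=<your-bucket> | grep num_shards    # current shard count
radosgw-admin bucket reshard --bucket=<your-bucket> --num-shards=101   # e.g. to stay around ~100k objects per shard
radosgw-admin reshard status --bucket=<your-bucket>                    # follow the resharding progress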

Frédéric. 

PS: You might want to consider scheduling this resharding operation during a lower traffic period. 

----- On 15 Nov 24, at 4:43, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote: 

> That is just one user; at peak the cluster has 1.1M read IOPS with 10 GiB/s read
> throughput on 27-30 gateways, and around 20-50k write IOPS with 1.5-2 GiB/s write
> throughput.

> I'll give increasing the index pool PG count a try, aiming for 400 PGs per NVMe.

> Istvan

> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
> Sent: Thursday, November 14, 2024 8:38:40 PM
> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxx>
> Subject: Re: Re: Slow ops during index pool recovery causes cluster
> performance drop to 1%

> How many RGW gateways? With 300 update requests per second, I would start by
> increasing the number of shards.

> Frédéric.

> ----- On 14 Nov 24, at 13:33, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:

>> This bucket receives 300 POST/PUT/DELETE requests a second.
>> I'll take a look at that, thank you.
>> 37 PGs per OSD with 4 OSDs per NVMe; however yes, I think we need to increase for now.
>> Thank you.

>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
>> Sent: Thursday, November 14, 2024 5:50 PM
>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>> Cc: Ceph Users <ceph-users@xxxxxxx>
>> Subject: Re: Re: Slow ops during index pool recovery causes cluster
>> performance drop to 1%

>> I don't know how many pools you have in your cluster, but ~37 PGs per OSD seems
>> quite low, especially with NVMes. You could try increasing the number of PGs on
>> this pool, and maybe on the data pool as well.
>> I don't know how many IOPS this bucket receives, but the fact that the index is
>> spread over only 11 RADOS objects could be a bottleneck with very intensive
>> PUT/DELETE workloads. Maybe someone could confirm that.
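
>> As a minimal sketch (pool name and target pg_num are assumptions to adjust for your setup):
>> ceph osd pool get default.rgw.buckets.index pg_num       # current PG count
>> ceph osd pool set default.rgw.buckets.index pg_num 4096  # raise it; pgp_num follows automatically on recent releases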

>> Also check for 'tombstones' and this topic [1] in particular, especially if the
>> bucket receives a lot of PUT/DELETE operations in real time.

>> Regards,
>> Frédéric.

>> [1] https://www.spinics.net/lists/ceph-users/msg81519.html

>> ----- On 14 Nov 24, at 10:55, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:

>>> 156x NVMe OSDs.
>>> For sharding I use about 100,000 objects per shard. This bucket is at the default 11
>>> shards because it doesn't have 1.1M objects.

>>> This is the tree:
>>> https://gist.github.com/Badb0yBadb0y/835a45f8e82ddfcbbd82cf28126da728

>>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
>>> Sent: Thursday, November 14, 2024 4:28 PM
>>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>>> Cc: Ceph Users <ceph-users@xxxxxxx>
>>> Subject: Re: Re: Slow ops during index pool recovery causes cluster
>>> performance drop to 1%

>>> Hi Istvan,

>>>> The only thing I have in mind is to increase the replica size from 3 to 5, so it
>>>> could tolerate more OSD slowness with size 5, min_size 2.

>>> I wouldn't do that; it will only get worse, as every write IO will have to wait
>>> for 2 more OSDs to ACK, and the slow ops you've seen refer to write IOs
>>> (looping over "waiting for rw locks").

>>> How many NVMe OSDs does this 2048-PG RGW index pool have?

>>> Have you checked the num_shards of this bucket that is receiving continuous
>>> deletes and uploads 24/7?
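
>>> If you're not sure, a quick way to check (output fields may vary by release):
>>> radosgw-admin bucket limit check    # reports objects per shard and fill_status for each bucket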

>>> Regards,
>>> Frédéric.

>>> ----- On 14 Nov 24, at 7:16, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:

>>>> Hi,

>>>> We had this issue before the update as well; unluckily it's not gone with the update 😕
>>>> We don't use HDDs, only SSD and NVMe, and the index pool is specifically on NVMe.
>>>> Yes, I tried setting the value divided by 4, no luck 🙁

>>>> Based on the metadata it seems okay. When I created the OSDs I defined the device
>>>> class as nvme (ceph-volume lvm batch --bluestore --yes --osds-per-device 4
>>>> --crush-device-class nvme /dev/sdo), and in the osd tree it is nvme. I guess the
>>>> metadata just reports the default: if I hadn't defined anything, it would have
>>>> been ssd.
>>>> "bluestore_bdev_type": "ssd",
>>>> "default_device_class": "ssd",
>>>> "osd_objectstore": "bluestore",
>>>> "rotational": "0"

>>>> The only thing I have in mind is to increase the replica size from 3 to 5, so it
>>>> could tolerate more OSD slowness with size 5, min_size 2.

>>>> Thank you again for your ideas.

>>>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
>>>> Sent: Wednesday, November 13, 2024 4:32 PM
>>>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>>>> Cc: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx>
>>>> Subject: Re: Re: Slow ops during index pool recovery causes cluster
>>>> performance drop to 1%

>>>> Hi Istvan,

>>>> Changing the scheduler to 'wpq' could help you to quickly identify if the issue
>>>> you're facing is related to 'mclock' or not.

>>>> If you stick with mclock, depending on the rotational status of each OSD (ceph
>>>> osd metadata N | jq -r .rotational), you should set each OSD's spec
>>>> (osd_mclock_max_capacity_iops_hdd if rotational=1 or
>>>> osd_mclock_max_capacity_iops_ssd if rotational=0) to the value you calculated,
>>>> instead of letting the OSD try to figure out and set a value that may not be
>>>> accurate, especially with multiple OSDs sharing the same underlying device.

>>>> Have you tried setting each OSD's max capacity (ceph config set osd.N
>>>> osd_mclock_max_capacity_iops_[hdd, ssd])?
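
>>>> For example, a sketch (OSD id and figure are placeholders; use your own per-OSD value, e.g. the device benchmark divided by the number of OSDs sharing it):
>>>> ceph config set osd.12 osd_mclock_max_capacity_iops_ssd 9000
>>>> ceph config show osd.12 osd_mclock_max_capacity_iops_ssd    # verify the value in effect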

>>>> Also, make sure the rotational status reported for each OSD by ceph osd
>>>> metadata osd.N actually matches the underlying hardware type. This is not
>>>> always the case depending on how the disks are connected.
>>>> If it's not, you might have to force it on boot with a udev rule.
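
>>>> Something along these lines, as a sketch (the device match is an assumption; scope it to the affected SSDs only):
>>>> # /etc/udev/rules.d/99-force-ssd-rotational.rules
>>>> ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sdo", ATTR{queue/rotational}="0"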

>>>> Regards,
>>>> Frédéric.

>>>> ----- On 13 Nov 24, at 9:43, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:

>>>>> Hi Frédéric,

>>>>> Thank you for the ideas.

>>>>> The cluster is half updated, but the OSDs that have been updated have:

>>>>> "osd_op_queue": "mclock_scheduler",
>>>>> "osd_op_queue_cut_off": "high",

>>>>> I'd say the value Ceph calculates when it runs the benchmark is too high. We have
>>>>> 4 OSDs on 1 NVMe, and the value it sets on the last of the 4 OSDs on the NVMe is
>>>>> the highest:
>>>>> 36490.280637

>>>>> However, I already changed this value (divided by 4) on another fully upgraded
>>>>> cluster and it didn't help.
>>>>> Buffered IO has been turned on since Octopus; I didn't change it.

>>>>> For a quick check, that specific OSD looks like what you describe:

>>>>> 1 : device size 0x6fc7c00000 : own
>>>>> 0x[40000~4e00000,12f70000~2252d0000,23b060000~21a230000,4583e0000~20f890000,6b1630000~200000000,35a78f0000~478a20000]
>>>>> = 0xccc5b0000 : using 0xa60ed0000(42 GiB) : bluestore has 0x62e79f0000(396 GiB)
>>>>> available
>>>>> wal_total:0, db_total:456087987814, slow_total:0

>>>>> Istvan

>>>>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
>>>>> Sent: Monday, November 4, 2024 4:14 PM
>>>>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>>>>> Cc: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx>
>>>>> Subject: Re: Re: Slow ops during index pool recovery causes cluster
>>>>> performance drop to 1%

>>>>> Hi Istvan,

>>>>> Is your upgraded cluster using the wpq or mclock scheduler? (ceph tell osd.X config
>>>>> show | grep osd_op_queue)

>>>>> Maybe your OSDs set their osd_mclock_max_capacity_iops_* capacity too low on
>>>>> start (ceph config dump | grep osd_mclock_max_capacity_iops) limiting their
>>>>> performance.

>>>>> You might want to raise these figures if set or go back to wpq to give you
>>>>> enough time to understand how mclock works.

>>>>> Also, check bluefs_buffered_io, as its default value changed over time. It had better
>>>>> be 'true' now (ceph tell osd.X config show | grep bluefs_buffered_io).
>>>>> Also, check for any overspilling, as there's been a bug in the past with overspilling
>>>>> not being reported in ceph status (ceph tell osd.X bluefs stats; the SLOW line should
>>>>> show 0 Bytes and 0 FILES).
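
>>>>> A quick sweep over all OSDs, as a sketch (assumes admin access and all OSDs up):
>>>>> for i in $(ceph osd ls); do echo -n "osd.$i: "; ceph tell osd.$i bluefs stats | grep -i slow; done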

>>>>> Regards,
>>>>> Frédéric.

>>>>> ----- On 4 Nov 24, at 5:24, Istvan Szabo, Agoda Istvan.Szabo@xxxxxxxxx wrote:

>>>>> > Hi Tyler,

>>>>> > To be honest we don't have anything set by ourselves regarding compaction and
>>>>> > rocksdb:
>>>>> > When I check the socket with ceph daemon, both the NVMe and SSD OSDs have the
>>>>> > default false for compaction on start:
>>>>> > "mon_compact_on_start": "false"
>>>>> > "osd_compact_on_start": "false",

>>>>> > Rocksdb also default:
>>>>> > bluestore_rocksdb_options":
>>>>> > "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,max_total_wal_size=1073741824"

>>>>> > This is 1 event during the slow ops out of the 20:
>>>>> > https://gist.githubusercontent.com/Badb0yBadb0y/30de736f5d2bd6ec48aa7acf0a3caa14/raw/1070acbf82cc8d69efc04e4e0583e7f83bd33b3f/gistfile1.txt

>>>>> > It all belongs to a bucket doing streaming operations, which means continuous
>>>>> > deletes and uploads 24/7.

>>>>> > I can see throttled options, but I still don't understand why the latency is so high.


>>>>> > ty

>>>>> > ________________________________
>>>>> > From: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>
>>>>> > Sent: Sunday, November 3, 2024 4:07 PM
>>>>> > To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>>>>> > Cc: Ceph Users <ceph-users@xxxxxxx>
>>>>> > Subject: Re: Re: Slow ops during index pool recovery causes cluster
>>>>> > performance drop to 1%


>>>>> > On Sun, Nov 3, 2024 at 1:28 AM Szabo, Istvan (Agoda)
>>>>> > <Istvan.Szabo@xxxxxxxxx> wrote:
>>>>> >> Hi,

>>>>> >> I'm updating from Octopus to Quincy, and whenever index pool recovery kicks off
>>>>> >> anywhere in our cluster, cluster operation drops to 1% and slow ops come non-stop.
>>>>> >> The recovery takes 1-2 hours per node.

>>>>> >> What I can see is that the iowait on the NVMe drives belonging to the index pool is
>>>>> >> pretty high; however, the throughput is less than 500 MB/s and the IOPS are less
>>>>> >> than 5000/sec.
>>>>> > ...
>>>>> >> After the update and machine reboot, compaction kicks off, which generates 30-40
>>>>> >> iowait on the node. We use the "noup" flag to keep these OSDs out of the cluster
>>>>> >> until compaction has finished; however, once we have 0 iowait after compaction, I
>>>>> >> unset noup so recovery can start, which causes the above issue. If I didn't set
>>>>> >> noup it would cause an even bigger issue.

>>>>> > By any chance, are you specifying a value for
>>>>> > bluestore_rocksdb_options in your ceph.conf? The compaction
>>>>> > observation at reboot in particular is odd.

>>>>> > Tyler

>>>>> > _______________________________________________
>>>>> > ceph-users mailing list -- ceph-users@xxxxxxx
>>>>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



