Re: Slow ops during index pool recovery causes cluster performance drop to 1%


 



I don't know how many pools you have in your cluster, but ~37 PGs per OSD seems quite low, especially with NVMes. You could try increasing the number of PGs on this pool, and maybe on the data pool as well. 
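
For example, something along these lines (the pool name below is just the default RGW index pool name, adjust it to yours, and pick a power of two that keeps a sane PG-per-OSD ratio): 

  # current PG count on the index pool
  ceph osd pool get default.rgw.buckets.index pg_num

  # raise it (pgp_num follows automatically on recent releases)
  ceph osd pool set default.rgw.buckets.index pg_num 256
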
I don't know how many IOPS this bucket receives, but the fact that its index is spread over only 11 RADOS objects could be a bottleneck under very intensive PUT/DELETE workloads. Maybe someone can confirm that. 
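
A quick way to check is (the bucket name is a placeholder): 

  radosgw-admin bucket limit check
  radosgw-admin bucket stats --bucket=<bucket>

and, if the PUT/DELETE rate justifies it, resharding spreads the index over more RADOS objects: 

  radosgw-admin bucket reshard --bucket=<bucket> --num-shards=<N>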

Also check for 'tombstones' and this topic [1] in particular, especially if the bucket receives a lot of PUT/DELETE operations in real time. 
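
If tombstones do pile up on the index OSDs, a manual compaction usually brings latency back down. Something like this, one OSD at a time (the OSD id is a placeholder): 

  ceph tell osd.<id> compact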

Regards, 
Frédéric. 

[1] https://www.spinics.net/lists/ceph-users/msg81519.html 

----- On 14 Nov 24, at 10:55, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote: 

> 156x NVMe OSDs.
> For sharding I use roughly 100,000 objects per shard. These buckets are at the default 11
> shards, but they don't have 1.1M objects.

> This is the tree:
> https://gist.github.com/Badb0yBadb0y/835a45f8e82ddfcbbd82cf28126da728

> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
> Sent: Thursday, November 14, 2024 4:28 PM
> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxx>
> Subject: Re:  Re: Slow ops during index pool recovery causes cluster
> performance drop to 1%

> Hi Istvan,

>> The only thing I have in mind is to increase the replica size from 3 to 5, so it
>> could tolerate more OSD slowness with size 5, min_size 2.

> I wouldn't do that; it would only make things worse, as every write IO would have to wait
> for 2 more OSDs to ACK, and the slow ops you've seen refer to write IOs
> (looping on "waiting for rw locks").

> How many NVMe OSDs does this 2048-PG RGW index pool have?

> Have you checked the num_shards of this bucket that is receiving continuous
> deletes and uploads 24/7?

> Regards,
> Frédéric.

> ----- On 14 Nov 24, at 7:16, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:

>> Hi,

>> We had this issue before the update as well; unluckily, it didn't go away with the update
>> 😕
>> We don't use HDDs, only SSDs and NVMe, and the index pool is specifically on NVMe.
>> Yes, I tried setting the value divided by 4, no luck 🙁

>> Based on the metadata it seems okay. When I created the OSDs I explicitly set the device
>> class to nvme (ceph-volume lvm batch --bluestore --yes --osds-per-device 4
>> --crush-device-class nvme /dev/sdo) and in the osd tree they show up as nvme, but I guess
>> the metadata below just reports what the class would have been by default (ssd) if I
>> hadn't defined anything:
>> "bluestore_bdev_type": "ssd",
>> "default_device_class": "ssd",
>> "osd_objectstore": "bluestore",
>> "rotational": "0"

>> The only thing I have in mind is to increase the replica size from 3 to 5, so it
>> could tolerate more OSD slowness with size 5, min_size 2.

>> Thank you again for your ideas.

>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
>> Sent: Wednesday, November 13, 2024 4:32 PM
>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>> Cc: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx>
>> Subject: Re:  Re: Slow ops during index pool recovery causes cluster
>> performance drop to 1%

>> Hi Istvan,

>> Changing the scheduler to 'wpq' could help you to quickly identify if the issue
>> you're facing is related to 'mclock' or not.

>> If you stick with mclock, then depending on the rotational status of each OSD (ceph
>> osd metadata N | jq -r .rotational), you should set each OSD's max capacity
>> (osd_mclock_max_capacity_iops_hdd if rotational=1, or
>> osd_mclock_max_capacity_iops_ssd if rotational=0) to the value you calculated,
>> instead of letting the OSD try to figure out and set a value that may not be
>> accurate, especially with multiple OSDs sharing the same underlying device.

>> Have you tried setting each OSD's max capacity (ceph config set osd.N
>> osd_mclock_max_capacity_iops_[hdd, ssd])?
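
>> For example, with ~36000 IOPS measured on the whole NVMe and 4 OSDs sharing it, something
>> like this (the OSD ids and the value are examples only, not a recommendation):

>> for i in 0 1 2 3; do ceph config set osd.$i osd_mclock_max_capacity_iops_ssd 9000; done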

>> Also, make sure the rotational status reported for each OSD by ceph osd
>> metadata osd.N actually matches the underlying hardware type. This is not
>> always the case, depending on how the disks are connected.
>> If it doesn't, you might have to force it on boot with a udev rule.
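
>> Something along these lines should do it (untested sketch, adjust the device match to
>> your hardware):

>> # /etc/udev/rules.d/99-ceph-rotational.rules
>> ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/rotational}="0"

>> followed by 'udevadm control --reload && udevadm trigger' (or a reboot).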

>> Regards,
>> Frédéric.

>> ----- On 13 Nov 24, at 9:43, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:

>>> Hi Frédéric,

>>> Thank you for the ideas.

>>> The cluster is only half updated, but on the OSDs that have been updated:

>>> "osd_op_queue": "mclock_scheduler",
>>> "osd_op_queue_cut_off": "high",

>>> I'd say the value Ceph calculates when it runs its benchmark is too high. We have 4
>>> OSDs on 1 NVMe, and the value it sets on the last of the 4 OSDs on that NVMe is the
>>> highest:
>>> 36490.280637

>>> However, I already changed this value (divided by 4) on another fully upgraded cluster
>>> and it didn't help.
>>> Buffered IO has been turned on since Octopus; I didn't change it.

>>> A quick check of that specific OSD seems to show what you described:

>>> 1 : device size 0x6fc7c00000 : own
>>> 0x[40000~4e00000,12f70000~2252d0000,23b060000~21a230000,4583e0000~20f890000,6b1630000~200000000,35a78f0000~478a20000]
>>> = 0xccc5b0000 : using 0xa60ed0000(42 GiB) : bluestore has 0x62e79f0000(396 GiB)
>>> available
>>> wal_total:0, db_total:456087987814, slow_total:0

>>> Istvan

>>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
>>> Sent: Monday, November 4, 2024 4:14 PM
>>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>>> Cc: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx>
>>> Subject: Re:  Re: Slow ops during index pool recovery causes cluster
>>> performance drop to 1%

>>> Hi Istvan,

>>> Is your upgraded cluster using the wpq or the mclock scheduler? (ceph tell osd.X config
>>> show | grep osd_op_queue)

>>> Maybe your OSDs set their osd_mclock_max_capacity_iops_* capacity too low on
>>> start (ceph config dump | grep osd_mclock_max_capacity_iops) limiting their
>>> performance.

>>> You might want to raise these figures if they are set, or go back to wpq to give
>>> yourself enough time to understand how mclock works.
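
>>> Switching back is a one-liner, followed by an OSD restart for it to take effect:

>>> ceph config set osd osd_op_queue wpq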

>>> Also, check bluefs_buffered_io, as its default value has changed over time. It is
>>> better set to 'true' nowadays (ceph tell osd.X config show | grep bluefs_buffered_io).
>>> Also, check for any overspilling, as there was a bug in the past where overspilling was
>>> not reported in ceph status (ceph tell osd.X bluefs stats; the SLOW line should show 0
>>> Bytes and 0 FILES).

>>> Regards,
>>> Frédéric.

>>> ----- On 4 Nov 24, at 5:24, Istvan Szabo, Agoda Istvan.Szabo@xxxxxxxxx wrote:

>>> > Hi Tyler,

>>> > To be honest, we haven't set anything ourselves regarding compaction and rocksdb.
>>> > When I check the socket with ceph daemon, on both NVMe and SSD the compaction options
>>> > have the default false:
>>> > "mon_compact_on_start": "false"
>>> > "osd_compact_on_start": "false",

>>> > Rocksdb also default:
>>> > bluestore_rocksdb_options":
>>> > "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,max_total_wal_size=1073741824"

>>> > This is 1 event out of the 20 during the slow ops:
>>> > https://gist.githubusercontent.com/Badb0yBadb0y/30de736f5d2bd6ec48aa7acf0a3caa14/raw/1070acbf82cc8d69efc04e4e0583e7f83bd33b3f/gistfile1.txt

>>> > They all belong to a bucket doing streaming operations, which means continuous
>>> > deletes and uploads 24/7.

>>> > I can see throttled ops in there, but I still don't understand why the latency is so
>>> > high.


>>> > ty

>>> > ________________________________
>>> > From: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>
>>> > Sent: Sunday, November 3, 2024 4:07 PM
>>> > To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>>> > Cc: Ceph Users <ceph-users@xxxxxxx>
>>> > Subject: Re:  Re: Slow ops during index pool recovery causes cluster
>>> > performance drop to 1%


>>> > On Sun, Nov 3, 2024 at 1:28 AM Szabo, Istvan (Agoda)
>>> > <Istvan.Szabo@xxxxxxxxx> wrote:
>>> >> Hi,

>>> >> I'm updating from Octopus to Quincy, and in our cluster, whenever index pool
>>> >> recovery kicks off, cluster operations drop to 1% and slow ops come in non-stop.
>>> >> The recovery takes 1-2 hours per node.

>>> >> What I can see is that the iowait on the NVMe drives that belong to the index pool
>>> >> is pretty high, even though the throughput is less than 500 MB/s and the IOPS are
>>> >> less than 5000/sec.
>>> > ...
>>> >> After the update and machine reboot, compaction kicks off, which generates 30-40
>>> >> iowait on the node. We use the "noup" flag to keep these OSDs out of the cluster
>>> >> until compaction has finished; once iowait is back to 0 after compaction, I unset
>>> >> noup so recovery can start, which causes the issue above. If I didn't set noup it
>>> >> would cause an even bigger issue.

>>> > By any chance, are you specifying a value for
>>> > bluestore_rocksdb_options in your ceph.conf? The compaction
>>> > observation at reboot in particular is odd.

>>> > Tyler


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



