How many RGW gateways? With 300 update requests per second, I would start by increasing the number of shards.

Frédéric.

----- On 14 Nov 24, at 13:33, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:

> This bucket receives 300 post/put/delete a sec.
> I'll take a look at that, thank you.
> 37x4/nvme, however yes, I think we need to increase for now.
> Thank you.
>
> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
> Sent: Thursday, November 14, 2024 5:50 PM
> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxx>
> Subject: Re: Re: Slow ops during index pool recovery causes cluster performance drop to 1%
>
> I don't know how many pools you have in your cluster, but ~37 PGs per OSD seems quite low, especially with NVMes. You could try increasing the number of PGs on this pool, and maybe on the data pool as well.
> I don't know how many iops this bucket receives, but the fact that the index is spread over only 11 rados objects could be a bottleneck with very intensive PUT/DELETE workloads. Maybe someone could confirm that.
> Also check for 'tombstones' and this topic [1] in particular, especially if the bucket receives a lot of PUT/DELETE operations in real time.
>
> Regards,
> Frédéric.
>
> [1] https://www.spinics.net/lists/ceph-users/msg81519.html
>
> ----- On 14 Nov 24, at 10:55, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:
>
>> 156x NVMe OSDs.
>> Sharding I do at around 100000 objects per shard. The default is 11, but they don't have 1.1M objects.
>> This is the tree: https://gist.github.com/Badb0yBadb0y/835a45f8e82ddfcbbd82cf28126da728
>>
>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
>> Sent: Thursday, November 14, 2024 4:28 PM
>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>> Cc: Ceph Users <ceph-users@xxxxxxx>
>> Subject: Re: Re: Slow ops during index pool recovery causes cluster performance drop to 1%
>>
>> Hi Istvan,
>>
>>> Only thing I have in my mind is to increase the replica size from 3 to 5 so it could tolerate more OSD slowness with size 5 min_size 2.
>>
>> I wouldn't do that, it will only get worse as every write IO will have to wait for 2 more OSDs to ACK, and the slow ops you've seen refer to write IOs (looping over "waiting for rw locks").
>> How many NVMe OSDs does this 2048-PG RGW index pool have?
>> Have you checked the num_shards of this bucket that is receiving continuous deletes and uploads 24/7?
>>
>> Regards,
>> Frédéric.
>>
>> ----- On 14 Nov 24, at 7:16, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:
>>
>>> Hi,
>>> This issue was there for us before the update as well, unluckily it's not gone with the update 😕
>>> We don't use HDD, only SSD and NVMe, and the index pool is specifically on NVMe.
>>> Yes, I tried setting the value divided by 4, no luck 🙁
>>> Based on the metadata it seems okay. When I created the OSDs I defined the device class as nvme (ceph-volume lvm batch --bluestore --yes --osds-per-device 4 --crush-device-class nvme /dev/sdo) and in the osd tree it is nvme, but I guess the metadata just shows what the class would have been by default if I hadn't defined anything, which is ssd:
>>>
>>> "bluestore_bdev_type": "ssd",
>>> "default_device_class": "ssd",
>>> "osd_objectstore": "bluestore",
>>> "rotational": "0"
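For what it's worth, the 'default_device_class' in the OSD metadata is only the class the OSD would have registered on its own; the class that CRUSH rules actually act on is the one recorded in the CRUSH map (the CLASS column of 'ceph osd tree'). A quick way to cross-check it, with osd.12 as a purely hypothetical id:

    ceph osd tree                            # CLASS column shows the CRUSH-assigned class per OSD
    ceph osd crush class ls-osd nvme         # list the OSD ids carrying the 'nvme' device class
    # if an OSD ever ended up with the wrong class, it could be reassigned:
    # ceph osd crush rm-device-class osd.12 && ceph osd crush set-device-class nvme osd.12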
>>> "bluestore_bdev_type": "ssd", >>> "default_device_class": "ssd", >>> "osd_objectstore": "bluestore", >>> "rotational": "0" >>> Only thing what I have in my mind to increase the replica size from 3 to 5 so it >>> could tollerate more osd slowness with size 5 min_size 2. >>> Again, thank you again for your ideas. >>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> >>> Sent: Wednesday, November 13, 2024 4:32 PM >>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx> >>> Cc: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx> >>> Subject: Re: Re: Slow ops during index pool recovery causes cluster >>> performance drop to 1% >>> Email received from the internet. If in doubt, don't click any link nor open any >>> attachment ! >>> Hi Istvan, >>> Changing the scheduler to 'wpq' could help you to quickly identify if the issue >>> you're facing is related to 'mclock' or not. >>> If you stick with mclock, depending on the rotational status of each OSD (ceph >>> osd metadata N | jq -r .rotational), you should set each OSD's spec >>> (osd_mclock_max_capacity_iops_hdd if rotational=1 or >>> osd_mclock_max_capacity_iops_ssd if rotational=0) to the value you calculated, >>> instead of letting the OSD trying to figure out and set a value that may not be >>> accurate, especially with multiple OSDs sharing the same underlying device. >>> Have you tried setting each OSD's max capacity (ceph config set osd.N >>> osd_mclock_max_capacity_iops_[hdd, ssd])? >>> Also, make sure the rotational status reported for each OSDs by ceph osd >>> metadata osd.N actually matches the underlying hardware type. This is not >>> always the case depending on how the disks are connected. >>> If it's not, you might have to force it on boot with a udev rule. >>> Regards, >>> Frédéric. >>> ----- Le 13 Nov 24, à 9:43, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> a écrit >>> : >>>> Hi Frédéric, >>>> Thank you the ideas. >>>> Cluster is half updated but on the osds which updated are: >>>> "osd_op_queue": "mclock_scheduler", >>>> "osd_op_queue_cut_off": "high", >>>> I'd say the value when I do the benchmark how ceph calculates it, it is too >>>> high. We have 4 osd on 1 nvme and it sets the value on the last osd from the 4 >>>> on nvme which is the highest: >>>> 36490.280637 >>>> However I changed this value already on some other fully upgraded cluster >>>> divided by 4 and didn't help. >>>> Buffered io turned on since octopus, didn't change it. >>>> For a quick check that specific osd seems like what you tell: >>>> 1 : device size 0x6fc7c00000 : own >>>> 0x[40000~4e00000,12f70000~2252d0000,23b060000~21a230000,4583e0000~20f890000,6b1630000~200000000,35a78f0000~478a20000] >>>> = 0xccc5b0000 : using 0xa60ed0000(42 GiB) : bluestore has 0x62e79f0000(396 GiB) >>>> available >>>> wal_total:0, db_total:456087987814, slow_total:0 >>>> Istvan >>>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> >>>> Sent: Monday, November 4, 2024 4:14 PM >>>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx> >>>> Cc: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx> >>>> Subject: Re: Re: Slow ops during index pool recovery causes cluster >>>> performance drop to 1% >>>> Email received from the internet. If in doubt, don't click any link nor open any >>>> attachment ! >>>> ________________________________ >>>> Hi Istvan, >>>> Is you upgraded cluster using wpq or mclock scheduler? 
>>>>
>>>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
>>>> Sent: Monday, November 4, 2024 4:14 PM
>>>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>>>> Cc: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx>
>>>> Subject: Re: Re: Slow ops during index pool recovery causes cluster performance drop to 1%
>>>>
>>>> Hi Istvan,
>>>>
>>>> Is your upgraded cluster using the wpq or the mclock scheduler? (ceph tell osd.X config show | grep osd_op_queue)
>>>> Maybe your OSDs set their osd_mclock_max_capacity_iops_* capacity too low on start (ceph config dump | grep osd_mclock_max_capacity_iops), limiting their performance.
>>>> You might want to raise these figures if they are set, or go back to wpq to give yourself enough time to understand how mclock works.
>>>> Also, check bluefs_buffered_io as its default value has changed over time. It is better to run with 'true' now (ceph tell osd.X config show | grep bluefs_buffered_io).
>>>> Also, check for any spillover, as there has been a bug in the past with spillover not being reported in ceph status (ceph tell osd.X bluefs stats; the SLOW line should show 0 Bytes and 0 FILES).
>>>>
>>>> Regards,
>>>> Frédéric.
>>>>
>>>> ----- On 4 Nov 24, at 5:24, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:
>>>>
>>>> > Hi Tyler,
>>>> > To be honest we don't have anything set by ourselves regarding compaction and rocksdb.
>>>> > When I check the socket with ceph daemon, both on NVMe and on SSD the compaction-on-start settings have the default false:
>>>> > "mon_compact_on_start": "false"
>>>> > "osd_compact_on_start": "false",
>>>> > Rocksdb is also default:
>>>> > "bluestore_rocksdb_options": "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,max_total_wal_size=1073741824"
>>>> > This is 1 event out of the 20 during the slow ops:
>>>> > https://gist.githubusercontent.com/Badb0yBadb0y/30de736f5d2bd6ec48aa7acf0a3caa14/raw/1070acbf82cc8d69efc04e4e0583e7f83bd33b3f/gistfile1.txt
>>>> > It all belongs to a bucket doing streaming operations, which means continuous deletes and uploads 24/7.
>>>> > I can see throttled options but still don't understand why the latency is so high.
>>>> >
>>>> > ty
>>>> >
>>>> > From: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>
>>>> > Sent: Sunday, November 3, 2024 4:07 PM
>>>> > To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>>>> > Cc: Ceph Users <ceph-users@xxxxxxx>
>>>> > Subject: Re: Re: Slow ops during index pool recovery causes cluster performance drop to 1%
>>>> >
>>>> > On Sun, Nov 3, 2024 at 1:28 AM Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx> wrote:
>>>> >> Hi,
>>>> >> I'm updating from Octopus to Quincy, and in our cluster whenever index pool recovery kicks off, cluster operation drops to 1% and slow ops come non-stop.
>>>> >> The recovery takes 1-2 hours per node.
>>>> >> What I can see is that the iowait on the NVMe drives which belong to the index pool is pretty high, however the throughput is less than 500MB/s and the iops are less than 5000/sec.
>>>> > ...
>>>> >> After the update and machine reboot, compaction kicks off, which generates 30-40 iowait on the node. We use the "noup" flag to keep these OSDs out of the cluster until compaction has finished; once we have 0 iowait after compaction, I unset noup so recovery can start, which causes the above issue. If I didn't set noup it would cause an even bigger issue.
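As an aside, a rough sketch of that keep-out-and-compact step as described, with a hypothetical osd.12 standing in for a freshly rebooted OSD:

    ceph osd set noup              # booted OSDs stay marked 'down' and take no PGs yet
    ceph daemon osd.12 compact     # on the OSD's host: trigger a RocksDB compaction via the admin socket
    # once iowait has settled:
    ceph osd unset noup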
>>>> >
>>>> > By any chance, are you specifying a value for bluestore_rocksdb_options in your ceph.conf? The compaction observation at reboot in particular is odd.
>>>> >
>>>> > Tyler
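For reference, a quick way to see whether a non-default bluestore_rocksdb_options is actually in effect on a given OSD (osd.12 again being a hypothetical id):

    ceph tell osd.12 config show | grep bluestore_rocksdb_options   # value the running OSD is using
    ceph config dump | grep bluestore_rocksdb_options               # anything set centrally in the config database
    grep bluestore_rocksdb_options /etc/ceph/ceph.conf              # anything set locally in ceph.conf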