I don't know how many pools you have in your cluster, but ~37 PGs per OSD seems quite low, especially with NVMes. You could try increasing the number of PGs on this pool, and maybe on the data pool as well.

I don't know how many IOPS this bucket receives, but the fact that the index is spread over only 11 RADOS objects could be a bottleneck with very intensive PUT/DELETE workloads. Maybe someone could confirm that.

Also check for 'tombstones', and this topic [1] in particular, especially if the bucket receives a lot of PUT/DELETE operations in real time.

Regards,
Frédéric.

[1] https://www.spinics.net/lists/ceph-users/msg81519.html

----- On 14 Nov 24, at 10:55, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:

> 156x NVME osd
> Sharding I do at roughly 100000 objects per shard. The default is 11, but they don't
> have 1.1M objects.
> This is the tree:
> https://gist.github.com/Badb0yBadb0y/835a45f8e82ddfcbbd82cf28126da728
>
> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
> Sent: Thursday, November 14, 2024 4:28 PM
> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxx>
> Subject: Re: Re: Slow ops during index pool recovery causes cluster
> performance drop to 1%
>
> Hi Istvan,
>
>> The only thing I have in mind is to increase the replica size from 3 to 5 so it
>> could tolerate more OSD slowness, with size 5 and min_size 2.
>
> I wouldn't do that; it will only get worse, as every write IO will have to wait
> for 2 more OSDs to ACK, and the slow ops you've seen refer to write IOs
> (looping over "waiting for rw locks").
>
> How many NVMe OSDs does this 2048-PG RGW index pool have?
>
> Have you checked the num_shards of this bucket that is receiving continuous
> deletes and uploads 24/7?
>
> Regards,
> Frédéric.
>
> ----- On 14 Nov 24, at 7:16, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:
>
>> Hi,
>>
>> This issue existed for us before the update as well; unluckily, it's not gone
>> with the update 😕
>>
>> We don't use HDD, only SSD and NVMe, and the index pool is specifically on NVMe.
>>
>> Yes, I tried to set the value divided by 4, no luck 🙁
>>
>> Based on the metadata it seems okay. However, when I created the OSDs I defined
>> the device class as nvme (ceph-volume lvm batch --bluestore --yes --osds-per-device 4
>> --crush-device-class nvme /dev/sdo), and in the osd tree it shows as nvme, but I
>> guess the metadata reports what it would have been by default (ssd) had I not
>> defined anything:
>>
>> "bluestore_bdev_type": "ssd",
>> "default_device_class": "ssd",
>> "osd_objectstore": "bluestore",
>> "rotational": "0"
>>
>> The only thing I have in mind is to increase the replica size from 3 to 5 so it
>> could tolerate more OSD slowness, with size 5 and min_size 2.
>>
>> Again, thank you for your ideas.
>>
>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
>> Sent: Wednesday, November 13, 2024 4:32 PM
>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>> Cc: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx>
>> Subject: Re: Re: Slow ops during index pool recovery causes cluster
>> performance drop to 1%
>>
>> Hi Istvan,
>>
>> Changing the scheduler to 'wpq' could help you to quickly identify whether the
>> issue you're facing is related to 'mclock' or not.
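For reference, a minimal sketch of how such a wpq test might be run, assuming the cluster uses the centralized config database (these commands are an illustration rather than something quoted from the thread; osd_op_queue is only read at OSD startup, so a restart is required):

    # Switch the op queue scheduler back to wpq for all OSDs (effective after restart).
    ceph config set osd osd_op_queue wpq
    ceph config set osd osd_op_queue_cut_off high

    # Restart the OSDs through your deployment tooling (cephadm, systemd units, ...),
    # ideally one failure domain at a time, then verify the running value:
    ceph tell osd.0 config show | grep osd_op_queue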
>> If you stick with mclock, then depending on the rotational status of each OSD
>> (ceph osd metadata N | jq -r .rotational), you should set each OSD's spec
>> (osd_mclock_max_capacity_iops_hdd if rotational=1, or
>> osd_mclock_max_capacity_iops_ssd if rotational=0) to the value you calculated,
>> instead of letting the OSD try to figure out and set a value that may not be
>> accurate, especially with multiple OSDs sharing the same underlying device.
>>
>> Have you tried setting each OSD's max capacity (ceph config set osd.N
>> osd_mclock_max_capacity_iops_[hdd,ssd])?
>>
>> Also, make sure the rotational status reported for each OSD by ceph osd
>> metadata osd.N actually matches the underlying hardware type. This is not
>> always the case, depending on how the disks are connected.
>> If it's not, you might have to force it on boot with a udev rule.
>>
>> Regards,
>> Frédéric.
>>
>> ----- On 13 Nov 24, at 9:43, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:
>>
>>> Hi Frédéric,
>>>
>>> Thank you for the ideas.
>>>
>>> The cluster is half updated, but on the OSDs that have been updated:
>>> "osd_op_queue": "mclock_scheduler",
>>> "osd_op_queue_cut_off": "high",
>>>
>>> I'd say the value that Ceph calculates when it benchmarks the OSD is too high.
>>> We have 4 OSDs on 1 NVMe, and it sets the value on the last of the 4 OSDs on
>>> the NVMe, which is the highest:
>>> 36490.280637
>>> However, I already changed this value divided by 4 on another fully upgraded
>>> cluster and it didn't help.
>>>
>>> Buffered IO has been turned on since Octopus; we didn't change it.
>>>
>>> For a quick check, that specific OSD seems to show what you describe (no spillover):
>>> 1 : device size 0x6fc7c00000 : own
>>> 0x[40000~4e00000,12f70000~2252d0000,23b060000~21a230000,4583e0000~20f890000,6b1630000~200000000,35a78f0000~478a20000]
>>> = 0xccc5b0000 : using 0xa60ed0000(42 GiB) : bluestore has 0x62e79f0000(396 GiB) available
>>> wal_total:0, db_total:456087987814, slow_total:0
>>>
>>> Istvan
>>>
>>> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
>>> Sent: Monday, November 4, 2024 4:14 PM
>>> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>>> Cc: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx>
>>> Subject: Re: Re: Slow ops during index pool recovery causes cluster
>>> performance drop to 1%
>>>
>>> Hi Istvan,
>>>
>>> Is your upgraded cluster using the wpq or the mclock scheduler? (ceph tell osd.X
>>> config show | grep osd_op_queue)
>>>
>>> Maybe your OSDs set their osd_mclock_max_capacity_iops_* capacity too low on
>>> start (ceph config dump | grep osd_mclock_max_capacity_iops), limiting their
>>> performance.
>>> You might want to raise these figures if they are set, or go back to wpq to give
>>> yourself enough time to understand how mclock works.
>>>
>>> Also, check bluefs_buffered_io, as its default value changed over time. It should
>>> be 'true' now (ceph tell osd.X config show | grep bluefs_buffered_io).
>>>
>>> Also, check for any overspilling, as there's been a bug in the past with
>>> overspilling not being reported in ceph status (ceph tell osd.X bluefs stats; the
>>> SLOW line should show 0 Bytes and 0 FILES).
>>>
>>> Regards,
>>> Frédéric.
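As a rough sketch of pinning the mclock capacity per OSD when several OSDs share one NVMe device (the OSD id and the IOPS figure below are made-up examples, not values from the thread; derive the real number from your own measurement, e.g. the device's measured IOPS divided by the number of OSDs on it):

    # Confirm the OSD is reported as non-rotational (expect "0" for NVMe).
    ceph osd metadata 12 | jq -r .rotational

    # Pin the per-OSD capacity instead of relying on the startup benchmark.
    ceph config set osd.12 osd_mclock_max_capacity_iops_ssd 9000

    # Verify what is stored in the config database.
    ceph config dump | grep osd_mclock_max_capacity_iops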
>>> ----- On 4 Nov 24, at 5:24, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:
>>>
>>> > Hi Tyler,
>>> >
>>> > To be honest, we don't have anything set by ourselves regarding compaction and
>>> > rocksdb:
>>> >
>>> > When I check the socket with ceph daemon, both nvme and ssd have the default
>>> > false for compact on start:
>>> > "mon_compact_on_start": "false"
>>> > "osd_compact_on_start": "false",
>>> >
>>> > Rocksdb is also default:
>>> > "bluestore_rocksdb_options":
>>> > "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,max_total_wal_size=1073741824"
>>> >
>>> > This is 1 event out of the 20 during the slow ops:
>>> > https://gist.githubusercontent.com/Badb0yBadb0y/30de736f5d2bd6ec48aa7acf0a3caa14/raw/1070acbf82cc8d69efc04e4e0583e7f83bd33b3f/gistfile1.txt
>>> >
>>> > They all belong to a bucket doing streaming operations, which means continuous
>>> > deletes and uploads 24/7.
>>> > I can see 'throttled' entries but still don't understand why the latency is so high.
>>> >
>>> > ty
>>> >
>>> > From: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>
>>> > Sent: Sunday, November 3, 2024 4:07 PM
>>> > To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>>> > Cc: Ceph Users <ceph-users@xxxxxxx>
>>> > Subject: Re: Re: Slow ops during index pool recovery causes cluster
>>> > performance drop to 1%
>>> >
>>> > On Sun, Nov 3, 2024 at 1:28 AM Szabo, Istvan (Agoda)
>>> > <Istvan.Szabo@xxxxxxxxx> wrote:
>>> >> Hi,
>>> >>
>>> >> I'm updating from Octopus to Quincy, and everywhere in our cluster, when index
>>> >> pool recovery kicks off, cluster operation drops to 1% and slow ops come non-stop.
>>> >> The recovery takes 1-2 hours per node.
>>> >> What I can see is that the iowait on the NVMe drives belonging to the index pool
>>> >> is pretty high; however, the throughput is less than 500 MB/s and the IOPS are
>>> >> less than 5000/sec.
>>> > ...
>>> >> After the update and machine reboot, compaction kicks off, which generates 30-40
>>> >> iowait on the node. We use the "noup" flag to keep these OSDs out of the cluster
>>> >> until compaction has finished; however, once we have 0 iowait after compaction, I
>>> >> unset noup so recovery can start, which causes the above issue. If I didn't set
>>> >> noup, it would cause an even bigger issue.
>>> >
>>> > By any chance, are you specifying a value for
>>> > bluestore_rocksdb_options in your ceph.conf? The compaction
>>> > observation at reboot in particular is odd.
>>> >
>>> > Tyler
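As a side note, a quick way to check where a setting such as bluestore_rocksdb_options actually comes from (the running value, the mon config database, or a local ceph.conf override) might look like this; osd.0 is just an example id:

    # Value the running OSD is actually using.
    ceph tell osd.0 config show | grep bluestore_rocksdb_options

    # Any override stored in the centralized config database.
    ceph config dump | grep bluestore_rocksdb_options

    # Any local override in ceph.conf on the OSD host.
    grep -i bluestore_rocksdb_options /etc/ceph/ceph.conf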
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx