Hi Frédéric,

Thank you for your reply! We can't be hitting spillover, because we use dedicated SSD OSDs
(no slow device), which are mapped to the index pool in our RGW deployment:

    ceph tell osd.60 perf dump bluefs | grep slow
        "slow_total_bytes": 0,
        "slow_used_bytes": 0,
        "bytes_written_slow": 0,
        "max_bytes_slow": 0,

Some command sketches for the db_total_bytes check and for compaction are below the quoted
message.

On 17.07.2024, 09:45, "Frédéric Nass" <frederic.nass@xxxxxxxxxxxxxxxx> wrote:

Hi Rudenko,

There's been a bug [1] in the past preventing the BlueFS alert from popping up in 'ceph -s'
due to some code refactoring. You might just be facing spillover without noticing. I'm
saying this because you're running v16.2.13, and this bug was fixed in v16.2.14 (by [3],
according to the Pacific release notes [2]).

Have you checked slow_used_bytes in 'ceph tell osd.x perf dump bluefs'? Also, what's your
db_total_bytes?

Regards,
Frédéric.

PS: In Quincy the bug was fixed in 17.2.7, and in Reef it should already be fixed since the
PR was merged, even though I can't find it in the release notes.

[1] https://tracker.ceph.com/issues/58440
[2] https://docs.ceph.com/en/latest/releases/pacific/#v16-2-14-pacific
[3] https://github.com/ceph/ceph/pull/50932

----- On Jul 16, 2024, at 17:11, Rudenko Aleksandr <ARudenko@xxxxxxx> wrote:

> Hi,
>
> We have a big Ceph cluster (RGW use case) with a lot of big buckets (10-500M objects,
> 31-1024 shards) and a lot of I/O generated by many clients.
> The index pool is placed on enterprise SSDs. We have about 120 SSDs (replication 3) and
> about 90 GB of OMAP data on each drive.
> There are about 75 PGs on each SSD for now. I think 75 is not enough for this amount of
> data, but I'm not sure that it is critical in our case.
>
> The problem:
> For the last few weeks we have been seeing big degradation of some SSD OSDs. We see a
> lot of ops (150-256 in perf dump) for long periods and a high average op time of 1-3
> seconds (this metric is based on dump_historic_ops_by_duration and our own averaging).
> These numbers are very unusual for our deployment, and they have a big impact on our
> customers.
>
> For now, we run offline compaction on the 'degraded' OSDs. After compaction, all PGs of
> the OSD come back to it (because recovery is disabled during compaction), and the OSD
> works perfectly for some time; all our metrics drop back down for a few weeks, or
> sometimes only a few days…
>
> I think the problem is in the RocksDB database, which grows through its levels and slows
> down requests.
>
> I have a few questions about compaction:
>
> 1. Why can't automatic compaction manage this on its own?
> 2. How can I see RocksDB level usage, or some programmatic metric that can be used as a
> condition for manual compaction? Our own metrics (request latency, OSD op count, OSD
> average slow-op time) do not correlate 100% with the RocksDB internal state.
>
> We tried the perf dump bluestore.kv_xxx_lat metrics, but I think they are useless here:
> we can see higher values after compaction and an OSD restart than these metrics showed
> before compaction.
>
> We also tried looking at the OSD logs with debug_rocksdb=4, but we can't make sense of
> this output or compare it between good and bad OSDs:
>
> Uptime(secs): 3610809.5 total, 600.0 interval
> Flush(GB): cumulative 50.519, interval 0.000
> AddFile(GB): cumulative 0.000, interval 0.000
> AddFile(Total Files): cumulative 0, interval 0
> AddFile(L0 Files): cumulative 0, interval 0
> AddFile(Keys): cumulative 0, interval 0
> Cumulative compaction: 475.34 GB write, 0.13 MB/s write, 453.07 GB read, 0.13 MB/s read, 3346.3 seconds
> Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
> Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
>
> ** File Read Latency Histogram By Level [p-0] **
>
> ** Compaction Stats [p-1] **
> Level  Files  Size       Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> L0     0/0    0.00 KB    0.0    0.0       0.0     0.0       52.0       52.0      0.0        1.0    0.0       155.2     343.25     176.27             746        0.460     0      0
> L1     4/0    204.00 MB  0.8    96.7      52.0    44.7      96.2       51.4      0.0        1.8    164.6     163.6     601.87     316.34             237        2.540     275M   1237K
> L2     34/0   2.09 GB    1.0    176.8     43.6    133.2     175.3      42.1      7.6        4.0    147.6     146.3     1226.72    590.20             535        2.293     519M   2672K
> L3     79/0   4.76 GB    0.3    163.6     38.9    124.7     140.7      16.1      8.8        3.6    152.6     131.3     1097.53    443.60             432        2.541     401M   79M
> L4     316/0  19.96 GB   0.1    0.9       0.4     0.5       0.8        0.3       19.7       2.0    151.1     131.1     5.94       2.53               4          1.485     2349K  385K
> Sum    433/0  27.01 GB   0.0    438.0     134.9   303.0     465.0      161.9     36.1       8.9    136.9     145.4     3275.31    1528.94            1954       1.676     1198M  83M
> Int    0/0    0.00 KB    0.0    0.0       0.0     0.0       0.0        0.0       0.0        0.0    0.0       0.0       0.00       0.00               0          0.000     0      0
>
> and this log is not usable as a 'metric'.
>
> Our Ceph version is 16.2.13 and we use the default bluestore_rocksdb_options.
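
On Frédéric's question about db_total_bytes: those counters live in the same bluefs section
of the perf dump used above for the slow_* values, so a small loop like this (the OSD ids
are just examples, substitute the index-pool OSDs) pulls them out together:

    for osd in 60 61 62; do
        echo "osd.$osd"
        ceph tell osd.$osd perf dump bluefs | \
            grep -E '"(db|slow)_(total|used)_bytes"'
    done

As far as I understand, db_used_bytes approaching db_total_bytes (or any non-zero
slow_used_bytes) is what would point back at spillover.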
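
On the compaction workflow: compaction can also be triggered online, per OSD, without
taking the OSD down, and ceph-kvstore-tool covers the offline case. A rough sketch of both
variants, assuming the usual OSD data path and systemd unit name (adjust for your
deployment):

    # online: ask the running OSD to compact its RocksDB
    # (it keeps serving I/O, but client latency can suffer while it runs)
    ceph tell osd.60 compact

    # offline: stop the OSD first, compact, then start it again
    systemctl stop ceph-osd@60
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-60 compact
    systemctl start ceph-osd@60

There is also an osd_compact_on_start option (assuming it is available in 16.2.13) that
runs the same compaction on every OSD start.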
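
On question 2 above (a programmatic signal for when to compact): two things that might be
worth sampling, assuming they are present in this release, are the OSD's rocksdb perf
counters and the per-prefix omap histogram:

    # RocksDB counters exposed by the OSD itself; counters such as
    # compact_queue_len, submit_latency and get_latency (names as of Pacific,
    # check the actual dump) could serve as a "time to compact" threshold
    ceph tell osd.60 perf dump rocksdb

    # per-prefix key/size histogram of the OSD's KV store; run on the OSD's
    # host via the admin socket, and only off-peak, since it scans the DB
    ceph daemon osd.60 calc_objectstore_db_histogram

Neither of these maps 1:1 to the per-level picture in the debug_rocksdb dump, but both can
at least be collected per OSD and graphed over time.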