Re: How to detect condition for offline compaction of RocksDB?

Hi Frédéric,

Thank you for your reply!

We can't be facing spillover because we use dedicated SSD OSDs (no slow device), which are mapped to the index pool in our RGW deployment:

ceph tell osd.60 perf dump bluefs | grep slow
        "slow_total_bytes": 0,
        "slow_used_bytes": 0,
        "bytes_written_slow": 0,
        "max_bytes_slow": 0,

On 17.07.2024, 09:45, "Frédéric Nass" <frederic.nass@xxxxxxxxxxxxxxxx> wrote:


Hi Rudenko,


There's been this bug [1] in the past preventing the BlueFS spillover alert from popping up in 'ceph -s' due to some code refactoring. You might just be facing spillover without noticing.
I'm saying this because you're running v16.2.13 and this bug was fixed in v16.2.14 (by [3], according to the Pacific release notes [2]).


Have you checked slow_used_bytes on 'ceph tell osd.x perf dump bluefs'?


Also what's your db_total_bytes?
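
Something like this shows both at once (a quick sketch assuming jq; the counter names come from the bluefs section of perf dump):

# Replace osd.0 with the id of the OSD you want to inspect; requires jq.
ceph tell osd.0 perf dump bluefs \
    | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'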


Regards,
Frédéric.


PS: In Quincy, the bug was fixed in 17.2.7. In Reef, it should already be fixed since the PR was merged, even though I can't find it in the release notes.


[1] https://tracker.ceph.com/issues/58440
[2] https://docs.ceph.com/en/latest/releases/pacific/#v16-2-14-pacific
[3] https://github.com/ceph/ceph/pull/50932


----- On 16 Jul 24, at 17:11, Rudenko Aleksandr <ARudenko@xxxxxxx> wrote:


> Hi,
> 
> We have a big Ceph cluster (RGW use case) with a lot of big buckets (10-500M objects,
> 31-1024 shards each) and a lot of I/O generated by many clients.
> The index pool is placed on enterprise SSDs. We have about 120 SSDs (replication 3) and
> about 90 GB of OMAP data on each drive.
> There are about 75 PGs on each SSD for now. I think 75 is not enough for this amount of
> data, but I'm not sure it is critical in our case.
> 
> The problem:
> For the last few weeks, we have seen big degradation of some SSD OSDs. They show
> a lot of in-flight ops (150-256 in perf dump) for long periods and a high average
> op time of 1-3 seconds (this metric is based on dump_historic_ops_by_duration and
> our own averaging). These numbers are very unusual for our deployment, and this
> has a big impact on our customers.
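> 
> (Our "avg calculation" is roughly the following; a simplified sketch assuming jq and
> running on the OSD's host, not our exact script:)
> 
> # Average duration, in seconds, of the slowest recent ops on one OSD
> # (osd.60 is just an example id).
> ceph daemon osd.60 dump_historic_ops_by_duration \
>     | jq '[.ops[].duration] | add / length'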
> 
> For now, we run offline compaction on the 'degraded' OSDs. After compaction, all
> PGs of the OSD return to it (because recovery is disabled during compaction), and
> the OSD works perfectly for a while: all our metrics drop back down, sometimes for
> a few weeks, sometimes for only a few days…
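> 
> (The offline compaction we run is roughly the following; a sketch assuming a
> non-containerized deployment with systemd units, not an exact copy of our scripts:)
> 
> id=60                                          # example OSD id
> ceph osd set noout && ceph osd set norecover   # no rebalancing/recovery while the OSD is down
> systemctl stop ceph-osd@$id
> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$id compact
> systemctl start ceph-osd@$id
> ceph osd unset noout && ceph osd unset norecover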
> 
> I think the problem is in the RocksDB database, which grows across levels and
> slows down requests.
> 
> And I have a few questions about compaction:
> 
> 1. Why can't automatic compaction manage this on its own?
> 2. How can I see RocksDB level usage, or some programmatic metric that can be used
> as a condition for manual compaction? Our metrics like request latency, OSD op
> count and OSD average slow-op time don't directly reflect RocksDB's internal
> state.
> 
> We tried to use the perf dump bluestore.kv_xxx_lat metrics, but I think they are
> useless here, because we can see higher values after compacting and restarting an
> OSD than these metrics showed before compaction.
> 
> We tried looking at OSD logs with debug_rocksdb=4, but we can't understand this
> output well enough to compare it between good and bad OSDs:
> 
> Uptime(secs): 3610809.5 total, 600.0 interval
> Flush(GB): cumulative 50.519, interval 0.000
> AddFile(GB): cumulative 0.000, interval 0.000
> AddFile(Total Files): cumulative 0, interval 0
> AddFile(L0 Files): cumulative 0, interval 0
> AddFile(Keys): cumulative 0, interval 0
> Cumulative compaction: 475.34 GB write, 0.13 MB/s write, 453.07 GB read, 0.13 MB/s read, 3346.3 seconds
> Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
> Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
> 
> ** File Read Latency Histogram By Level [p-0] **
> 
> ** Compaction Stats [p-1] **
> Level  Files   Size       Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  L0     0/0      0.00 KB    0.0       0.0     0.0       0.0       52.0      52.0        0.0    1.0       0.0     155.2     343.25             176.27        746     0.460      0        0
>  L1     4/0    204.00 MB    0.8      96.7    52.0      44.7       96.2      51.4        0.0    1.8     164.6     163.6     601.87             316.34        237     2.540   275M    1237K
>  L2    34/0      2.09 GB    1.0     176.8    43.6     133.2      175.3      42.1        7.6    4.0     147.6     146.3    1226.72             590.20        535     2.293   519M    2672K
>  L3    79/0      4.76 GB    0.3     163.6    38.9     124.7      140.7      16.1        8.8    3.6     152.6     131.3    1097.53             443.60        432     2.541   401M      79M
>  L4   316/0     19.96 GB    0.1       0.9     0.4       0.5        0.8       0.3       19.7    2.0     151.1     131.1       5.94               2.53          4     1.485  2349K     385K
>  Sum  433/0     27.01 GB    0.0     438.0   134.9     303.0      465.0     161.9       36.1    8.9     136.9     145.4    3275.31            1528.94       1954     1.676  1198M      83M
>  Int    0/0      0.00 KB    0.0       0.0     0.0       0.0        0.0       0.0        0.0    0.0       0.0       0.0       0.00               0.00          0     0.000      0        0
> 
> and this log is not directly usable as a 'metric'.
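> 
> (The closest thing to a metric we can imagine is scraping the per-level Score column
> out of that table, since, as far as we understand, a score >= 1.0 means RocksDB
> considers the level due for compaction. A rough sketch, not something we run:)
> 
> # Print level, file count, size and compaction Score for every per-level row of the
> # Compaction Stats tables in an OSD log (osd.60 is an example id). The log contains
> # one table per column family per stats interval, hence the tail for the newest rows.
> awk '/^ *(L[0-9]+|Sum) / { printf "%-4s files=%-8s size=%s %s score=%s\n", $1, $2, $3, $4, $5 }' \
>     /var/log/ceph/ceph-osd.60.log | tail -n 20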
> 
> Our Ceph version is 16.2.13 and we use the default bluestore_rocksdb_options.
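> 
> (The effective value can be confirmed with something like:)
> 
> ceph config show osd.60 bluestore_rocksdb_options   # example OSD id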
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



