Re: How to detect condition for offline compaction of RocksDB?

Hi Rudenko,

There has been a bug [1] in the past that prevented the BlueFS spillover alert from popping up in 'ceph -s', due to some code refactoring. You might just be facing spillover without noticing it.
I'm saying this because you're running v16.2.13 and this bug was only fixed in v16.2.14 (by [3], according to the Pacific release notes [2]).

Have you checked slow_used_bytes in the output of 'ceph tell osd.x perf dump bluefs'?

Also, what is your db_total_bytes?
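
For example, something like this (osd.12 is just an example id, and jq is only used here for readability):

  ceph tell osd.12 perf dump bluefs | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_total_bytes, slow_used_bytes}'

A slow_used_bytes value greater than 0 would indicate that BlueFS has spilled over onto the slow device.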

Regards,
Frédéric.

PS: In Quincy, the bug was fixed in v17.2.7. In Reef, it should already be fixed since the PR was merged, even though I can't find it in the release notes.

[1] https://tracker.ceph.com/issues/58440
[2] https://docs.ceph.com/en/latest/releases/pacific/#v16-2-14-pacific
[3] https://github.com/ceph/ceph/pull/50932

----- On 16 Jul 24, at 17:11, Rudenko Aleksandr ARudenko@xxxxxxx wrote:

> Hi,
> 
> We have a big Ceph cluster (RGW use case) with a lot of big buckets (10-500M objects
> each, with 31-1024 shards) and a lot of I/O generated by many clients.
> The index pool is placed on enterprise SSDs. We have about 120 SSDs (replication 3) and
> about 90 GB of OMAP data on each drive.
> There are about 75 PGs on each SSD for now. I think 75 is not enough for this amount of
> data, but I'm not sure that it is critical in our case.
> 
> The problem:
> For the last few weeks, we have seen a big degradation of some SSD OSDs. We see
> a lot of Ops (150-256 in perf dump) pending for a long time, and a high average
> op time of around 1-3 seconds (this metric is based on
> dump_historic_ops_by_duration and our own avg calculation). These numbers are very
> unusual for our deployment, and it has a big impact on our customers.
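> 
> (Our avg calculation is roughly the following; osd.12 is just an example id and jq is assumed to be available:
> 
>   ceph daemon osd.12 dump_historic_ops_by_duration | jq '[.ops[].duration] | add / length'
> 
> i.e. the mean duration of the ops currently kept in the historic ops buffer.)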
> 
> For now, we run an offline compaction of the 'degraded' OSDs. After compaction, we
> see that all PGs of that OSD return to it (because recovery is
> disabled during compaction), and the OSD works perfectly for some time:
> all our metrics drop back down, for a few weeks or sometimes only a few days…
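> 
> (For reference, the offline compaction itself is done roughly like this; the osd id and data path are examples:
> 
>   ceph osd set noout
>   systemctl stop ceph-osd@12
>   ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
>   systemctl start ceph-osd@12
>   ceph osd unset noout
> 
> with noout set so the cluster does not start recovering while the OSD is down.)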
> 
> I think the problem is in the RocksDB database, which grows across levels and
> slows down requests.
> 
> I have a few questions about compaction:
> 
> 1. Why can't automatic compaction manage this on its own?
> 2. How can I see RocksDB level usage, or some programmatic metric that can be used
> as a condition for manual compaction? Our metrics, like request latency,
> OSD op count and OSD average slow-op time, do not correlate 100% with RocksDB's
> internal state.
> 
> We have tried to use the perf dump bluestore.kv_*_lat metrics, but I think they are
> pretty much useless here, because we can see higher values after compaction and
> an OSD restart than these metrics showed before compaction.
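> 
> (We read them with something like this; osd.12 is an example id and jq is assumed:
> 
>   ceph tell osd.12 perf dump bluestore | jq '.bluestore | {kv_flush_lat, kv_commit_lat, kv_sync_lat, kv_final_lat}'
> )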
> 
> We have tried looking at the OSD logs with debug_rocksdb=4, but we can't make sense of this
> output or compare it between good and bad OSDs:
> 
> Uptime(secs): 3610809.5 total, 600.0 interval
> Flush(GB): cumulative 50.519, interval 0.000
> AddFile(GB): cumulative 0.000, interval 0.000
> AddFile(Total Files): cumulative 0, interval 0
> AddFile(L0 Files): cumulative 0, interval 0
> AddFile(Keys): cumulative 0, interval 0
> Cumulative compaction: 475.34 GB write, 0.13 MB/s write, 453.07 GB read, 0.13 MB/s read, 3346.3 seconds
> Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
> Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
> 
> ** File Read Latency Histogram By Level [p-0] **
> 
> ** Compaction Stats [p-1] **
> Level    Files   Size      Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec)  KeyIn KeyDrop
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>   L0      0/0    0.00 KB    0.0      0.0     0.0      0.0      52.0     52.0       0.0   1.0      0.0    155.2    343.25            176.27       746    0.460       0       0
>   L1      4/0   204.00 MB   0.8     96.7    52.0     44.7      96.2     51.4       0.0   1.8    164.6    163.6    601.87            316.34       237    2.540    275M   1237K
>   L2     34/0    2.09 GB    1.0    176.8    43.6    133.2     175.3     42.1       7.6   4.0    147.6    146.3   1226.72            590.20       535    2.293    519M   2672K
>   L3     79/0    4.76 GB    0.3    163.6    38.9    124.7     140.7     16.1       8.8   3.6    152.6    131.3   1097.53            443.60       432    2.541    401M     79M
>   L4    316/0   19.96 GB    0.1      0.9     0.4      0.5       0.8      0.3      19.7   2.0    151.1    131.1      5.94              2.53         4    1.485   2349K    385K
>  Sum    433/0   27.01 GB    0.0    438.0   134.9    303.0     465.0    161.9      36.1   8.9    136.9    145.4   3275.31           1528.94      1954    1.676   1198M     83M
>  Int      0/0    0.00 KB    0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0       0
> 
> and this log output is not really usable as a 'metric'.
> 
> Our Ceph version is 16.2.13 and we use the default bluestore_rocksdb_options.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



