Hi,

We have a big Ceph cluster (RGW use case) with a lot of big buckets (10-500M objects, 31-1024 shards each) and a lot of I/O generated by many clients. The index pool is placed on enterprise SSDs: about 120 SSDs (replication 3), with about 90 GB of OMAP data on each drive and about 75 PGs per SSD for now. I think 75 is not enough for this amount of data, but I'm not sure that it is critical in our case.

The problem: for the last few weeks we have seen serious degradation of some SSD OSDs. We see a lot of ops (150-256 in perf dump) for long periods, and a high average op time of 1-3 seconds (this metric is based on dump_historic_ops_by_duration plus our own averaging; a sketch of that calculation is at the end of this mail). These numbers are very unusual for our deployment, and they have a big impact on our customers.

For now we run offline compaction on the 'degraded' OSDs. After compaction, all PGs of the OSD return to it (because recovery is disabled during compaction), and the OSD works perfectly for some time: all our metrics drop back to normal for a few weeks, or sometimes only a few days...

I think the problem is in the RocksDB database, which grows through the levels and slows down requests. I have a few questions about compaction:

1. Why can't automatic compaction manage this on its own?
2. How can I see RocksDB level usage, or some programmatic metric that can be used as a condition for manual compaction? Our own metrics (request latency, OSD op count, average slow-op time) do not correlate 100% with the internal RocksDB state.

We tried the perf dump bluestore.kv_*_lat metrics, but I think they are absolutely useless: after compaction and an OSD restart we can see higher values than these metrics showed before compaction, lol.

We also tried reading the OSD logs with debug_rocksdb=4, but we can't make sense of the output or compare it between good and bad OSDs:

Uptime(secs): 3610809.5 total, 600.0 interval
Flush(GB): cumulative 50.519, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 475.34 GB write, 0.13 MB/s write, 453.07 GB read, 0.13 MB/s read, 3346.3 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [p-0] **

** Compaction Stats [p-1] **
Level  Files    Size      Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn  KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0    0/0     0.00 KB    0.0      0.0    0.0      0.0      52.0     52.0       0.0   1.0      0.0    155.2    343.25            176.27       746    0.460      0        0
  L1    4/0   204.00 MB    0.8     96.7   52.0     44.7      96.2     51.4       0.0   1.8    164.6    163.6    601.87            316.34       237    2.540   275M    1237K
  L2   34/0     2.09 GB    1.0    176.8   43.6    133.2     175.3     42.1       7.6   4.0    147.6    146.3   1226.72            590.20       535    2.293   519M    2672K
  L3   79/0     4.76 GB    0.3    163.6   38.9    124.7     140.7     16.1       8.8   3.6    152.6    131.3   1097.53            443.60       432    2.541   401M      79M
  L4  316/0    19.96 GB    0.1      0.9    0.4      0.5       0.8      0.3      19.7   2.0    151.1    131.1      5.94              2.53         4    1.485  2349K     385K
 Sum  433/0    27.01 GB    0.0    438.0  134.9    303.0     465.0    161.9      36.1   8.9    136.9    145.4   3275.31           1528.94      1954    1.676  1198M      83M
 Int    0/0     0.00 KB    0.0      0.0    0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000      0        0
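The only number in this dump that looks like a usable signal to us is the per-level Score column (for L1 and above it is roughly the level's current size divided by its target size, so a sustained Score above 1 suggests compaction is falling behind). Here is a rough sketch of how one could scrape it from an OSD log; the log path and the exact table layout are assumptions based on the excerpt above, not a Ceph API:

#!/usr/bin/env python3
# Rough sketch: pull per-level "Score" values out of the periodic RocksDB
# "Compaction Stats" dump that an OSD writes with debug_rocksdb=4.
# Assumptions: the column order shown above (Level, Files, Size, Score, ...);
# with sharded column families (p-0, p-1, ...) there is one table per shard,
# and this sketch keeps only the last table seen in the log.
import re
import sys

# Matches rows like "  L2   34/0   2.09 GB   1.0 ..." and captures
# the level name and the Score column.
ROW = re.compile(r'\b(L\d+|Sum)\s+\d+/\d+\s+[\d.]+\s+\w+\s+([\d.]+)')

def last_scores(log_path):
    """Return {level: score} from the most recent stats dump in the log."""
    scores = {}
    with open(log_path) as f:
        for line in f:
            if 'Compaction Stats' in line:
                scores = {}  # a new dump starts; discard the previous one
            m = ROW.search(line)
            if m:
                scores[m.group(1)] = float(m.group(2))
    return scores

if __name__ == '__main__':
    # usage: ./rocksdb_scores.py /var/log/ceph/ceph-osd.<id>.log
    for level, score in sorted(last_scores(sys.argv[1]).items()):
        print(f'{level}: score={score}')

The idea would be to run something like this per OSD and trigger manual compaction when some level stays above Score 1 for a while.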
But parsing the log like this is fragile, and overall this output is not usable as a 'metric'.

Our Ceph version is 16.2.13, and we use the default bluestore_rocksdb_options.
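For reference, since I mentioned our avg ops time above: it is computed roughly like the following sketch (simplified; the real script differs in details). dump_historic_ops_by_duration is a standard OSD admin socket command; the averaging on top of it is our own:

#!/usr/bin/env python3
# Simplified version of our "avg ops time" metric: average the durations of
# the slowest recent ops reported by the OSD admin socket. Must run on the
# host where the OSD lives.
import json
import subprocess
import sys

def avg_historic_op_duration(osd_id):
    out = subprocess.check_output(
        ['ceph', 'daemon', f'osd.{osd_id}', 'dump_historic_ops_by_duration'])
    ops = json.loads(out).get('ops', [])
    if not ops:
        return 0.0
    # 'duration' is seconds from initiation to completion for each op
    return sum(op['duration'] for op in ops) / len(ops)

if __name__ == '__main__':
    osd_id = sys.argv[1]
    print(f'osd.{osd_id} avg historic op time: '
          f'{avg_historic_op_duration(osd_id):.3f} s')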