Re: How to detect condition for offline compaction of RocksDB?

Hi Josh, thanks!

I have one more question. I am trying to reproduce our OSD degradation caused by massive lifecycle deletions, and as a next step I will try the rocksdb_cf_compact_on_deletion fix. But there is one thing I don't understand.

Okay, default auto-compaction can't detect growing tombstone counts, but regular compaction based on file size does take place, as far as I can see.

For example, we have a degraded OSD (256 Ops for a few hours).

And I can see that some compactions take place:

grep -E "Compaction start" /var/log/ceph/ceph-osd.75.log

2024-07-18T14:15:06.759+0300 7f6e67047700  4 rocksdb: [compaction/compaction_job.cc:1680] [default] Compaction start summary: Base version 1410 Base level 0, inputs: [366238(51MB) 366236(52MB) 366234(51MB) 366232(47MB)], [366199(67MB) 366200(67MB) 366201(67MB) 366202(11MB)]
2024-07-18T14:15:09.076+0300 7f6e67047700  4 rocksdb: [compaction/compaction_job.cc:1680] [default] Compaction start summary: Base version 1411 Base level 1, inputs: [366244(66MB)], [366204(66MB) 366205(67MB) 366206(66MB) 366207(66MB) 366208(32MB) 366209(67MB) 366210(66MB)]
2024-07-18T14:15:12.054+0300 7f6e67047700  4 rocksdb: [compaction/compaction_job.cc:1680] [default] Compaction start summary: Base version 1412 Base level 1, inputs: [366240(67MB)], [366154(55MB) 366138(67MB) 366139(67MB) 366140(67MB)]
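
To see at which levels regular compaction is running, I also tally the "Base level" field from these lines (a quick sketch against the same log file):

# count compaction starts per base level in the OSD log
grep -E "Compaction start" /var/log/ceph/ceph-osd.75.log \
  | grep -oE "Base level [0-9]+" | sort | uniq -c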

grep -E "compaction_started|compaction_finished" /var/log/ceph/ceph-osd.75.log

2024-07-18T14:15:21.094+0300 7f6e67047700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1721301321095580, "job": 2284, "event": "compaction_started", "compaction_reason": "LevelMaxLevelSize", "files_L3": [366275], "files_L4": [366230, 364501], "score": 1.00751, "input_data_size": 89528358}
2024-07-18T14:15:21.727+0300 7f6e67047700  4 rocksdb: (Original Log Time 2024/07/18-14:15:21.728545) EVENT_LOG_v1 {"time_micros": 1721301321728537, "job": 2284, "event": "compaction_finished", "compaction_time_micros": 627038, "compaction_time_cpu_micros": 440211, "output_level": 4, "num_output_files": 1, "total_output_size": 66142074, "num_input_records": 1642765, "num_output_records": 1057434, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 4, 33, 173, 1004, 0, 0]}
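
To check what actually triggers these compactions, I tally the compaction_reason field from the event log as well (a sketch; as far as I understand, compactions triggered by the tombstone detector would show up as "FilesMarkedForCompaction", while reasons like "LevelMaxLevelSize" above are size/level driven):

# count compactions by trigger reason
grep -oE '"compaction_reason": "[^"]+"' /var/log/ceph/ceph-osd.75.log \
  | sort | uniq -c | sort -rn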

So my question is: regular compaction clearly does some work, so why doesn't it help with the tombstones?
Why does only offline compaction help in our case?
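
For reference, this is roughly how I plan to enable the tombstone-triggered compaction (a sketch only; the two companion option names and the values below are my assumptions based on the thread Josh linked, not tuned recommendations):

# enable compaction of SST files that contain many tombstones
ceph config set osd rocksdb_cf_compact_on_deletion true
# assumed companion options: number of keys inspected per window and the
# tombstone count that marks a file for compaction (placeholder values)
ceph config set osd rocksdb_cf_compact_on_deletion_sliding_window 32768
ceph config set osd rocksdb_cf_compact_on_deletion_trigger 16384
# my assumption: the OSD needs a restart so the option is applied when
# RocksDB is opened (restart command depends on how the OSDs are deployed)
systemctl restart ceph-osd@75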


On 17.07.2024, 16:14, "Joshua Baergen" <jbaergen@xxxxxxxxxxxxxxxx> wrote:


Hey Aleksandr,


rocksdb_delete_range_threshold has had some downsides in the past (I
don't have a reference handy) so I don't recommend changing it.


> As I understand it, in the RGW case tombstones come only from object deletions, right?


It can also happen due to bucket reshards, as this will delete the old
shards after completing the reshard.


Josh


On Wed, Jul 17, 2024 at 3:23 AM Rudenko Aleksandr <ARudenko@xxxxxxx> wrote:
>
> Hi Josh,
>
> Thank you for your reply!
>
> It was helpful for me; now I understand that I can't measure RocksDB degradation using a programmatic metric, unfortunately.
>
> In our version (16.2.13) we already have this code (with the new option rocksdb_cf_compact_on_deletion), and we will try using it. As I understand it, in the RGW case tombstones come only from object deletions, right? Do you know of other cases where tombstones are generated in an RGW scenario?
>
> We have another option in our version: rocksdb_delete_range_threshold
>
> Do you think it can be helpful?
>
> I think our problem arises from the massive deletions generated by the lifecycle rule of a big bucket.
> On 16.07.2024, 19:25, "Joshua Baergen" <jbaergen@xxxxxxxxxxxxxxxx> wrote:
>
>
> Hello Aleksandr,
>
>
> What you're probably experiencing is tombstone accumulation, a known
> issue for Ceph's use of rocksdb.
>
>
> > 1. Why can't automatic compaction manage this on its own?
>
>
> rocksdb compaction is normally triggered by level fullness rather than
> tombstone counts. However, there is a feature in rocksdb that compacts
> a file when many tombstones are found in it during iteration, which can
> help immensely with tombstone accumulation problems; it is available as
> of 16.2.14. You can find a summary of how to enable it and tweak it here:
> https://www.spinics.net/lists/ceph-users/msg78514.html
>
>
> > 2. How can I see RocksDB level usage, or some programmatic metric that can be used as a condition for manual compaction?
>
>
> There is no tombstone counter that I'm aware of, which is really what
> you need in order to trigger compaction when appropriate.
>
>
> Josh
>
>
> On Tue, Jul 16, 2024 at 9:12 AM Rudenko Aleksandr <ARudenko@xxxxxxx> wrote:
> >
> > Hi,
> >
> > We have a big Ceph cluster (RGW use case) with a lot of big buckets (10-500M objects, 31-1024 shards) and a lot of I/O generated by many clients.
> > The index pool is placed on enterprise SSDs. We have about 120 SSDs (replication 3) and about 90 GB of OMAP data on each drive.
> > There are about 75 PGs on each SSD for now. I think 75 is not enough for this amount of data, but I'm not sure it is critical in our case.
> >
> > The problem:
> > For the last few weeks, we have seen big degradation of some SSD OSDs: a lot of ops (150-256 in perf dump) for a long time, and high average op times of 1-3 seconds (this metric is based on dump_historic_ops_by_duration and our own averaging). These numbers are very unusual for our deployment, and they have a big impact on our customers.
> >
> > For now, we run offline compaction of ‘degraded’ OSDs. After compaction, all PGs of the OSD are returned to it (because recovery is disabled during compaction), and we see that the OSD works perfectly for some time; all our metrics drop back down for a few weeks, or sometimes only a few days…
> >
> > And I think the problem is in the RocksDB database, which grows across levels and slows down requests.
> >
> > And I have a few questions about compaction:
> >
> > 1. Why can't automatic compaction manage this on its own?
> > 2. How can I see RocksDB level usage, or some programmatic metric that can be used as a condition for manual compaction? Our metrics, like request latency, OSD ops count, and OSD average slow-op time, do not correlate 100% with the RocksDB internal state.
> >
> > We tried using the perf dump bluestore.kv_xxx_lat metrics, but I think they are absolutely useless, because we can see higher values after compaction and an OSD restart than these metrics showed before compaction, lol.
> >
> > We also looked at the OSD logs with debug_rocksdb=4, but we can't interpret this output or compare it between good and bad OSDs:
> >
> > Uptime(secs): 3610809.5 total, 600.0 interval
> > Flush(GB): cumulative 50.519, interval 0.000
> > AddFile(GB): cumulative 0.000, interval 0.000
> > AddFile(Total Files): cumulative 0, interval 0
> > AddFile(L0 Files): cumulative 0, interval 0
> > AddFile(Keys): cumulative 0, interval 0
> > Cumulative compaction: 475.34 GB write, 0.13 MB/s write, 453.07 GB read, 0.13 MB/s read, 3346.3 seconds
> > Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
> > Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction,
> > 0 memtable_slowdown, interval 0 total count
> >
> > ** File Read Latency Histogram By Level [p-0] **
> >
> > ** Compaction Stats [p-1] **
> > Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
> > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > L0 0/0 0.00 KB 0.0 0.0 0.0 0.0 52.0 52.0 0.0 1.0 0.0 155.2 343.25 176.27 746 0.460 0 0
> > L1 4/0 204.00 MB 0.8 96.7 52.0 44.7 96.2 51.4 0.0 1.8 164.6 163.6 601.87 316.34 237 2.540 275M 1237K
> > L2 34/0 2.09 GB 1.0 176.8 43.6 133.2 175.3 42.1 7.6 4.0 147.6 146.3 1226.72 590.20 535 2.293 519M 2672K
> > L3 79/0 4.76 GB 0.3 163.6 38.9 124.7 140.7 16.1 8.8 3.6 152.6 131.3 1097.53 443.60 432 2.541 401M 79M
> > L4 316/0 19.96 GB 0.1 0.9 0.4 0.5 0.8 0.3 19.7 2.0 151.1 131.1 5.94 2.53 4 1.485 2349K 385K
> > Sum 433/0 27.01 GB 0.0 438.0 134.9 303.0 465.0 161.9 36.1 8.9 136.9 145.4 3275.31 1528.94 1954 1.676 1198M 83M
> > Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0
> >
> > and this log output is not usable as a ‘metric’.
> >
> > Our Ceph version is 16.2.13, and we use the default bluestore_rocksdb_options.



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



