Re: Ceph 16.2.x mon compactions, disk writes

Eugen Block <eblock@xxxxxx> · Wed, 11 Oct 2023 11:50:08 +0000

Can you add some more details as requested by Frank? Which mgr modules  
are enabled? What's the current 'ceph -s' output?

Is autoscaler running and doing stuff?
Is balancer running and doing stuff?
Is backfill going on?
Is recovery going on?
Is your ceph version affected by the "excessive logging to MON  
store" issue that was present starting with pacific but should have  
been addressed

Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:

We don't use CephFS at all and don't have RBD snapshots apart from some
cloning for Openstack images.

The size of mon stores isn't an issue, it's < 600 MB. But it gets
overwritten often causing lots of disk writes, and that is an issue for us.

/Z

On Wed, 11 Oct 2023 at 14:37, Eugen Block <eblock@xxxxxx> wrote:

Do you use many snapshots (rbd or cephfs)? That can cause a heavy
monitor usage, we've seen large mon stores on  customer clusters with
rbd mirroring on snapshot basis. In a healthy cluster they have mon
stores of around 2GB in size.

>> @Eugen: Was there not an option to limit logging to the MON store?

I don't recall at the moment, worth checking tough.

Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:

> Thank you, Frank.
>
> The cluster is healthy, operating normally, nothing unusual is going on.
We
> observe lots of writes by mon processes into mon rocksdb stores,
> specifically:
>
> /var/lib/ceph/mon/ceph-cephXX/store.db:
> 65M     3675511.sst
> 65M     3675512.sst
> 65M     3675513.sst
> 65M     3675514.sst
> 65M     3675515.sst
> 65M     3675516.sst
> 65M     3675517.sst
> 65M     3675518.sst
> 62M     3675519.sst
>
> The site of the files is not huge, but monitors rotate and write out
these
> files often, sometimes several times per minute, resulting in lots of
data
> written to disk. The writes coincide with "manual compaction" events
logged
> by the monitors, for example:
>
> debug 2023-10-11T11:10:10.483+0000 7f48a3a9b700  4 rocksdb:
> [compaction/compaction_job.cc:1676] [default] [JOB 70854] Compacting 1@5
+
> 9@6 files to L6, score -1.00
> debug 2023-10-11T11:10:10.483+0000 7f48a3a9b700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1697022610487624, "job": 70854, "event":
> "compaction_started", "compaction_reason": "ManualCompaction",
"files_L5":
> [3675543], "files_L6": [3675533, 3675534, 3675535, 3675536, 3675537,
> 3675538, 3675539, 3675540, 3675541], "score": -1, "input_data_size":
> 601117031}
> debug 2023-10-11T11:10:10.619+0000 7f48a3a9b700  4 rocksdb:
> [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table
> #3675544: 2015 keys, 67287115 bytes
> debug 2023-10-11T11:10:10.763+0000 7f48a3a9b700  4 rocksdb:
> [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table
> #3675545: 24343 keys, 67336225 bytes
> debug 2023-10-11T11:10:10.899+0000 7f48a3a9b700  4 rocksdb:
> [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table
> #3675546: 1196 keys, 67225813 bytes
> debug 2023-10-11T11:10:11.035+0000 7f48a3a9b700  4 rocksdb:
> [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table
> #3675547: 1049 keys, 67252678 bytes
> debug 2023-10-11T11:10:11.167+0000 7f48a3a9b700  4 rocksdb:
> [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table
> #3675548: 1081 keys, 67216638 bytes
> debug 2023-10-11T11:10:11.303+0000 7f48a3a9b700  4 rocksdb:
> [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table
> #3675549: 1196 keys, 67245376 bytes
> debug 2023-10-11T11:10:12.023+0000 7f48a3a9b700  4 rocksdb:
> [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table
> #3675550: 1195 keys, 67246813 bytes
> debug 2023-10-11T11:10:13.059+0000 7f48a3a9b700  4 rocksdb:
> [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table
> #3675551: 1205 keys, 67223302 bytes
> debug 2023-10-11T11:10:13.903+0000 7f48a3a9b700  4 rocksdb:
> [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table
> #3675552: 1312 keys, 56416011 bytes
> debug 2023-10-11T11:10:13.911+0000 7f48a3a9b700  4 rocksdb:
> [compaction/compaction_job.cc:1415] [default] [JOB 70854] Compacted 1@5
+
> 9@6 files to L6 => 594449971 bytes
> debug 2023-10-11T11:10:13.915+0000 7f48a3a9b700  4 rocksdb: (Original Log
> Time 2023/10/11-11:10:13.920991) [compaction/compaction_job.cc:760]
> [default] compacted to: base level 5 level multiplier 10.00 max bytes
base
> 268435456 files[0 0 0 0 0 0 9] max score 0.00, MB/sec: 175.8 rd, 173.9
wr,
> level 6, files in(1, 9) out(9) MB in(0.3, 572.9) out(566.9),
> read-write-amplify(3434.6) write-amplify(1707.7) OK, records in: 35108,
> records dropped: 516 output_compression: NoCompression
> debug 2023-10-11T11:10:13.915+0000 7f48a3a9b700  4 rocksdb: (Original Log
> Time 2023/10/11-11:10:13.921010) EVENT_LOG_v1 {"time_micros":
> 1697022613921002, "job": 70854, "event": "compaction_finished",
> "compaction_time_micros": 3418822, "compaction_time_cpu_micros": 785454,
> "output_level": 6, "num_output_files": 9, "total_output_size": 594449971,
> "num_input_records": 35108, "num_output_records": 34592,
> "num_subcompactions": 1, "output_compression": "NoCompression",
> "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0,
> "lsm_state": [0, 0, 0, 0, 0, 0, 9]}
>
> The log even mentions the huge write multiplication. I wonder whether
this
> is normal and what can be done about it.
>
> /Z
>
> On Wed, 11 Oct 2023 at 13:55, Frank Schilder <frans@xxxxxx> wrote:
>
>> I need to ask here: where exactly do you observe the hundreds of GB
>> written per day? Are the mon logs huge? Is it the mon store? Is your
>> cluster unhealthy?
>>
>> We have an octopus cluster with 1282 OSDs, 1650 ceph fs clients and
about
>> 800 librbd clients. Per week our mon logs are  about 70M, the cluster
logs
>> about 120M , the audit logs about 70M and I see between 100-200Kb/s
writes
>> to the mon store. That's in the lower-digit GB range per day. Hundreds
of
>> GB per day sound completely over the top on a healthy cluster, unless
you
>> have MGR modules changing the OSD/cluster map continuously.
>>
>> Is autoscaler running and doing stuff?
>> Is balancer running and doing stuff?
>> Is backfill going on?
>> Is recovery going on?
>> Is your ceph version affected by the "excessive logging to MON store"
>> issue that was present starting with pacific but should have been
addressed
>> by now?
>>
>> @Eugen: Was there not an option to limit logging to the MON store?
>>
>> For information to readers, we followed old recommendations from a Dell
>> white paper for building a ceph cluster and have a 1TB Raid10 array on
6x
>> write intensive SSDs for the MON stores. After 5 years we are below 10%
>> wear. Average size of the MON store for a healthy cluster is 500M-1G,
but
>> we have seen this ballooning to 100+GB in degraded conditions.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Zakhar Kirpichenko <zakhar@xxxxxxxxx>
>> Sent: Wednesday, October 11, 2023 12:00 PM
>> To: Eugen Block
>> Cc: ceph-users@xxxxxxx
>> Subject:  Re: Ceph 16.2.x mon compactions, disk writes
>>
>> Thank you, Eugen.
>>
>> I'm interested specifically to find out whether the huge amount of data
>> written by monitors is expected. It is eating through the endurance of
our
>> system drives, which were not specced for high DWPD/TBW, as this is not
a
>> documented requirement, and monitors produce hundreds of gigabytes of
>> writes per day. I am looking for ways to reduce the amount of writes, if
>> possible.
>>
>> /Z
>>
>> On Wed, 11 Oct 2023 at 12:41, Eugen Block <eblock@xxxxxx> wrote:
>>
>> > Hi,
>> >
>> > what you report is the expected behaviour, at least I see the same on
>> > all clusters. I can't answer why the compaction is required that
>> > often, but you can control the log level of the rocksdb output:
>> >
>> > ceph config set mon debug_rocksdb 1/5 (default is 4/5)
>> >
>> > This reduces the log entries and you wouldn't see the manual
>> > compaction logs anymore. There are a couple more rocksdb options but I
>> > probably wouldn't change too much, only if you know what you're doing.
>> > Maybe Igor can comment if some other tuning makes sense here.
>> >
>> > Regards,
>> > Eugen
>> >
>> > Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>> >
>> > > Any input from anyone, please?
>> > >
>> > > On Tue, 10 Oct 2023 at 09:44, Zakhar Kirpichenko <zakhar@xxxxxxxxx>
>> > wrote:
>> > >
>> > >> Any input from anyone, please?
>> > >>
>> > >> It's another thing that seems to be rather poorly documented: it's
>> > unclear
>> > >> what to expect, what 'normal' behavior should be, and what can be
done
>> > >> about the huge amount of writes by monitors.
>> > >>
>> > >> /Z
>> > >>
>> > >> On Mon, 9 Oct 2023 at 12:40, Zakhar Kirpichenko <zakhar@xxxxxxxxx>
>> > wrote:
>> > >>
>> > >>> Hi,
>> > >>>
>> > >>> Monitors in our 16.2.14 cluster appear to quite often run "manual
>> > >>> compaction" tasks:
>> > >>>
>> > >>> debug 2023-10-09T09:30:53.888+0000 7f48a329a700  4 rocksdb:
>> > EVENT_LOG_v1
>> > >>> {"time_micros": 1696843853892760, "job": 64225, "event":
>> > "flush_started",
>> > >>> "num_memtables": 1, "num_entries": 715, "num_deletes": 251,
>> > >>> "total_data_size": 3870352, "memory_usage": 3886744,
"flush_reason":
>> > >>> "Manual Compaction"}
>> > >>> debug 2023-10-09T09:30:53.904+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:30:53.908+0000 7f48a3a9b700  4 rocksdb:
(Original
>> > Log
>> > >>> Time 2023/10/09-09:30:53.910204)
>> > [db_impl/db_impl_compaction_flush.cc:2516]
>> > >>> [default] Manual compaction from level-0 to level-5 from 'paxos ..
>> > 'paxos;
>> > >>> will stop at (end)
>> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:30:53.908+0000 7f48a3a9b700  4 rocksdb:
(Original
>> > Log
>> > >>> Time 2023/10/09-09:30:53.911004)
>> > [db_impl/db_impl_compaction_flush.cc:2516]
>> > >>> [default] Manual compaction from level-5 to level-6 from 'paxos ..
>> > 'paxos;
>> > >>> will stop at (end)
>> > >>> debug 2023-10-09T09:32:08.956+0000 7f48a329a700  4 rocksdb:
>> > EVENT_LOG_v1
>> > >>> {"time_micros": 1696843928961390, "job": 64228, "event":
>> > "flush_started",
>> > >>> "num_memtables": 1, "num_entries": 1580, "num_deletes": 502,
>> > >>> "total_data_size": 8404605, "memory_usage": 8465840,
"flush_reason":
>> > >>> "Manual Compaction"}
>> > >>> debug 2023-10-09T09:32:08.972+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:32:08.976+0000 7f48a3a9b700  4 rocksdb:
(Original
>> > Log
>> > >>> Time 2023/10/09-09:32:08.977739)
>> > [db_impl/db_impl_compaction_flush.cc:2516]
>> > >>> [default] Manual compaction from level-0 to level-5 from 'logm ..
>> > 'logm;
>> > >>> will stop at (end)
>> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:32:08.976+0000 7f48a3a9b700  4 rocksdb:
(Original
>> > Log
>> > >>> Time 2023/10/09-09:32:08.978512)
>> > [db_impl/db_impl_compaction_flush.cc:2516]
>> > >>> [default] Manual compaction from level-5 to level-6 from 'logm ..
>> > 'logm;
>> > >>> will stop at (end)
>> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:33:29.028+0000 7f48a329a700  4 rocksdb:
>> > EVENT_LOG_v1
>> > >>> {"time_micros": 1696844009033151, "job": 64231, "event":
>> > "flush_started",
>> > >>> "num_memtables": 1, "num_entries": 1430, "num_deletes": 251,
>> > >>> "total_data_size": 8975535, "memory_usage": 9035920,
"flush_reason":
>> > >>> "Manual Compaction"}
>> > >>> debug 2023-10-09T09:33:29.044+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:33:29.048+0000 7f48a3a9b700  4 rocksdb:
(Original
>> > Log
>> > >>> Time 2023/10/09-09:33:29.049585)
>> > [db_impl/db_impl_compaction_flush.cc:2516]
>> > >>> [default] Manual compaction from level-0 to level-5 from 'paxos ..
>> > 'paxos;
>> > >>> will stop at (end)
>> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb:
>> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> compaction
>> > >>> starting
>> > >>> debug 2023-10-09T09:33:29.048+0000 7f48a3a9b700  4 rocksdb:
(Original
>> > Log
>> > >>> Time 2023/10/09-09:33:29.050355)
>> > [db_impl/db_impl_compaction_flush.cc:2516]
>> > >>> [default] Manual compaction from level-5 to level-6 from 'paxos ..
>> > 'paxos;
>> > >>> will stop at (end)
>> > >>>
>> > >>> I have removed a lot of interim log messages to save space.
>> > >>>
>> > >>> During each compaction the monitor process writes approximately
>> 500-600
>> > >>> MB of data to disk over a short period of time. These writes add
up
>> to
>> > tens
>> > >>> of gigabytes per hour and hundreds of gigabytes per day.
>> > >>>
>> > >>> Monitor rocksdb and compaction options are default:
>> > >>>
>> > >>>     "mon_compact_on_bootstrap": "false",
>> > >>>     "mon_compact_on_start": "false",
>> > >>>     "mon_compact_on_trim": "true",
>> > >>>     "mon_rocksdb_options":
>> > >>>
>> >
>>
"write_buffer_size=33554432,compression=kNoCompression,level_compaction_dynamic_level_bytes=true",
>> > >>>
>> > >>> Is this expected behavior? Is this something I can adjust in
order to
>> > >>> extend the system storage life?
>> > >>>
>> > >>> Best regards,
>> > >>> Zakhar
>> > >>>
>> > >>
>> > > _______________________________________________
>> > > ceph-users mailing list -- ceph-users@xxxxxxx
>> > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> >
>> >
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users@xxxxxxx
>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> >
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx