Re: Ceph 16.2.x mon compactions, disk writes


 



I need to ask here: where exactly do you observe the hundreds of GB written per day? Are the mon logs huge? Is it the mon store? Is your cluster unhealthy?

We have an Octopus cluster with 1282 OSDs, 1650 CephFS clients and about 800 librbd clients. Per week, our mon logs are about 70M, the cluster logs about 120M and the audit logs about 70M, and I see between 100-200Kb/s of writes to the mon store. That's in the low single-digit GB range per day. Hundreds of GB per day sounds completely over the top for a healthy cluster, unless you have MGR modules changing the OSD/cluster map continuously.
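As a back-of-the-envelope check (assuming the rate above is kilobits per second; 150 is just the midpoint of the 100-200 range):

```shell
# Rough conversion from sustained mon store write rate to daily volume.
# Assumes kbit/s; 150 is the midpoint of the 100-200 Kb/s observed above.
kbps=150
gb_per_day=$((kbps * 1000 / 8 * 86400 / 1000000000))
echo "~${gb_per_day} GB/day to the mon store"
```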

Is autoscaler running and doing stuff?
Is balancer running and doing stuff?
Is backfill going on?
Is recovery going on?
Is your Ceph version affected by the "excessive logging to MON store" issue that appeared in Pacific but should have been addressed by now?
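For readers, the quick checks corresponding to the questions above (a sketch using the standard ceph CLI, run from an admin node):

```shell
ceph osd pool autoscale-status   # is the autoscaler adjusting pg_num?
ceph balancer status             # is the balancer active and moving PGs?
ceph -s                          # shows ongoing backfill/recovery activity
ceph versions                    # confirm the exact ceph version in use
```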

@Eugen: Was there not an option to limit logging to the MON store?

For the information of readers: we followed old recommendations from a Dell white paper for building a Ceph cluster and have a 1TB RAID10 array on 6 write-intensive SSDs for the MON stores. After 5 years we are below 10% wear. The average size of the MON store on a healthy cluster is 500M-1G, but we have seen it balloon to 100+GB under degraded conditions.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Zakhar Kirpichenko <zakhar@xxxxxxxxx>
Sent: Wednesday, October 11, 2023 12:00 PM
To: Eugen Block
Cc: ceph-users@xxxxxxx
Subject:  Re: Ceph 16.2.x mon compactions, disk writes

Thank you, Eugen.

I'm specifically interested in finding out whether the huge amount of data
written by the monitors is expected. It is eating through the endurance of our
system drives, which were not specced for high DWPD/TBW, as this is not a
documented requirement, and the monitors produce hundreds of gigabytes of
writes per day. I am looking for ways to reduce the amount of writes, if
possible.
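For context, a rough endurance estimate (the 600 TBW rating and 300 GB/day below are illustrative numbers, not our exact drives):

```shell
# Rough SSD lifetime in years: rated terabytes-written (TBW) divided by
# daily write volume. Illustrative figures only.
tbw=600
gb_per_day=300
days=$((tbw * 1000 / gb_per_day))
echo "~$((days / 365)) years of endurance at ${gb_per_day} GB/day"
```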

/Z

On Wed, 11 Oct 2023 at 12:41, Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> what you report is the expected behaviour, at least I see the same on
> all clusters. I can't answer why the compaction is required that
> often, but you can control the log level of the rocksdb output:
>
> ceph config set mon debug_rocksdb 1/5 (default is 4/5)
>
> This reduces the log entries and you wouldn't see the manual
> compaction logs anymore. There are a couple more rocksdb options but I
> probably wouldn't change too much, only if you know what you're doing.
> Maybe Igor can comment if some other tuning makes sense here.
>
> Regards,
> Eugen
>
> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
> > Any input from anyone, please?
> >
> > On Tue, 10 Oct 2023 at 09:44, Zakhar Kirpichenko <zakhar@xxxxxxxxx>
> wrote:
> >
> >> Any input from anyone, please?
> >>
> >> It's another thing that seems to be rather poorly documented: it's
> unclear
> >> what to expect, what 'normal' behavior should be, and what can be done
> >> about the huge amount of writes by monitors.
> >>
> >> /Z
> >>
> >> On Mon, 9 Oct 2023 at 12:40, Zakhar Kirpichenko <zakhar@xxxxxxxxx>
> wrote:
> >>
> >>> Hi,
> >>>
> >>> Monitors in our 16.2.14 cluster appear to quite often run "manual
> >>> compaction" tasks:
> >>>
> >>> debug 2023-10-09T09:30:53.888+0000 7f48a329a700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1696843853892760, "job": 64225, "event": "flush_started", "num_memtables": 1, "num_entries": 715, "num_deletes": 251, "total_data_size": 3870352, "memory_usage": 3886744, "flush_reason": "Manual Compaction"}
> >>> debug 2023-10-09T09:30:53.904+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:30:53.908+0000 7f48a3a9b700  4 rocksdb: (Original Log Time 2023/10/09-09:30:53.910204) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-0 to level-5 from 'paxos .. 'paxos; will stop at (end)
> >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:30:53.908+0000 7f48a3a9b700  4 rocksdb: (Original Log Time 2023/10/09-09:30:53.911004) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-5 to level-6 from 'paxos .. 'paxos; will stop at (end)
> >>> debug 2023-10-09T09:32:08.956+0000 7f48a329a700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1696843928961390, "job": 64228, "event": "flush_started", "num_memtables": 1, "num_entries": 1580, "num_deletes": 502, "total_data_size": 8404605, "memory_usage": 8465840, "flush_reason": "Manual Compaction"}
> >>> debug 2023-10-09T09:32:08.972+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:32:08.976+0000 7f48a3a9b700  4 rocksdb: (Original Log Time 2023/10/09-09:32:08.977739) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-0 to level-5 from 'logm .. 'logm; will stop at (end)
> >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:32:08.976+0000 7f48a3a9b700  4 rocksdb: (Original Log Time 2023/10/09-09:32:08.978512) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-5 to level-6 from 'logm .. 'logm; will stop at (end)
> >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:33:29.028+0000 7f48a329a700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1696844009033151, "job": 64231, "event": "flush_started", "num_memtables": 1, "num_entries": 1430, "num_deletes": 251, "total_data_size": 8975535, "memory_usage": 9035920, "flush_reason": "Manual Compaction"}
> >>> debug 2023-10-09T09:33:29.044+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:33:29.048+0000 7f48a3a9b700  4 rocksdb: (Original Log Time 2023/10/09-09:33:29.049585) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-0 to level-5 from 'paxos .. 'paxos; will stop at (end)
> >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >>> debug 2023-10-09T09:33:29.048+0000 7f48a3a9b700  4 rocksdb: (Original Log Time 2023/10/09-09:33:29.050355) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-5 to level-6 from 'paxos .. 'paxos; will stop at (end)
> >>>
> >>> I have removed a lot of interim log messages to save space.
> >>>
> >>> During each compaction the monitor process writes approximately 500-600
> >>> MB of data to disk over a short period of time. These writes add up to
> >>> tens of gigabytes per hour and hundreds of gigabytes per day.
> >>>
> >>> Monitor rocksdb and compaction options are default:
> >>>
> >>>     "mon_compact_on_bootstrap": "false",
> >>>     "mon_compact_on_start": "false",
> >>>     "mon_compact_on_trim": "true",
> >>>     "mon_rocksdb_options": "write_buffer_size=33554432,compression=kNoCompression,level_compaction_dynamic_level_bytes=true",
> >>>
> >>> Is this expected behavior? Is this something I can adjust in order to
> >>> extend the system storage life?
> >>>
> >>> Best regards,
> >>> Zakhar
> >>>
> >>
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



