Re: Ceph 16.2.x mon compactions, disk writes

Hi Zakhar,

I took a closer look into what the MONs really do (again with Mykola's help) and why manual compaction is triggered so frequently. With debug_paxos=20 I noticed that paxosservice and paxos triggered manual compactions. So I played with these values:

paxos_service_trim_max = 1000 (default 500)
paxos_service_trim_min = 500 (default 250)
paxos_trim_max = 1000 (default 500)
paxos_trim_min = 500 (default 250)

This reduced the amount of writes by a factor of 3 or 4; the iotop values fluctuate a bit, of course. As Mykola suggested, I created a tracker issue [1] to increase the default values, since the current ones don't seem suitable for a production environment. Although I haven't tested this in production yet, I'll ask one of our customers to try it in their secondary cluster (used for rbd mirroring), where they also suffer from large mon stores and heavy writes to the mon store. Your findings on compaction were quite helpful too, and we'll test those as well. Igor mentioned that the default bluestore_rocksdb config for OSDs will enable compression because of positive test results; if we can confirm that compression works well for MONs too, it could be enabled for them by default as well.
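
For anyone who wants to try the same values: they can be applied with
"ceph config set", for example (a sketch; verify the result afterwards
with "ceph config show mon.<id>"):

ceph config set mon paxos_service_trim_max 1000
ceph config set mon paxos_service_trim_min 500
ceph config set mon paxos_trim_max 1000
ceph config set mon paxos_trim_min 500

And to see the paxos trim/compaction messages yourself, debug_paxos can be
raised temporarily with something like "ceph tell mon.<id> config set
debug_paxos 20" and lowered again afterwards.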

Regards,
Eugen

[1] https://tracker.ceph.com/issues/63229

Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:

With the help of community members, I managed to enable RocksDB compression
for a test monitor, and it seems to be working well.
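
For anyone who wants to try the same thing: one way to do it (a sketch,
not necessarily the exact steps used here) is to take the monitor's
RocksDB option string, replace compression=kNoCompression with
compression=kLZ4Compression, and restart the mon so RocksDB is reopened
with the new options. Roughly:

# check the current/default option string on your release first
ceph config help mon_rocksdb_options

# illustrative value only - keep the rest of your default string intact
ceph config set mon mon_rocksdb_options \
  "write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true"

# restart the test monitor, e.g. with cephadm:
ceph orch daemon restart mon.ceph05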

Monitor w/o compression writes about 750 MB to disk in 5 minutes:

   4854 be/4 167           4.97 M    755.02 M  0.00 %  0.24 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]

Monitor with LZ4 compression writes about 1/4 of that over the same time
period:

2034728 be/4 167         172.00 K    199.27 M  0.00 %  0.06 % ceph-mon -n
mon.ceph05 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
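
(The figures above are iotop output in accumulated mode; comparable
numbers can be gathered with something like "iotop -a -o" left running
for the same 5-minute period, assuming iotop is available on the host.)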

The difference in writes is caused by the apparent difference in store.db sizes.

Mon store.db w/o compression:

# ls -al
/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db
total 257196
drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
drwx------ 3 167 167     4096 Aug 31 05:22 ..
-rw-r--r-- 1 167 167  1517623 Oct 16 14:00 3073035.log
-rw-r--r-- 1 167 167 67285944 Oct 16 14:00 3073037.sst
-rw-r--r-- 1 167 167 67402325 Oct 16 14:00 3073038.sst
-rw-r--r-- 1 167 167 62364991 Oct 16 14:00 3073039.sst

Mon store.db with compression:

# ls -al
/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db
total 91188
drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
drwx------ 3 167 167     4096 Oct 16 13:35 ..
-rw-r--r-- 1 167 167  1760114 Oct 16 14:00 012693.log
-rw-r--r-- 1 167 167 52236087 Oct 16 14:00 012695.sst

There are no apparent downsides thus far. If everything works well, I will
try adding compression to other monitors.

/Z

On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:

The issue persists, although to a lesser extent. Any comments from the
Ceph team please?

/Z

On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:

> Some of it is transferable to RocksDB on mons nonetheless.

Please point me to the relevant Ceph documentation, i.e. a description of
how the various Ceph monitor and RocksDB tunables affect the operation of
monitors, and I'll gladly look into it.

> Please point me to such recommendations, if they're on docs.ceph.com I'll
> get them updated.

These are the recommendations we used when we built our Pacific cluster:
https://docs.ceph.com/en/pacific/start/hardware-recommendations/

Our drives are 4x larger than recommended by this guide. The drives are
rated for < 0.5 DWPD, which is more than sufficient for boot drives and
storage of rarely modified files. It is not documented or suggested
anywhere that monitor processes write several hundred gigabytes of data per
day, exceeding the amount of data written by OSDs, which is why I am not
convinced that what we're observing is expected behavior. Unfortunately,
it's not easy to get a definitive answer from the Ceph community.
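
Just to illustrate the scale with a purely hypothetical drive size: at
0.5 DWPD, a 480 GB boot drive is rated for roughly 240 GB of writes per
day, so several hundred gigabytes per day from the monitor store alone
already reaches or exceeds that endurance budget.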

/Z

On Fri, 13 Oct 2023 at 20:35, Anthony D'Atri <anthony.datri@xxxxxxxxx>
wrote:

Some of it is transferable to RocksDB on mons nonetheless.

but their specs exceed Ceph hardware recommendations by a good margin


Please point me to such recommendations, if they're on docs.ceph.com I'll
get them updated.

On Oct 13, 2023, at 13:34, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:

Thank you, Anthony. As I explained to you earlier, the article you had
sent is about RocksDB tuning for Bluestore OSDs, while the issue at hand is
not with OSDs but rather with monitors and their RocksDB store. Indeed, the
drives are not enterprise-grade, but their specs exceed Ceph hardware
recommendations by a good margin, they're used as boot drives only, and
they aren't supposed to be written to continuously at high rates - which is
what unfortunately is happening. I am trying to determine why this is
happening and how the issue can be alleviated or resolved; unfortunately,
monitor RocksDB usage and tunables appear not to be documented at all.

/Z

On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri <anthony.datri@xxxxxxxxx>
wrote:

cf. Mark's article I sent you re RocksDB tuning.  I suspect that with
Reef you would experience fewer writes.  Universal compaction might also
help, but in the end this SSD is a client SKU and really not suited for
enterprise use. If you had the 1TB SKU you'd get much longer life, or you
could change the overprovisioning on the ones you have.
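
(If someone wants to experiment with universal compaction, it can in
principle be requested through the monitor's RocksDB option string
(mon_rocksdb_options), e.g. by adding
compaction_style=kCompactionStyleUniversal to it - but treat that as an
untested experiment rather than a recommendation.)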

On Oct 13, 2023, at 12:30, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:

I would very much appreciate it if someone with a better understanding of
monitor internals and use of RocksDB could please chip in.




_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




