Re: Ceph 16.2.x mon compactions, disk writes

With the help of community members, I managed to enable RocksDB compression
for a test monitor, and it seems to be working well.
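
In case anyone wants to try the same, here is a rough sketch of one way to do
it. The option string below is the Pacific default for mon_rocksdb_options
with compression switched from kNoCompression to kLZ4Compression; treat it as
an assumption and adjust it for your environment. Since mon_rocksdb_options
is read when the monitor opens its store, I'd put it into the mon's local
ceph.conf (inside the container for cephadm deployments) rather than rely on
the centralized config database, and then restart the daemon:

[mon]
        mon_rocksdb_options = write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true

# ceph orch daemon restart mon.ceph05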

Monitor w/o compression writes about 750 MB to disk in 5 minutes:

   4854 be/4 167           4.97 M    755.02 M  0.00 %  0.24 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]

Monitor with LZ4 compression writes about 1/4 of that over the same time
period:

2034728 be/4 167         172.00 K    199.27 M  0.00 %  0.06 % ceph-mon -n
mon.ceph05 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
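
For reference, the figures above are cumulative per-thread disk writes. A
sketch of one way to collect similar numbers, assuming iotop is available, is
to run it in accumulating batch mode over a 5-minute window and filter for
the RocksDB compaction threads:

# iotop -a -o -b -d 300 -n 2 | grep 'rocksdb:low'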

This appears to be explained by the difference in store.db sizes.

Mon store.db w/o compression:

# ls -al
/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db
total 257196
drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
drwx------ 3 167 167     4096 Aug 31 05:22 ..
-rw-r--r-- 1 167 167  1517623 Oct 16 14:00 3073035.log
-rw-r--r-- 1 167 167 67285944 Oct 16 14:00 3073037.sst
-rw-r--r-- 1 167 167 67402325 Oct 16 14:00 3073038.sst
-rw-r--r-- 1 167 167 62364991 Oct 16 14:00 3073039.sst

Mon store.db with compression:

# ls -al
/var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db
total 91188
drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
drwx------ 3 167 167     4096 Oct 16 13:35 ..
-rw-r--r-- 1 167 167  1760114 Oct 16 14:00 012693.log
-rw-r--r-- 1 167 167 52236087 Oct 16 14:00 012695.sst

There are no apparent downsides thus far. If everything continues to work
well, I will try adding compression to the other monitors.
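
For anyone wanting to verify that new SSTs are actually written compressed:
the options RocksDB persists alongside the store should reflect the active
setting, so a quick sanity check (assuming RocksDB keeps its usual OPTIONS-*
file in store.db) could be:

# grep compression \
  /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db/OPTIONS-*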

/Z

On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:

> The issue persists, although to a lesser extent. Any comments from the
> Ceph team please?
>
> /Z
>
> On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
>
>> > Some of it is transferable to RocksDB on mons nonetheless.
>>
>> Please point me to the relevant Ceph documentation, i.e. a description of
>> how the various Ceph monitor and RocksDB tunables affect the operation of
>> monitors, and I'll gladly look into it.
>>
>> > Please point me to such recommendations, if they're on docs.ceph.com I'll
>> get them updated.
>>
>> These are the recommendations we used when we built our Pacific cluster:
>> https://docs.ceph.com/en/pacific/start/hardware-recommendations/
>>
>> Our drives are 4x larger than recommended by this guide. The drives are
>> rated for < 0.5 DWPD, which is more than sufficient for boot drives and
>> storage of rarely modified files. It is not documented or suggested
>> anywhere that monitor processes write several hundred gigabytes of data
>> per day, exceeding the amount of data written by OSDs. That is why I am
>> not convinced that what we're observing is expected behavior, but it's
>> not easy to get a definitive answer from the Ceph community.
>>
>> /Z
>>
>> On Fri, 13 Oct 2023 at 20:35, Anthony D'Atri <anthony.datri@xxxxxxxxx>
>> wrote:
>>
>>> Some of it is transferable to RocksDB on mons nonetheless.
>>>
>>> but their specs exceed Ceph hardware recommendations by a good margin
>>>
>>>
>>> Please point me to such recommendations, if they're on docs.ceph.com I'll
>>> get them updated.
>>>
>>> On Oct 13, 2023, at 13:34, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
>>>
>>> Thank you, Anthony. As I explained earlier, the article you sent is about
>>> RocksDB tuning for Bluestore OSDs, while the issue at hand is not with
>>> OSDs but rather with monitors and their RocksDB store. Indeed, the drives
>>> are not enterprise-grade, but their specs exceed the Ceph hardware
>>> recommendations by a good margin; they're used as boot drives only and
>>> aren't supposed to be written to continuously at high rates, which is
>>> unfortunately what is happening. I am trying to determine why it is
>>> happening and how the issue can be alleviated or resolved; unfortunately,
>>> monitor RocksDB usage and tunables appear to be undocumented.
>>>
>>> /Z
>>>
>>> On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri <anthony.datri@xxxxxxxxx>
>>> wrote:
>>>
>>>> cf. Mark's article I sent you re RocksDB tuning.  I suspect that with
>>>> Reef you would experience fewer writes.  Universal compaction might also
>>>> help, but in the end this SSD is a client SKU and really not suited for
>>>> enterprise use.  If you had the 1TB SKU you'd get much longer life, or you
>>>> could change the overprovisioning on the ones you have.
>>>>
>>>> On Oct 13, 2023, at 12:30, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
>>>>
>>>> I would very much appreciate it if someone with a better understanding
>>>> of
>>>> monitor internals and use of RocksDB could please chip in.
>>>>
>>>>
>>>>
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


