Hi Michael,
thanks for the explanation! So if I understand correctly, we waste 93 GB
of NVMe space per OSD, because only 30 GB of the 123 GB is actually used?
And to give rocksdb more room, we would need to plan for 300 GB per
rocksdb partition so that the next level also fits on the fast device.
Reducing the number of small files is something we always ask of our
users, but reality is what it is ;-)
I'll have to look into how I can get an informative view of these
metrics. The amount of information coming out of the ceph cluster is
pretty overwhelming, even when you only look at it superficially.
Cheers,
/Simon
On 20/08/2020 10:16, Michael Bisig wrote:
Hi Simon
As far as I know, RocksDB only uses "leveled" space on the NVMe partition. The level sizes are 300MB, 3GB, 30GB and 300GB. Any DB data beyond the largest level that fits on the partition automatically ends up on the slow device.
In your setup, with 123GB per OSD, that means you only use 30GB of the fast device. The part of the DB that spills over this limit is offloaded to the HDD, which slows down requests and compactions.
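The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the level limits are the values quoted in this mail (taken as an assumption, not read from any Ceph configuration), and the function names are made up for the example.

```python
# Level limits in GB as quoted in this thread; treated as an assumption.
LEVEL_LIMITS_GB = [0.3, 3, 30, 300]


def usable_db_space_gb(partition_gb, limits=LEVEL_LIMITS_GB):
    """Return the largest level limit that still fits in the partition."""
    fitting = [limit for limit in limits if limit <= partition_gb]
    return max(fitting) if fitting else 0


def unused_db_space_gb(partition_gb, limits=LEVEL_LIMITS_GB):
    """Return the part of the partition that no level can make use of."""
    return partition_gb - usable_db_space_gb(partition_gb, limits)


# The 123GB logical volume discussed in this thread.
print(usable_db_space_gb(123))  # 30
print(unused_db_space_gb(123))  # 93
```

Under the same assumption, a 300GB partition would be used in full, which is why the next sensible size above 30GB is 300GB.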
You can check what your OSD currently consumes with:
ceph daemon osd.X perf dump
Informative values are `db_total_bytes`, `db_used_bytes` and `slow_used_bytes`. These change regularly because of ongoing compactions, but the Prometheus mgr module exports them so that you can track them over time.
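As a rough sketch of how these three counters could be turned into a quick spillover check, the snippet below parses the JSON printed by the perf dump command. It assumes the counters sit under a top-level "bluefs" section of that JSON; the function name and the sample numbers are hypothetical, so please verify against your own output.

```python
import json


def summarize_bluefs(perf_dump_json):
    """Summarize DB usage from the text output of a perf dump.

    Assumption: the counters live under a top-level "bluefs" key.
    """
    bluefs = json.loads(perf_dump_json)["bluefs"]
    db_total = bluefs["db_total_bytes"]
    db_used = bluefs["db_used_bytes"]
    slow_used = bluefs["slow_used_bytes"]
    return {
        "db_used_fraction": db_used / db_total if db_total else 0.0,
        "slow_used_bytes": slow_used,
        # Any DB bytes on the slow device mean the DB has spilled over.
        "spillover": slow_used > 0,
    }


# Made-up sample values, only to show the shape of the input.
sample = json.dumps({
    "bluefs": {
        "db_total_bytes": 1000,
        "db_used_bytes": 250,
        "slow_used_bytes": 50,
    }
})
print(summarize_bluefs(sample))
```

In practice the input would be the saved output of the command above, for example read from a file with `open(path).read()`.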
Small files generally lead to a bigger RocksDB, especially when you use EC, but this depends on the actual number of files and their sizes.
I hope this helps.
Regards,
Michael
On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@xxxxxxxxxxxxx> wrote:
Hi
Recently our ceph cluster (nautilus) has been experiencing bluefs
spillovers on just 2 OSDs, and I disabled the warning for these OSDs.
(ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
I'm wondering what causes this and how this can be prevented.
As I understand it the rocksdb for the OSD needs to store more than fits
on the NVME logical volume (123G for 12T OSD). A way to fix it could be
to increase the logical volume on the nvme (if there was space on the
nvme, which there isn't at the moment).
This is the current size of the cluster and how much is free:
[root@cephmon1 ~]# ceph df
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 1.8 PiB 842 TiB 974 TiB 974 TiB 53.63
TOTAL 1.8 PiB 842 TiB 974 TiB 974 TiB 53.63
POOLS:
POOL                  ID   STORED    OBJECTS   USED      %USED   MAX AVAIL
cephfs_data           1    572 MiB   121.26M   2.4 GiB   0       167 TiB
cephfs_metadata       2    56 GiB    5.15M     57 GiB    0       167 TiB
cephfs_data_3copy     8    201 GiB   51.68k    602 GiB   0.09    222 TiB
cephfs_data_ec83      13   643 TiB   279.75M   953 TiB   58.86   485 TiB
rbd                   14   21 GiB    5.66k     64 GiB    0       222 TiB
.rgw.root             15   1.2 KiB   4         1 MiB     0       167 TiB
default.rgw.control   16   0 B       8         0 B       0       167 TiB
default.rgw.meta      17   765 B     4         1 MiB     0       167 TiB
default.rgw.log       18   0 B       207       0 B       0       167 TiB
cephfs_data_ec57      20   433 MiB   230       1.2 GiB   0       278 TiB
The amount used can still grow a bit before we need to add nodes, but
apparently we are running into the limits of our rocksdb partitions.
Did we choose a parameter (e.g. minimal object size) too small, so we
have too many objects on these spillover OSDs? Or is it that too many
small files are stored on the cephfs filesystems?
When we expand the cluster, we can choose larger nvme devices to allow
larger rocksdb partitions, but is that the right way to deal with this,
or should we adjust some parameters on the cluster that will reduce the
rocksdb size?
Cheers
/Simon
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx