Hi Seena,
this parameter isn't intended to be adjusted in production environments;
the default behavior is supposed to cover all regular customers' needs.
The issue, though, is that the default setting is invalid. It should be
'use_some_extra'. I'm going to fix that shortly...
Thanks,
Igor
On 8/20/2020 1:44 PM, Seena Fallah wrote:
Hi Igor.
Could you please tell me why this config is in LEVEL_DEV
(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
According to the Ceph documentation, we can't use LEVEL_DEV options in
production environments!
Thanks
On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
Hi Simon,
starting with Nautilus v14.2.10, BlueStore is able to use 'wasted' space
on the DB volume.
See this PR: https://github.com/ceph/ceph/pull/29687
A nice overview of the overall BlueFS/RocksDB design can be found here:
https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
It also includes some overview (as well as additional concerns) of the
changes brought by the above-mentioned PR.
Thanks,
Igor
On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> Hi Michael,
>
> thanks for the explanation! So if I understand correctly, we waste 93
> GB per OSD on unused NVME space, because only 30GB is actually used...?
>
> And to improve the space for rocksdb, we need to plan for 300GB per
> rocksdb partition in order to benefit from this advantage....
>
> Reducing the number of small files is something we always ask of our
> users, but reality is what it is ;-)
>
> I'll have to look into how I can get an informative view on these
> metrics... The amount of information coming out of the ceph cluster is
> pretty overwhelming, even when you look only superficially...
>
> Cheers,
>
> /Simon
>
> On 20/08/2020 10:16, Michael Bisig wrote:
>> Hi Simon
>>
>> As far as I know, RocksDB only uses "leveled" space on the NVME
>> partition. The level limits are set to 300MB, 3GB, 30GB and 300GB. Any
>> DB space above such a limit will automatically end up on the slow
>> device. In your setup, where you have 123GB per OSD, that means you only
>> use 30GB of the fast device. The part of the DB that spills over this
>> limit is offloaded to the HDD and, accordingly, slows down requests and
>> compactions.
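The sizing rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this thread (the largest level limit that fits in the partition is what gets used), not how RocksDB or BlueFS compute it internally. The function names are hypothetical, and the sketch ignores the WAL and the change in v14.2.10 mentioned elsewhere in this thread.

```python
# Level limits as quoted in this thread, in GB (0.3 GB = 300 MB).
LEVEL_LIMITS_GB = [0.3, 3, 30, 300]


def usable_db_space_gb(partition_gb):
    """Return the largest level limit that fits in the partition, or 0 if none fits."""
    fitting = [limit for limit in LEVEL_LIMITS_GB if limit <= partition_gb]
    return max(fitting) if fitting else 0


def unused_db_space_gb(partition_gb):
    """Return the part of the partition that stays unused under this rule."""
    return partition_gb - usable_db_space_gb(partition_gb)


if __name__ == "__main__":
    # 123 GB is the partition size from this thread; 300 GB is the next limit.
    for size_gb in (123, 300):
        print(size_gb, usable_db_space_gb(size_gb), unused_db_space_gb(size_gb))
```

For the 123GB partition discussed here, the sketch gives 30GB used and 93GB unused; a 300GB partition would be used in full under the same rule.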
>>
>> You can check what your OSD currently consumes with:
>> ceph daemon osd.X perf dump
>>
>> Informative values are `db_total_bytes`, `db_used_bytes` and
>> `slow_used_bytes`. These change regularly because of the ongoing
>> compactions, but the Prometheus mgr module exports these values so that
>> you can track them.
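A minimal Python sketch for pulling these three counters out of the perf dump, for anyone who wants a quick check without Prometheus. It assumes the command prints JSON and that the counters sit under a "bluefs" section of that JSON; verify this against the output of your own version. The helper names are hypothetical, and the command has to run on the host where the OSD's admin socket lives.

```python
import json
import subprocess

# Counter names as given in this thread.
FIELDS = ("db_total_bytes", "db_used_bytes", "slow_used_bytes")


def read_perf_dump(osd_id):
    """Run `ceph daemon osd.<id> perf dump` and parse its JSON output."""
    result = subprocess.run(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def bluefs_usage(perf_dump):
    """Extract the three counters from the (assumed) "bluefs" section."""
    bluefs = perf_dump.get("bluefs", {})
    usage = {name: int(bluefs.get(name, 0)) for name in FIELDS}
    # Any bytes on the slow device mean the DB has spilled over.
    usage["spilled_over"] = usage["slow_used_bytes"] > 0
    return usage
```

Calling bluefs_usage(read_perf_dump(125)) would then report the values for osd.125, with spilled_over set whenever slow_used_bytes is non-zero.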
>>
>> Small files generally lead to a bigger RocksDB, especially when you
>> use EC, but this depends on the actual number of files and their sizes.
>>
>> I hope this helps.
>> Regards,
>> Michael
>>
>> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@xxxxxxxxxxxxx> wrote:
>>
>> Hi
>>
>> Recently our ceph cluster (nautilus) has been experiencing bluefs
>> spillovers on just 2 OSDs, and I disabled the warning for these OSDs
>> (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false).
>>
>> I'm wondering what causes this and how this can be prevented.
>>
>> As I understand it, the rocksdb for the OSD needs to store more than
>> fits on the NVME logical volume (123G for a 12T OSD). A way to fix it
>> could be to increase the logical volume on the nvme (if there was space
>> on the nvme, which there isn't at the moment).
>>
>> This is the current size of the cluster and how much is free:
>>
>> [root@cephmon1 ~]# ceph df
>> RAW STORAGE:
>>     CLASS   SIZE      AVAIL     USED      RAW USED   %RAW USED
>>     hdd     1.8 PiB   842 TiB   974 TiB   974 TiB    53.63
>>     TOTAL   1.8 PiB   842 TiB   974 TiB   974 TiB    53.63
>>
>> POOLS:
>>     POOL                 ID  STORED    OBJECTS   USED      %USED   MAX AVAIL
>>     cephfs_data          1   572 MiB   121.26M   2.4 GiB   0       167 TiB
>>     cephfs_metadata      2   56 GiB    5.15M     57 GiB    0       167 TiB
>>     cephfs_data_3copy    8   201 GiB   51.68k    602 GiB   0.09    222 TiB
>>     cephfs_data_ec83     13  643 TiB   279.75M   953 TiB   58.86   485 TiB
>>     rbd                  14  21 GiB    5.66k     64 GiB    0       222 TiB
>>     .rgw.root            15  1.2 KiB   4         1 MiB     0       167 TiB
>>     default.rgw.control  16  0 B       8         0 B       0       167 TiB
>>     default.rgw.meta     17  765 B     4         1 MiB     0       167 TiB
>>     default.rgw.log      18  0 B       207       0 B       0       167 TiB
>>     cephfs_data_ec57     20  433 MiB   230       1.2 GiB   0       278 TiB
>>
>> The amount used can still grow a bit before we need to add nodes, but
>> apparently we are running into the limits of our rocksdb partitions.
>>
>> Did we choose a parameter (e.g. minimal object size) too small, so we
>> have too many objects on these spillover OSDs? Or is it that too many
>> small files are stored on the cephfs filesystems?
>>
>> When we expand the cluster, we can choose larger nvme devices to allow
>> larger rocksdb partitions, but is that the right way to deal with this,
>> or should we adjust some parameters on the cluster that will reduce the
>> rocksdb size?
>>
>> Cheers
>>
>> /Simon
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
<mailto:ceph-users@xxxxxxx>
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
<mailto:ceph-users-leave@xxxxxxx>
>>