Honestly I don't have any perfect solution for now.
If this is urgent, you are probably better off proceeding with enabling
the new DB space management feature.
But if you do, please proceed gradually: modify 1-2 OSDs at the first
stage and test them for some period (maybe a week or two).
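For example, a minimal sketch of such a staged rollout (assuming the
option added by the PR discussed below is bluestore_volume_selection_policy,
that the OSDs need a restart to pick it up, and that you could equally set
it in ceph.conf on those hosts):

    ceph config set osd.0 bluestore_volume_selection_policy use_some_extra
    ceph config set osd.1 bluestore_volume_selection_policy use_some_extra
    systemctl restart ceph-osd@0 ceph-osd@1

Then compare those two OSDs (spillover, latency, compaction behavior)
against the untouched ones before rolling it out further.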
Thanks,
Igor
On 8/20/2020 5:36 PM, Seena Fallah wrote:
So what do you suggest for a short-term solution? (I think you won't
backport it to Nautilus for at least about 6 months.)
Changing the DB size is too expensive because I would have to buy new
NVMe devices with double the size and also redeploy all my OSDs.
Manual compaction will still have an impact on performance, and doing
it for a month doesn't look very good!
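(By manual compaction I mean stopping each OSD and compacting its
RocksDB offline, roughly like the following - the OSD id here is just a
placeholder:

    systemctl stop ceph-osd@42
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-42 compact
    systemctl start ceph-osd@42

and repeating that for every OSD in the cluster.)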
On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
Correct.
On 8/20/2020 5:15 PM, Seena Fallah wrote:
So you won't backport it to Nautilus until it has been the default in
master for a while?
On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
From a technical/developer's point of view I don't see any
issues with tuning this option. But as of now I wouldn't
recommend enabling it in production, as it partially bypassed
our regular development cycle. Being enabled in master by
default for a while allows more developers to use/try the
feature before release, which can be considered an additional,
implicit QA process. But as we just discovered, this hasn't
happened.
Hence you can definitely try it, but this exposes your
cluster(s) to some risk, as with any new (and incompletely
tested) feature....
Thanks,
Igor
On 8/20/2020 4:06 PM, Seena Fallah wrote:
Great, thanks.
Is it safe to change it manually in ceph.conf now, or should I wait
for the next Nautilus release? I mean, has QA been run with this value
for this config so that we can trust it and change it, or should we
wait until the next Nautilus release, where QA will have run on this
value?
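(To be concrete, by changing it manually I mean something like the
following in ceph.conf on the OSD hosts - if I read the PR right, the
option is bluestore_volume_selection_policy:

    [osd]
    bluestore_volume_selection_policy = use_some_extra

)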
On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
Hi Seena,
this parameter isn't intended to be adjusted in
production environments - it's assumed that the default
behavior covers all regular customers' needs.
The issue, though, is that the default setting is invalid: it
should be 'use_some_extra'. Gonna fix that shortly...
Thanks,
Igor
On 8/20/2020 1:44 PM, Seena Fallah wrote:
Hi Igor.
Could you please tell me why this config is at LEVEL_DEV
(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
As documented in Ceph, we can't use LEVEL_DEV options in
production environments!
Thanks
On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
Hi Simon,
starting with Nautilus v14.2.10, BlueStore is able to use this
'wasted' space at the DB volume.
see this PR: https://github.com/ceph/ceph/pull/29687
A nice overview of the overall BlueFS/RocksDB design can be found here:
https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
It also includes some overview of (as well as additional concerns
about) the changes brought by the above-mentioned PR.
Thanks,
Igor
On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> Hi Michael,
>
> thanks for the explanation! So if I understand correctly, we waste 93
> GB per OSD of unused NVMe space, because only 30 GB is actually used...?
>
> And to improve the space for rocksdb, we need to plan for 300 GB per
> rocksdb partition in order to benefit from this advantage....
>
> Reducing the number of small files is something we always ask of our
> users, but reality is what it is ;-)
>
> I'll have to look into how I can get an informative view on these
> metrics... The amount of information coming out of the ceph cluster is
> pretty overwhelming, even when you look only superficially...
>
> Cheers,
>
> /Simon
>
> On 20/08/2020 10:16, Michael Bisig wrote:
>> Hi Simon
>>
>> As far as I know, RocksDB only uses "leveled" space on the NVMe
>> partition. The level sizes are set to 300MB, 3GB, 30GB and 300GB.
>> Any DB space beyond the levels that fit will automatically end up on
>> the slow device. In your setup, where you have 123GB per OSD, that
>> means you only use 30GB of the fast device. The DB which spills over
>> this limit will be offloaded to the HDD and, accordingly, that slows
>> down requests and compactions.
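>> (As a rough sketch of that logic - simplified, the real decision is
>> made inside BlueFS/RocksDB - the cumulative level sizes that still fit
>> on a 123GB (123000 MB) partition:
>>
>>     total=0; for lvl in 300 3000 30000 300000; do
>>         [ $((total + lvl)) -le 123000 ] && total=$((total + lvl))
>>     done; echo "${total} MB usable on the fast device"   # prints 33300 MB
>>
>> i.e. only the 30GB level plus the two smaller ones, ~33GB, actually
>> lives on the NVMe; everything beyond that spills to the HDD.)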
>>
>> You can check what your OSD currently consumes with:
>> ceph daemon osd.X perf dump
>>
>> Informative values are `db_total_bytes`, `db_used_bytes` and
>> `slow_used_bytes`. These change regularly because of the ongoing
>> compactions, but the Prometheus mgr module exports these values so
>> that you can track them.
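>> (If you just want those three counters, something like this works,
>> assuming jq is installed and that the counters sit in the "bluefs"
>> section of the perf dump:
>>
>>     ceph daemon osd.X perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'
>>
>> which prints only the DB and slow-device usage for that OSD.)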
>>
>> Small files generally lead to a bigger RocksDB, especially when you
>> use EC, but this depends on the actual number of files and their sizes.
>>
>> I hope this helps.
>> Regards,
>> Michael
>>
>> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@xxxxxxxxxxxxx> wrote:
>>
>> Hi
>>
>>     Recently our ceph cluster (nautilus) has been experiencing bluefs
>>     spillovers, on just 2 OSDs, and I disabled the warning for these OSDs.
>>     (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
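>>     (Before muting it, the affected OSDs are listed under the
>>     BLUEFS_SPILLOVER health check, e.g.:
>>
>>         ceph health detail | grep -A2 BLUEFS_SPILLOVER
>>
>>     which should show which OSDs are currently spilling over.)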
>>
>>     I'm wondering what causes this and how this can be prevented.
>>
>>     As I understand it, the rocksdb for the OSD needs to store more
>>     than fits on the NVMe logical volume (123G for a 12T OSD). A way to
>>     fix it could be to increase the logical volume on the nvme (if there
>>     were space on the nvme, which there isn't at the moment).
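>>     (If there were room, I assume the procedure would roughly be to
>>     grow the LV backing the DB device and then let BlueFS pick up the
>>     extra space with ceph-bluestore-tool - the VG/LV names below are
>>     made up, and the OSD has to be stopped for the expand step:
>>
>>         systemctl stop ceph-osd@125
>>         lvextend -L +50G /dev/vg-nvme/db-osd-125
>>         ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-125
>>         systemctl start ceph-osd@125
>>
>>     but as said, our NVMes have no free space at the moment.)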
>>
>>     This is the current size of the cluster and how much is free:
>>
>>     [root@cephmon1 ~]# ceph df
>>     RAW STORAGE:
>>         CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
>>         hdd       1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
>>         TOTAL     1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
>>
>>     POOLS:
>>         POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
>>         cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
>>         cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
>>         cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
>>         cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
>>         rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
>>         .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
>>         default.rgw.control     16         0 B           8         0 B         0       167 TiB
>>         default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
>>         default.rgw.log         18         0 B         207         0 B         0       167 TiB
>>         cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB
>>
>>     The amount used can still grow a bit before we need to add nodes,
>>     but apparently we are already running into the limits of our
>>     rocksdb partitions.
>>
>>     Did we choose a parameter (e.g. minimal object size) too small, so
>>     that we have too many objects on these spillover OSDs? Or is it
>>     that too many small files are stored on the cephfs filesystems?
>>
>>     When we expand the cluster, we can choose larger nvme devices to
>>     allow larger rocksdb partitions, but is that the right way to deal
>>     with this, or should we adjust some parameters on the cluster that
>>     will reduce the rocksdb size?
>>
>> Cheers
>>
>> /Simon
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx