Re: DB/WAL and RGW index on the same NVMe

Hi,

Yes, the documentation you are linking to is from an old Filestore-era
release (Red Hat Ceph Storage 2). With BlueStore this is no longer the case;
the latest Red Hat doc version is here:

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/7/html-single/object_gateway_guide/index#index-pool_rgw

I see they have this block of text there:

"For Red Hat Ceph Storage running Bluestore, Red Hat recommends deploying
an NVMe drive as a block.db device, rather than as a separate pool.
Ceph Object Gateway index data is written only into an object map (OMAP).
OMAP data for BlueStore resides on the block.db device on an OSD. When an
NVMe drive functions as a block.db device for an HDD OSD and when the index
pool is backed by HDD OSDs, the index data will ONLY be written to the
block.db device. As long as the block.db partition/lvm is sized properly at
4% of block, this configuration is all that is needed for BlueStore."
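
To put a rough number on that 4% guideline for your nodes (using the 7.4 TiB
HDD size from the ceph osd df output further down in this thread; your POC
drives may differ):

  7.4 TiB                ~ 7578 GiB per HDD OSD
  4% of 7578 GiB         ~ 303 GiB of block.db per OSD
  12 HDD OSDs per node   ~ 3.6 TiB of NVMe per node for block.db alone

So it is worth checking that the single NVMe card per node is actually large
enough before carving it up.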

On Mon, Apr 8, 2024 at 12:02 PM Lukasz Borek <lukasz@xxxxxxxxxxxx> wrote:

> Thanks for clarifying.
>
> So the Red Hat doc
> <https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html-single/ceph_object_gateway_for_production/index#adv-rgw-hardware-bucket-index>
> is outdated?
>
> 3.6. Selecting SSDs for Bucket Indexes
>
>
> When selecting OSD hardware for use with a Ceph Object
>> Gateway—irrespective of the use case—Red Hat recommends considering an OSD
>> node that has at least one SSD drive used exclusively for the bucket index
>> pool. This is particularly important when buckets will contain a large
>> number of objects.
>
>
> A bucket index entry is approximately 200 bytes of data, stored as an
>> object map (omap) in leveldb. While this is a trivial amount of data, some
>> uses of Ceph Object Gateway can result in tens or hundreds of millions of
>> objects in a single bucket. By mapping the bucket index pool to a CRUSH
>> hierarchy of SSD nodes, the reduced latency provides a dramatic performance
>> improvement when buckets contain very large numbers of objects.
>
>
>> Important
>> In a production cluster, a typical OSD node will have at least one SSD
>> for the bucket index, AND at least one SSD for the journal.
>
>
> Is the current utilisation what the ceph osd df command shows in the OMAP field?
>
> root@cephbackup:/# ceph osd df
>> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>>  0    hdd  7.39870   1.00000  7.4 TiB  894 GiB  769 GiB  1.5 MiB  3.4 GiB  6.5 TiB  11.80  1.45   40      up
>>  1    hdd  7.39870   1.00000  7.4 TiB  703 GiB  578 GiB  6.0 MiB  2.9 GiB  6.7 TiB   9.27  1.14   37      up
>>  2    hdd  7.39870   1.00000  7.4 TiB  700 GiB  576 GiB  3.1 MiB  3.1 GiB  6.7 TiB   9.24  1.13   39      up
>
>
>
>
>
> On Mon, 8 Apr 2024 at 08:42, Daniel Parkes <dparkes@xxxxxxxxxx> wrote:
>
>> Hi Lukasz,
>>
>> RGW uses OMAP objects for the index pool; the OMAP data is stored in the
>> RocksDB database of each OSD, not in the actual index pool. So by putting
>> DB/WAL on an NVMe as you mentioned, you are already keeping the index data
>> on a non-rotational drive; you don't need to do anything else.
>>
>> You just need to size your DB/WAL partition accordingly. For RGW/object
>> storage, a good starting point for the DB/WAL sizing is 4% of the block
>> device.
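
Since the plan is to deploy with cephadm, an OSD service spec roughly along
these lines should set that up automatically. This is only a sketch: the
service_id is arbitrary and the block_db_size value is illustrative (about 4%
of a 7.4 TiB HDD), so adjust it to your drive sizes.

service_type: osd
service_id: hdd-with-nvme-db        # arbitrary name
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1                   # HDDs become the data devices
  db_devices:
    rotational: 0                   # the NVMe is used for block.db
  block_db_size: '300G'             # illustrative, roughly 4% of 7.4 TiB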
>>
>> Example of OMAP entries in the index pool using 0 bytes, as they are
>> stored in RocksDB:
>>
>> # rados -p default.rgw.buckets.index listomapkeys .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
>> file1
>> file2
>> file4
>> file10
>>
>> rados df -p default.rgw.buckets.index
>> POOL_NAME                  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS       RD  WR_OPS      WR  USED COMPR  UNDER COMPR
>> default.rgw.buckets.index   0 B       11       0      33                   0        0         0     208  207 KiB      41  20 KiB         0 B          0 B
>>
>> # rados -p default.rgw.buckets.index stat .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
>> default.rgw.buckets.index/.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2 mtime 2022-12-20T07:32:11.000000-0500, size 0
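
If you want to double-check on a running OSD that this omap data really lands
in RocksDB on the NVMe block.db, something like the following should show the
BlueFS DB usage while the index pool itself stays at 0 B (osd.0 is just an
example id, and the counter names are from memory, so they may differ slightly
between releases):

# ceph daemon osd.0 perf dump | grep -E 'db_(total|used)_bytes'   # id and counter names illustrative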
>>
>>
>> On Sun, Apr 7, 2024 at 10:06 PM Lukasz Borek <lukasz@xxxxxxxxxxxx> wrote:
>>
>>> Hi!
>>>
>>> I'm working on a POC cluster setup dedicated to a backup app writing
>>> objects via S3 (large objects, up to 1 TB, transferred via multipart
>>> upload).
>>>
>>> The initial setup is 18 storage nodes (12 HDDs + 1 NVMe card for DB/WAL
>>> each) plus an EC pool. The plan is to use cephadm.
>>>
>>> I'd like to follow good practice and put the RGW index pool on a
>>> non-rotational drive. The question is how to do it:
>>>
>>>    - replace a few HDDs (1 per node) with SSDs (how many? 4-6-8?)
>>>    - reserve space on the NVMe drive in each node, create an LV-based OSD,
>>>      and let the RGW index pool use the same NVMe drive as DB/WAL
>>>
>>> Thoughts?
>>>
>>> --
>>> Lukasz
>
> --
> Łukasz Borek
> lukasz@xxxxxxxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



