Re: Question if WAL/block.db partition will benefit us

Сергей Процун <prosergey07@xxxxxxxxx> · Thu, 11 Nov 2021 00:32:58 +0200

The rgw.meta pool contains user, bucket, and bucket instance metadata.

rgw.bucket.index contains the bucket indexes, i.e. the shards. For example,
with 32 shards you will have 32 objects in that pool: .dir.BUCKET_ID.0
through .dir.BUCKET_ID.31. Each one lists part of your objects. Object
names are presumably hashed to pick the corresponding shard. The objects in
bucket.index are zero-sized; all of their data is OMAP stored in RocksDB.
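
To make the sharding idea above concrete, here is a minimal sketch. It is
not RGW's actual hash function: the use of CRC32 and the helper names
shard_for and index_object_name are assumptions for illustration. Only the
.dir.BUCKET_ID.N naming comes from the description above.

```python
import zlib


def shard_for(object_name: str, num_shards: int) -> int:
    """Pick an index shard by hashing the object name.

    CRC32 is an assumption for this sketch; RGW uses its own hash.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return zlib.crc32(object_name.encode("utf-8")) % num_shards


def index_object_name(bucket_id: str, shard: int) -> str:
    """Name of the rados object that holds one bucket index shard."""
    return f".dir.{bucket_id}.{shard}"


if __name__ == "__main__":
    # Hypothetical object names, just to show the mapping.
    for name in ("photos/a.jpg", "photos/b.jpg", "backup.tar"):
        shard = shard_for(name, 32)
        print(name, "->", index_object_name("BUCKET_ID", shard))
```

The point is only that the same name always lands in the same shard, so a
lookup has to read the OMAP of one index object rather than all 32.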

Once we have the object name, we can look it up in the bucket.data pool.
The object's name is prefixed with the marker id of the bucket. Each rgw
object in the bucket.data pool also has OMAP and xattr data, and that data
is in RocksDB as well, for example the rgw.manifest xattr, which contains
the manifest data. If an object is large (more than 4MB), it is stored as
multiple rados objects. That is where the shadow files come from (pieces
of one bigger object). So losing the DB device makes the OSD
non-operational, because bluestore uses the DB device to store omap and
xattr.
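
As a rough sketch of the splitting described above, assuming a plain
(non-multipart) upload and a 4 MiB piece size: the function names are
hypothetical, and the real naming of the shadow objects is not reproduced
here.

```python
from typing import List

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, the threshold mentioned above


def piece_sizes(size_bytes: int, chunk_size: int = CHUNK_SIZE) -> List[int]:
    """Sizes of the rados objects one rgw object is split into.

    The first entry stands for the head object, the rest for the
    shadow (tail) pieces. Assumes a simple, non-multipart upload.
    """
    if size_bytes < 0 or chunk_size <= 0:
        raise ValueError("size must be >= 0 and chunk size > 0")
    if size_bytes == 0:
        return [0]  # an empty object still has a head object
    full, rest = divmod(size_bytes, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def rados_object_count(size_bytes: int, chunk_size: int = CHUNK_SIZE) -> int:
    """How many rados objects back one rgw object of the given size."""
    return len(piece_sizes(size_bytes, chunk_size))


if __name__ == "__main__":
    ten_mib = 10 * 1024 * 1024
    print(rados_object_count(ten_mib), piece_sizes(ten_mib))
```

Every one of those pieces lives on some OSD whose xattr and omap sit in
RocksDB on the DB device, which is why the DB device cannot simply be
dropped.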

On Wed, 10 Nov 2021, 23:51, Boris <bb@xxxxxxxxx> wrote:

> Oh.
> How would one recover from that? Sounds like it basically makes no
> difference whether 2, 5 or 10 OSDs are in the blast radius.
>
> Can the omap key/values be regenerated?
> I always thought this data would be stored in the rgw pools. Or am I
> mixing things up, and the bluestore metadata has its own omap k/v, and
> then there is separately the omap k/v from the rgw objects?
>
>
> On 10.11.2021 at 22:37, Сергей Процун <prosergey07@xxxxxxxxx> wrote:
>
> 
> No, you cannot do that. The RocksDB holding the omap key/values and the
> WAL would be gone, meaning all xattr and omap data would be gone too.
> Hence the OSD would become non-operational. But if you notice that the SSD
> starts throwing errors, you can start migrating the bluefs device to a new
> partition:
>
> ceph-bluestore-tool bluefs-bdev-migrate --devs-source
> /var/lib/ceph/osd/ceph-OSD_ID/block --path /var/lib/ceph/osd/ceph-OSD_ID/
> --dev-target /path/to/new/db_device
>
> On Wed, 10 Nov 2021, 11:51, Boris Behrens <bb@xxxxxxxxx> wrote:
>
>> Hi,
>> we use enterprise SSDs like SAMSUNG MZ7KM1T9.
>> They work very well for our block storage. Some NVMe would be a lot nicer
>> but we have some good experience with them.
>>
>> One SSD failure taking down 10 OSDs might sound harsh, but this would be
>> an okayish risk. Most of the tunables are default in our setup and it
>> looks like PGs have a failure domain of host. I restart the systems on a
>> regular basis for kernel updates.
>> Also, disk IO on the SSDs, checked with dstat, seems to be rather low
>> (below 1k IOPS):
>> root@s3db18:~# dstat --disk --io  -T  -D sdd
>> --dsk/sdd-- ---io/sdd-- --epoch---
>>  read  writ| read  writ|  epoch
>>  214k 1656k|7.21   126 |1636536603
>>  144k 1176k|2.00   200 |1636536604
>>  128k 1400k|2.00   230 |1636536605
>>
>> Normally I would now try this configuration:
>> 1 SSD / 10 OSDs - having 150GB of block.db and block.wal, both on the
>> same partition as someone stated before, and 200GB extra to move all pools
>> except the .data pool to SSDs.
>>
>> But thinking about 10 downed OSDs if one SSD fails lets me wonder how to
>> recover from that.
>> IIRC the configuration per OSD is in the LVM tags:
>> root@s3db18:~# lvs -o lv_tags
>>   LV Tags
>>
>> ceph.block_device=...,ceph.db_device=/dev/sdd8,ceph.db_uuid=011275a3-4201-8840-a678-c2e23d38bfd6,...
>>
>> When the SSD fails, can I just remove the tags and restart the OSD with
>> ceph-volume lvm activate --all? And after replacing the failed SSD,
>> re-add the tags with the correct IDs? Do I need to do anything else to
>> prepare a block.db partition?
>>
>> Cheers
>>  Boris
>>
>>
>> On Tue, 9 Nov 2021 at 22:15, prosergey07 <prosergey07@xxxxxxxxx> wrote:
>>
>>> Not sure how much it would help performance to back the OSDs with SSD
>>> db and wal devices. Even if you go this route with one SSD per 10 HDDs,
>>> you might want to set the failure domain to host in the crush rules in
>>> case the SSD goes out of service.
>>>
>>>  In practice, an SSD will not boost performance much, especially when
>>> it is shared between 10 HDDs.
>>>
>>>  We use NVMe db+wal per OSD and separate NVMe specifically for the
>>> metadata pools. There will be a lot of I/O on the bucket.index pool and
>>> on the rgw pool that stores user and bucket metadata, so you might want
>>> to put them on separate fast storage.
>>>
>>>  Also, if there will not be too many objects, e.g. huge objects but not
>>> tens or hundreds of millions of them, then the bucket index will be under
>>> less pressure and an SSD might be okay for the metadata pools in that case.
>>>
>>>
>>>
>>> Sent from a Galaxy device
>>>
>>>
>>> -------- Original message --------
>>> From: Boris Behrens <bb@xxxxxxxxx>
>>> Date: 08.11.21 13:08 (GMT+02:00)
>>> To: ceph-users@xxxxxxx
>>> Subject:  Question if WAL/block.db partition will benefit us
>>>
>>> Hi,
>>> we run a larger octopus s3 cluster with only rotating disks.
>>> 1.3 PiB with 177 OSDs, some with a SSD block.db and some without.
>>>
>>> We have a ton of spare 2TB disks and we just wondered if we can put them
>>> to good use.
>>> For every 10 spinning disks we could add one 2TB SSD and we would create
>>> two partitions per OSD (130GB for block.db and 20GB for block.wal). This
>>> would leave some empty space on the SSD for wear leveling.
>>>
>>> The question now is: would we benefit from this? Most of the data that is
>>> written to the cluster is very large (50GB and above). Restructuring this
>>> cluster, and also two other clusters, would take a lot of work.
>>>
>>> And does it make a difference to have only a block.db partition or a
>>> block.db and a block.wal partition?
>>>
>>> Cheers
>>> Boris
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>
>>
>>