Re: Question if WAL/block.db partition will benefit us

No, you cannot do that. The RocksDB holding the omap key/values and the WAL
would be gone, meaning all xattrs and omap data would be lost too, so the OSD
would become non-operational. But if you notice that the SSD starts throwing
errors, you can migrate the BlueFS device to a new partition:

ceph-bluestore-tool bluefs-bdev-migrate \
    --path /var/lib/ceph/osd/ceph-OSD_ID/ \
    --devs-source /var/lib/ceph/osd/ceph-OSD_ID/block \
    --dev-target /path/to/new/db_device
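A hedged sketch of the surrounding steps for that migration, assuming a systemd-managed OSD; the OSD id is a placeholder, not a value from this thread:

```shell
# Placeholder values -- substitute your real OSD id and the new partition.
OSD_ID=12
OSD_PATH=/var/lib/ceph/osd/ceph-${OSD_ID}
NEW_DB_DEV=/path/to/new/db_device

# The OSD must be stopped while BlueFS data is migrated.
systemctl stop ceph-osd@${OSD_ID}

# Migrate BlueFS data to the new DB device.
ceph-bluestore-tool bluefs-bdev-migrate \
    --path ${OSD_PATH} \
    --devs-source ${OSD_PATH}/block \
    --dev-target ${NEW_DB_DEV}

# Bring the OSD back and verify it rejoins the cluster.
systemctl start ceph-osd@${OSD_ID}
ceph osd tree
```

This is an operational outline, not a tested runbook; check `ceph -s` before and after, and let recovery settle before touching the next OSD.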

On Wed, 10 Nov 2021 at 11:51, Boris Behrens <bb@xxxxxxxxx> wrote:

> Hi,
> we use enterprise SSDs like the SAMSUNG MZ7KM1T9.
> They work very well for our block storage. NVMe drives would be a lot
> nicer, but we have good experience with these.
>
> One SSD failure taking down 10 OSDs might sound harsh, but this would be an
> acceptable risk. Most tunables are at their defaults in our setup, so the
> PGs appear to have a failure domain of host. I restart the systems on a
> regular basis for kernel updates anyway.
> Also, disk I/O checked with dstat seems rather low on the SSDs (below
> 1k IOPS):
> root@s3db18:~# dstat --disk --io  -T  -D sdd
> --dsk/sdd-- ---io/sdd-- --epoch---
>  read  writ| read  writ|  epoch
>  214k 1656k|7.21   126 |1636536603
>  144k 1176k|2.00   200 |1636536604
>  128k 1400k|2.00   230 |1636536605
>
> Normally I would now try this configuration:
> 1 SSD / 10 OSDs, with 150GB for block.db and block.wal (both on the same
> partition, as someone suggested earlier), and 200GB extra to move all pools
> except the .data pool to SSDs.
>
> But the thought of 10 downed OSDs when one SSD fails makes me wonder how
> to recover from that.
> IIRC the per-OSD configuration is in the LVM tags:
> root@s3db18:~# lvs -o lv_tags
>   LV Tags
>
> ceph.block_device=...,ceph.db_device=/dev/sdd8,ceph.db_uuid=011275a3-4201-8840-a678-c2e23d38bfd6,...
>
> When the SSD fails, can I just remove the tags and restart the OSDs with
> ceph-volume lvm activate --all? And after replacing the failed SSD, re-add
> the tags with the correct IDs? Do I need to do anything else to prepare a
> block.db partition?
>
> Cheers
>  Boris
>
>
> On Tue, 9 Nov 2021 at 22:15, prosergey07 <prosergey07@xxxxxxxxx> wrote:
>
>> Not sure how much OSDs backed by SSD db and WAL devices would help
>> performance. Even if you go this route with one SSD per 10 HDDs, you
>> might want to set the failure domain to host in the CRUSH rules in case
>> an SSD goes out of service.
>>
>>  But in practice an SSD will not boost performance much, especially when
>> shared between 10 HDDs.
>>
>>  We use an NVMe db+wal per OSD and a separate NVMe specifically for the
>> metadata pools. There will be a lot of I/O on the bucket.index pool and
>> the RGW pool that stores user and bucket metadata, so you might want to
>> put them on separate fast storage.
>>
>>  Also, if there are not too many objects (say, huge objects rather than
>> tens or hundreds of millions of them), the bucket index will be under
>> less pressure, and SSDs might be okay for the metadata pools in that case.
>>
>>
>>
>> Sent from a Galaxy device
>>
>>
>> -------- Original message --------
>> From: Boris Behrens <bb@xxxxxxxxx>
>> Date: 08.11.21 13:08 (GMT+02:00)
>> To: ceph-users@xxxxxxx
>> Subject: Question if WAL/block.db partition will benefit us
>>
>> Hi,
>> we run a large Octopus S3 cluster with only rotating disks:
>> 1.3 PiB across 177 OSDs, some with an SSD block.db and some without.
>>
>> We have a ton of spare 2TB disks and wondered if we can put them to good
>> use.
>> For every 10 spinning disks we could add one 2TB SSD and create two
>> partitions per OSD (130GB for block.db and 20GB for block.wal). This
>> would leave some empty space on the SSD for wear leveling.
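A quick sanity check on that layout's arithmetic (a sketch; the nominal 2 TB capacity, counted in decimal gigabytes, is an assumption):

```python
# Check how much of a 2 TB SSD the proposed layout consumes:
# 10 OSDs, each with a 130 GB block.db and a 20 GB block.wal partition.
OSDS_PER_SSD = 10
DB_GB = 130
WAL_GB = 20
SSD_GB = 2000  # nominal 2 TB drive in decimal gigabytes (assumption)

used_gb = OSDS_PER_SSD * (DB_GB + WAL_GB)
spare_gb = SSD_GB - used_gb  # headroom left for wear leveling

print(f"used: {used_gb} GB, spare: {spare_gb} GB")  # used: 1500 GB, spare: 500 GB
```

So the plan leaves roughly a quarter of the drive unallocated, which is the wear-leveling headroom mentioned above.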
>>
>> The question now is: would we benefit from this? Most of the data written
>> to the cluster is very large (50GB and above). Restructuring this
>> cluster, and two other clusters as well, would take a lot of work.
>>
>> And does it make a difference to have only a block.db partition versus
>> both a block.db and a block.wal partition?
>>
>> Cheers
>> Boris
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>
>



