Re: Question if WAL/block.db partition will benefit us

Yeah. Wipe the disk, but do not remove it from the CRUSH map, as that would
trigger rebalancing. Then recreate the OSD and let it rejoin the cluster.
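
A rough sketch of that recovery, assuming OSD_ID and the device paths below
are placeholders for your actual values:

# mark the OSD destroyed but keep its CRUSH position and ID (no rebalancing)
ceph osd destroy OSD_ID --yes-i-really-mean-it
# wipe the data disk (and the old DB partition, if it is still reachable)
ceph-volume lvm zap /dev/DATA_DISK --destroy
# recreate the OSD with the same ID so it simply re-joins and backfills
ceph-volume lvm create --osd-id OSD_ID --data /dev/DATA_DISK --block.db /dev/NEW_DB_PART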

On Thu, 11 Nov 2021 at 11:05, Boris Behrens <bb@xxxxxxxxx> wrote:

> Now I finally know what kind of data is stored in RocksDB. I didn't find
> it in the documentation.
> This sounds like a horrible SPoF. How can you recover from it? Purge the
> OSD, wipe the disk and re-add it?
>
> An all-flash cluster is sadly not an option for our S3, as it is just too
> large and we just bought around 60x 8TB disks (in the last couple of
> months).
>
>
> On Wed, 10 Nov 2021 at 23:33, Сергей Процун <prosergey07@xxxxxxxxx> wrote:
>
>> The rgw.meta pool contains user, bucket, and bucket instance metadata.
>>
>> rgw.bucket.index contains the bucket indexes, aka shards. If you have 32
>> shards, you will have 32 objects in that pool (.dir.BUCKET_ID.0 through
>> .31), each listing part of your objects. Object names are hashed to the
>> corresponding shard. The objects in bucket.index are zero-sized; all of
>> their data is OMAP, stored in RocksDB.
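>>
>> As an illustration (pool name and bucket id below are placeholders for a
>> default zone), you can see this with rados: the index object itself is
>> zero bytes, and the listing is all OMAP:
>>
>> rados -p default.rgw.buckets.index stat .dir.BUCKET_ID.0
>> rados -p default.rgw.buckets.index listomapkeys .dir.BUCKET_ID.0 | head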
>>
>>  Once we have the object name, we can look it up in the bucket.data pool.
>> The object name is prefixed with the marker id of the bucket. Each rgw
>> object inside the bucket.data pool also has OMAP and xattr data, and that
>> data is also in RocksDB. For example, the rgw.manifest xattr contains the
>> manifest data: if an object is large (more than 4MB), it is stored as
>> multiple rados objects. That is where the shadow files come from (pieces
>> of one bigger object). So losing the DB device makes the OSD
>> non-operational, as BlueStore uses the DB device to store omap and xattr
>> data.
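>>
>> For example, the manifest xattr can be inspected on the head rados object
>> (pool and object names here are placeholders):
>>
>> rados -p default.rgw.buckets.data listxattr MARKER_OBJECT_NAME
>> rados -p default.rgw.buckets.data getxattr MARKER_OBJECT_NAME user.rgw.manifest > manifest.bin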
>>
>>
>> On Wed, 10 Nov 2021 at 23:51, Boris <bb@xxxxxxxxx> wrote:
>>
>>> Oh.
>>> How would one recover from that? It sounds like it basically makes no
>>> difference whether 2, 5 or 10 OSDs are in the blast radius.
>>>
>>> Can the omap key/values be regenerated?
>>> I always thought that data would be stored in the rgw pools. Or am I
>>> mixing things up, and the bluestore metadata has omap k/v, and then there
>>> is the omap k/v from rgw objects?
>>>
>>>
>>> On 10.11.2021 at 22:37, Сергей Процун <prosergey07@xxxxxxxxx> wrote:
>>>
>>>
>>> No, you cannot do that, because the RocksDB with the omap key/values and
>>> the WAL would be gone, meaning all xattr and omap data would be gone too.
>>> Hence the OSD becomes non-operational. But if you notice that the SSD
>>> starts throwing errors, you can start migrating the bluefs device to a
>>> new partition:
>>>
>>> ceph-bluestore-tool bluefs-bdev-migrate \
>>>     --path /var/lib/ceph/osd/ceph-OSD_ID/ \
>>>     --devs-source /var/lib/ceph/osd/ceph-OSD_ID/block \
>>>     --dev-target /path/to/new/db_device
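>>>
>>> The OSD has to be stopped while doing this; afterwards you can verify the
>>> new DB device via its bluestore label, e.g.:
>>>
>>> ceph-bluestore-tool show-label --dev /path/to/new/db_device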
>>>
>>> On Wed, 10 Nov 2021 at 11:51, Boris Behrens <bb@xxxxxxxxx> wrote:
>>>
>>>> Hi,
>>>> we use enterprise SSDs like the SAMSUNG MZ7KM1T9.
>>>> They work very well for our block storage. Some NVMe would be a lot
>>>> nicer, but we have had good experience with these.
>>>>
>>>> One SSD failure taking down 10 OSDs might sound harsh, but this would be
>>>> an okayish risk. Most of the tunables are default in our setup, and it
>>>> looks like the PGs have a failure domain of host. I restart the systems
>>>> on a regular basis for kernel updates.
>>>> Also, disk IO checked with dstat seems to be rather low on the SSDs
>>>> (below 1k IOPS):
>>>> root@s3db18:~# dstat --disk --io  -T  -D sdd
>>>> --dsk/sdd-- ---io/sdd-- --epoch---
>>>>  read  writ| read  writ|  epoch
>>>>  214k 1656k|7.21   126 |1636536603
>>>>  144k 1176k|2.00   200 |1636536604
>>>>  128k 1400k|2.00   230 |1636536605
>>>>
>>>> Normally I would now try this configuration:
>>>> 1 SSD / 10 OSDs - with 150GB for block.db and block.wal, both on the
>>>> same partition as someone stated before, and 200GB extra to move all
>>>> pools except the .data pool to SSDs.
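>>>>
>>>> For the pool placement part, I imagine something like this, assuming the
>>>> SSD-backed OSDs end up with the ssd device class (rule and pool names
>>>> are just placeholders):
>>>>
>>>> ceph osd crush rule create-replicated rgw-meta-ssd default host ssd
>>>> ceph osd pool set ZONE.rgw.buckets.index crush_rule rgw-meta-ssd
>>>> ceph osd pool set ZONE.rgw.meta crush_rule rgw-meta-ssd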
>>>>
>>>> But thinking about 10 downed OSDs if one SSD fails makes me wonder how
>>>> to recover from that.
>>>> IIRC the configuration per OSD is in the LVM tags:
>>>> root@s3db18:~# lvs -o lv_tags
>>>>   LV Tags
>>>>
>>>> ceph.block_device=...,ceph.db_device=/dev/sdd8,ceph.db_uuid=011275a3-4201-8840-a678-c2e23d38bfd6,...
>>>>
>>>> When the SSD fails, can I just remove the tags and restart the OSD with
>>>> ceph-volume lvm activate --all? And after replacing the failed SSD,
>>>> re-add the tags with the correct IDs? Do I need to do anything else to
>>>> prepare a block.db partition?
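>>>>
>>>> For the last part, I assume that attaching a fresh block.db to an OSD
>>>> that currently has none would look roughly like this (OSD stopped, paths
>>>> as placeholders), plus updating the ceph.db_device/ceph.db_uuid LVM tags
>>>> afterwards:
>>>>
>>>> ceph-bluestore-tool bluefs-bdev-new-db \
>>>>     --path /var/lib/ceph/osd/ceph-OSD_ID/ --dev-target /dev/sdd8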
>>>>
>>>> Cheers
>>>>  Boris
>>>>
>>>>
>>>> On Tue, 9 Nov 2021 at 22:15, prosergey07 <prosergey07@xxxxxxxxx> wrote:
>>>>
>>>>> Not sure how much it would help performance to back the OSDs with SSD
>>>>> db and wal devices. Even if you go this route with one SSD per 10 HDDs,
>>>>> you might want to set the failure domain to host in the crush rules in
>>>>> case an SSD goes out of service.
>>>>>
>>>>>  But in practice, the SSD will not boost performance much, especially
>>>>> when shared between 10 HDDs.
>>>>>
>>>>>  We use an NVMe db+wal per OSD and a separate NVMe specifically for the
>>>>> metadata pools. There will be a lot of I/O on the bucket.index pool and
>>>>> the rgw pool that stores user and bucket metadata, so you might want to
>>>>> put them on separate fast storage.
>>>>>
>>>>>  Also, if there are not too many objects (say, huge objects but not
>>>>> tens or hundreds of millions of them), the bucket index will be under
>>>>> less pressure, and SSDs might be okay for the metadata pools in that
>>>>> case.
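>>>>>
>>>>>  You can get a rough feel for that index pressure from the per-shard
>>>>> object counts, e.g.:
>>>>>
>>>>> radosgw-admin bucket limit check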
>>>>>
>>>>>
>>>>>
>>>>> Sent from a Galaxy device
>>>>>
>>>>>
>>>>> -------- Original message --------
>>>>> From: Boris Behrens <bb@xxxxxxxxx>
>>>>> Date: 08.11.21 13:08 (GMT+02:00)
>>>>> To: ceph-users@xxxxxxx
>>>>> Subject: Question if WAL/block.db partition will benefit us
>>>>>
>>>>> Hi,
>>>>> we run a large Octopus S3 cluster with only rotating disks:
>>>>> 1.3 PiB with 177 OSDs, some with an SSD block.db and some without.
>>>>>
>>>>> We have a ton of spare 2TB disks and we just wondered if we can bring
>>>>> them to good use.
>>>>> For every 10 spinning disks we could add one 2TB SSD and create two
>>>>> partitions per OSD (130GB for block.db and 20GB for block.wal). This
>>>>> would leave some empty space on the SSD for wear leveling.
>>>>>
>>>>> The question now is: would we benefit from this? Most of the data that
>>>>> is written to the cluster is very large (50GB and above), and this
>>>>> would take a lot of work to restructure the cluster, and also two other
>>>>> clusters.
>>>>>
>>>>> And does it make a difference to have only a block.db partition, or
>>>>> both a block.db and a block.wal partition?
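>>>>>
>>>>> For reference, the two layouts would be created roughly like this
>>>>> (device names are placeholders); as far as I know, with only --block.db
>>>>> the WAL is simply kept on the DB device:
>>>>>
>>>>> ceph-volume lvm create --data /dev/sdX --block.db /dev/sdY1
>>>>> ceph-volume lvm create --data /dev/sdX --block.db /dev/sdY1 --block.wal /dev/sdY2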
>>>>>
>>>>> Cheers
>>>>> Boris
>>>>> _______________________________________________
>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>
>>>>
>>>>
>
> --
> The self-help group "UTF-8 Problems" will meet this time, as an exception,
> in the large hall.
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



