Yeah. Wipe the disk, but do not remove it from the ceph crush map, as that would result in re-balancing. Then recreate the OSD and let it re-join the cluster.

Thu, Nov 11, 2021, 11:05, Boris Behrens <bb@xxxxxxxxx> wrote:

> Now I finally know what kind of data is stored in RocksDB. I didn't find
> it in the documentation.
> This sounds like a horrible SPoF. How can you recover from it? Purge the
> OSD, wipe the disk and re-add it?
>
> An all-flash cluster is sadly not an option for our s3, as it is just too
> large and we just bought around 60x 8TB disks (in the last couple of
> months).
>
>
> On Wed, Nov 10, 2021 at 23:33, Сергей Процун <prosergey07@xxxxxxxxx> wrote:
>
>> rgw.meta contains user, bucket and bucket instance metadata.
>>
>> rgw.bucket.index contains the bucket indexes, aka shards. If you have 32
>> shards you will have 32 objects in that pool: .dir.BUCKET_ID.0-31. Each
>> one holds part of your object listing. Object names should be run through
>> some sort of hashing and placed into the corresponding shard. Also, the
>> objects in bucket.index are of zero size: all their data is OMAP, stored
>> in RocksDB.
>>
>> Then, once we have the object name, we can check it in the bucket.data
>> pool. The name of the object has a prefix which is the marker id of the
>> bucket. Each rgw object inside the bucket.data pool also has OMAP and
>> xattrs, and that data is also in RocksDB. Like the rgw.manifest xattr,
>> which contains the manifest data. For example, if an object is huge
>> (more than 4MB) it is stored as multiple rados objects. That's where the
>> shadow files come from (pieces of one bigger object). So losing the DB
>> device will make the OSD non-operational, as BlueStore uses the DB device
>> for storing omap and xattrs.
>>
>>
>> Wed, Nov 10, 2021, 23:51, Boris <bb@xxxxxxxxx> wrote:
>>
>>> Oh.
>>> How would one recover from that? Sounds like it basically makes no
>>> difference whether 2, 5 or 10 OSDs are in the blast radius.
>>>
>>> Can the omap key/values be regenerated?
>>> I always thought this data would be stored in the rgw pools. Or am I
>>> mixing things up, and the bluestore metadata has its own omap k/v, and
>>> then there is the omap k/v from the rgw objects on top?
>>>
>>>
>>> On 10.11.2021 at 22:37, Сергей Процун <prosergey07@xxxxxxxxx> wrote:
>>>
>>>
>>> No, you cannot do that. The RocksDB holding the omap key/values and the
>>> WAL would be gone, meaning all xattrs and omap data would be gone too.
>>> Hence the OSD would become non-operational. But if you notice that the
>>> SSD starts throwing errors, you can start migrating the bluefs device to
>>> a new partition:
>>>
>>> ceph-bluestore-tool bluefs-bdev-migrate --devs-source
>>> /var/lib/ceph/osd/ceph-OSD_ID/block --path /var/lib/ceph/osd/ceph-OSD_ID/
>>> --dev-target /path/to/new/db_device
>>>
>>> Wed, Nov 10, 2021, 11:51, Boris Behrens <bb@xxxxxxxxx> wrote:
>>>
>>>> Hi,
>>>> we use enterprise SSDs like SAMSUNG MZ7KM1T9.
>>>> They work very well for our block storage. Some NVMe would be a lot
>>>> nicer, but we have had good experience with them.
>>>>
>>>> One SSD failure taking down 10 OSDs might sound harsh, but this would
>>>> be an okayish risk. Most of the tunables are at their defaults in our
>>>> setup, and it looks like the PGs have a failure domain of host. I
>>>> restart the systems on a regular basis for kernel updates.
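As a rough, untested sketch of the wipe-and-recreate approach mentioned at
the top of this mail (the OSD id 17 and the devices /dev/sdX, /dev/sdY are
placeholders; this assumes ceph-volume with LVM):

  # stop the dead OSD and mark it destroyed; this keeps its id and CRUSH
  # entry, so the map does not get re-weighted (the re-balancing mentioned above)
  systemctl stop ceph-osd@17
  ceph osd destroy 17 --yes-i-really-mean-it

  # wipe the old data disk (and the db partition, if it is being replaced too)
  ceph-volume lvm zap --destroy /dev/sdX

  # recreate the OSD under the same id and let it re-join and backfill
  ceph-volume lvm create --osd-id 17 --data /dev/sdX --block.db /dev/sdY

Destroying instead of purging is what keeps the CRUSH entry and id in place.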
>>>> Also, checking disk IO with dstat, it seems rather low on the SSDs
>>>> (below 1k IOPS):
>>>> root@s3db18:~# dstat --disk --io -T -D sdd
>>>> --dsk/sdd-- ---io/sdd-- --epoch---
>>>>  read  writ| read  writ|  epoch
>>>>  214k 1656k|7.21  126 |1636536603
>>>>  144k 1176k|2.00  200 |1636536604
>>>>  128k 1400k|2.00  230 |1636536605
>>>>
>>>> Normally I would now try this configuration:
>>>> 1 SSD / 10 OSDs, with 150GB for block.db and block.wal, both on the
>>>> same partition as someone stated before, and 200GB extra to move all
>>>> pools except the .data pool to SSDs.
>>>>
>>>> But thinking about 10 downed OSDs if one SSD fails lets me wonder how
>>>> to recover from that.
>>>> IIRC the per-OSD configuration is in the LVM tags:
>>>> root@s3db18:~# lvs -o lv_tags
>>>> LV Tags
>>>> ceph.block_device=...,ceph.db_device=/dev/sdd8,ceph.db_uuid=011275a3-4201-8840-a678-c2e23d38bfd6,...
>>>>
>>>> When the SSD fails, can I just remove the tags and restart the OSD with
>>>> ceph-volume lvm activate --all? And after replacing the failed SSD,
>>>> re-add the tags with the correct IDs? Do I need to do anything else to
>>>> prepare a block.db partition?
>>>>
>>>> Cheers
>>>> Boris
>>>>
>>>>
>>>> On Tue, Nov 9, 2021 at 22:15, prosergey07 <prosergey07@xxxxxxxxx> wrote:
>>>>
>>>>> Not sure how much it would help performance to back the OSDs with SSD
>>>>> DB and WAL devices. Even if you go this route with one SSD per 10 HDDs,
>>>>> you might want to set the failure domain to host in the CRUSH rules in
>>>>> case an SSD goes out of service.
>>>>>
>>>>> But in practice an SSD will not boost performance that much, especially
>>>>> when shared between 10 HDDs.
>>>>>
>>>>> We use an NVMe db+wal per OSD and a separate NVMe specifically for the
>>>>> metadata pools. There will be a lot of I/O on the bucket.index pool and
>>>>> the rgw pool which stores user and bucket metadata, so you might want
>>>>> to put them on separate fast storage.
>>>>>
>>>>> Also, if there are not too many objects (huge objects, but not tens or
>>>>> hundreds of millions of them), then the bucket index will be under less
>>>>> pressure and an SSD might be okay for the metadata pools in that case.
>>>>>
>>>>>
>>>>>
>>>>> Sent from a Galaxy device
>>>>>
>>>>>
>>>>> -------- Original message --------
>>>>> From: Boris Behrens <bb@xxxxxxxxx>
>>>>> Date: 08.11.21 13:08 (GMT+02:00)
>>>>> To: ceph-users@xxxxxxx
>>>>> Subject: Question if WAL/block.db partition will benefit us
>>>>>
>>>>> Hi,
>>>>> we run a larger octopus s3 cluster with only rotating disks.
>>>>> 1.3 PiB with 177 OSDs, some with an SSD block.db and some without.
>>>>>
>>>>> We have a ton of spare 2TB disks and we just wondered if we can bring
>>>>> them to good use.
>>>>> For every 10 spinning disks we could add one 2TB SSD and we would
>>>>> create two partitions per OSD (130GB for block.db and 20GB for
>>>>> block.wal). This would leave some empty space on the SSD for wear
>>>>> leveling.
>>>>>
>>>>> The question now is: would we benefit from this? Most of the data that
>>>>> is written to the cluster is very large (50GB and above). This would
>>>>> take a lot of work to restructure the cluster, and also two other
>>>>> clusters.
>>>>>
>>>>> And does it make a difference to have only a block.db partition, or
>>>>> both a block.db and a block.wal partition?
>>>>>
>>>>> Cheers
>>>>> Boris
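On the question above about what is needed to prepare a block.db partition
for an existing OSD, a rough sketch with ceph-bluestore-tool might look like
this (untested; the OSD id 17 and the target /dev/sdd8 are placeholders, and
the ceph-volume LVM tags shown above, ceph.db_device / ceph.db_uuid, would
still need to point at the new device so that lvm activate can find it):

  # stop the OSD before touching its devices
  systemctl stop ceph-osd@17

  # attach a new, empty DB device to the existing bluestore OSD
  ceph-bluestore-tool bluefs-bdev-new-db \
      --path /var/lib/ceph/osd/ceph-17 \
      --dev-target /dev/sdd8

  # move the RocksDB data that currently sits on the slow device over to it
  ceph-bluestore-tool bluefs-bdev-migrate \
      --path /var/lib/ceph/osd/ceph-17 \
      --devs-source /var/lib/ceph/osd/ceph-17/block \
      --dev-target /var/lib/ceph/osd/ceph-17/block.db

  systemctl start ceph-osd@17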
>>>>> _______________________________________________
>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>
>>>>
>
>
> --
> The self-help group "UTF-8 problems" is meeting in the big hall this
> time, as an exception.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
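To make the earlier point about the bucket index concrete, the shard objects
can be inspected directly with rados. A small sketch (the pool name assumes
the default zone, and "mybucket" / BUCKET_ID are placeholders):

  # find the bucket id / marker
  radosgw-admin bucket stats --bucket=mybucket | grep '"id"'

  # list the per-shard index objects (.dir.BUCKET_ID.0 ... .dir.BUCKET_ID.31)
  rados -p default.rgw.buckets.index ls | grep '^\.dir\.' | head

  # the index object itself should report size 0 ...
  rados -p default.rgw.buckets.index stat .dir.BUCKET_ID.0

  # ... while the actual listing lives in omap, i.e. in the OSD's RocksDB
  rados -p default.rgw.buckets.index listomapkeys .dir.BUCKET_ID.0 | head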