> Oh.
> How would one recover from that? Sounds like it basically makes no difference if 2, 5 or 10 OSDs are in the blast radius.

Perhaps. But a larger blast radius means that you lose a larger percentage of your cluster, assuming that you have a CRUSH failure domain no smaller than `host`:

- Reduced performance until repair
- Longer repair process
- If you're using SATA drives, repair may saturate your HBA, which slows down both recovery and clients.
- Depending on your Ceph release and configuration, your cluster may try to restore redundancy by making copies of surviving data on surviving nodes, which may not have enough spare capacity, so their OSDs may enter nearfull, backfillfull, or even full states. Careful selection of mon_osd_down_out_subtree_limit can forestall this when an entire host is down (see the sketch below), with the tradeoff of reduced redundancy until the host is restored.
- If your whole cluster is 30 OSDs, 10 being down is a whopping 1/3 of the whole. If it's 1000 OSDs, that's less of a concern.

> Can the omap key/values be regenerated?
> I always thought these data would be stored in the rgw pools. Or am I mixing things up, and the bluestore metadata is kept as omap k/v? And then there is the omap k/v from rgw objects?

Doing so might take at least as much time, effort, and hassle as just repairing and backfilling the OSDs in toto, though there are multiple factors.

This is one reason why I've long recommended all-flash clusters. Fewer interdependencies, less complexity, favorable blast radius, shorter MTTR. These contribute to TCO in very real ways.
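To expand on that subtree limit: a minimal sketch of setting it at runtime. This is untested here, and `host` is only an example value that should match your actual CRUSH failure domain:

    # Don't automatically mark OSDs "out" when the down subtree is an entire host;
    # no mass backfill starts, but redundancy stays reduced until the host is back.
    ceph config set mon mon_osd_down_out_subtree_limit host
    # Confirm what the mons are actually running with
    ceph config get mon mon_osd_down_out_subtree_limit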
>
>> On 10.11.2021 at 22:37, Сергей Процун <prosergey07@xxxxxxxxx> wrote:
>>
>> No, you can not do that, because the RocksDB for omap key/values and the WAL would be gone, meaning all xattrs and omap data will be gone too. Hence the osd will become unusable.
>>
>> ceph-bluestore-tool bluefs-bdev-migrate --devs-source /var/lib/ceph/osd/ceph-OSD_ID/block --path /var/lib/ceph/osd/ceph-OSD_ID/ --dev-target /path/to/new/db_device
>>
>> On Wed, 10 Nov 2021 at 11:51, Boris Behrens <bb@xxxxxxxxx> wrote:
>>> Hi,
>>> we use enterprise SSDs like the SAMSUNG MZ7KM1T9.
>>> They work very well for our block storage. Some NVMe would be a lot nicer, but we have had good experience with them.
>>>
>>> One SSD failure taking down 10 OSDs might sound harsh, but this would be an okayish risk. Most of the tunables are at their defaults in our setup, and it looks like the PGs have a failure domain of host. I restart the systems on a regular basis for kernel updates.
>>> Also, checking disk I/O with dstat, it seems to be rather low on the SSDs (below 1k IOPS):
>>> root@s3db18:~# dstat --disk --io -T -D sdd
>>> --dsk/sdd-- ---io/sdd-- --epoch---
>>>  read  writ| read  writ| epoch
>>>  214k 1656k|7.21   126 |1636536603
>>>  144k 1176k|2.00   200 |1636536604
>>>  128k 1400k|2.00   230 |1636536605
>>>
>>> Normally I would now try this configuration:
>>> 1 SSD / 10 OSDs - having 150GB for block.db and block.wal, both on the same partition as someone stated before, and 200GB extra to move all pools except the .data pool to SSDs.
>>>
>>> But thinking about 10 downed OSDs if one SSD fails lets me wonder how to recover from that.
>>> IIRC the configuration per OSD is in the LVM tags:
>>> root@s3db18:~# lvs -o lv_tags
>>> LV Tags
>>> ceph.block_device=...,ceph.db_device=/dev/sdd8,ceph.db_uuid=011275a3-4201-8840-a678-c2e23d38bfd6,...
>>>
>>> When the SSD fails, can I just remove the tags and restart the OSD with ceph-volume lvm activate --all? And after replacing the failed SSD, re-add the tags with the correct IDs? Do I need to do anything else to prepare a block.db partition?
>>>
>>> Cheers
>>> Boris
>>>
>>>> On Tue, 9 Nov 2021 at 22:15, prosergey07 <prosergey07@xxxxxxxxx> wrote:
>>>> Not sure how much it would help the performance to have OSDs backed with ssd db and wal devices. Even if you go this route with one ssd per 10 hdds, you might want to set the failure domain per host in the crush rules, in case the ssd goes out of service.
>>>>
>>>> But in practice the ssd will not help too much to boost performance, especially when sharing it between 10 hdds.
>>>>
>>>> We use an nvme db+wal per osd and a separate nvme specifically for the metadata pools. There will be a lot of I/O on the bucket.index pool and the rgw pool which stores user and bucket metadata, so you might want to put them on separate fast storage.
>>>>
>>>> Also, if there will not be too many objects (i.e. huge objects, but not tens or hundreds of millions of them), then the bucket index will have less pressure and an ssd might be okay for the metadata pools in that case.
>>>>
>>>> Sent from a Galaxy device
>>>>
>>>> -------- Original message --------
>>>> From: Boris Behrens <bb@xxxxxxxxx>
>>>> Date: 08.11.21 13:08 (GMT+02:00)
>>>> To: ceph-users@xxxxxxx
>>>> Subject: Question if WAL/block.db partition will benefit us
>>>>
>>>> Hi,
>>>> we run a larger octopus s3 cluster with only rotating disks.
>>>> 1.3 PiB with 177 OSDs, some with an SSD block.db and some without.
>>>>
>>>> We have a ton of spare 2TB disks and we just wondered if we can bring them to good use.
>>>> For every 10 spinning disks we could add one 2TB SSD, and we would create two partitions per OSD (130GB for block.db and 20GB for block.wal). This would leave some empty space on the SSD for wear leveling.
>>>>
>>>> The question now is: would we benefit from this? Most of the data that is written to the cluster is very large (50GB and above). This would take a lot of work to restructure the cluster, and also two other clusters.
>>>>
>>>> And does it make a difference to have only a block.db partition, or a block.db and a block.wal partition?
>>>>
>>>> Cheers
>>>> Boris
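Coming back to the question about preparing a block.db partition: when (re)creating an OSD, ceph-volume can usually take the partition directly. A rough sketch with placeholder device names, not tested here; note that when no separate --block.wal is given, the WAL simply lives on the DB device:

    # /dev/sdb is the data HDD, /dev/sdd1 a ~150GB partition on the shared SSD (placeholders)
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdd1
    # Show how the resulting OSD is wired up (block, block.db, LVM tags)
    ceph-volume lvm list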