Re: Bluestore nvme DB/WAL size

> It'll cause problems if your only NVMe drive dies - you'll lose all the DB partitions and all the OSDs will fail


The severity of this depends a lot on the size of the cluster.  If there are only, say, 4 nodes total, the loss of a quarter of the OSDs will for sure be somewhere between painful and fatal, especially if the subtree limit doesn't forestall rebalancing, and more so with EC than with replication.  From a pain angle, though, this is no worse than if the server itself smokes.
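
To put rough numbers on that blast radius, a quick back-of-the-envelope sketch (plain Python; the node/OSD counts are made-up assumptions purely for illustration, not Vlad's actual layout):

    # Blast radius when a single shared DB/WAL NVMe dies.
    # Counts below are illustrative assumptions, not real cluster values.
    nodes = 4
    osds_per_node = 12
    osds_per_nvme = 12          # all of a node's OSDs share one DB/WAL NVMe

    total_osds = nodes * osds_per_node
    lost_osds = osds_per_nvme   # one NVMe failure takes its OSDs with it
    print(f"{lost_osds}/{total_osds} OSDs lost = {lost_osds/total_osds:.0%} of the cluster")

With 4 nodes that's 25% of the cluster down at once - the same fraction you'd lose if the whole server smoked.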

It's easy to say "don't do that" but sometimes one doesn't have a choice:

* Unit economics can confound the provisioning of larger or additional external metadata devices.  I'm sure Vlad isn't using spinners because he hates SSDs.

* Devices have to go somewhere.  It's not uncommon for a server to use 2 PCIe slots for NICs (1) and another for an HBA, leaving as few as one or zero free.  Sometimes the potential for a second PCIe riser is ruled out by the need to provision a rear drive cage for OS/boot drives in order to maximize front-panel bay availability.

* Cannibalizing one or more front drive bays for metadata drives can be problematic:
- Usable cluster capacity is decreased, and unit economics along with it
- Dogfood or draconian corporate policy (Herbert! Herbert!) can prohibit this.  I've personally been prohibited in the past from the obvious choice of a simple open-market LFF-to-SFF adapter because it wasn't officially "supported" and would use components without a corporate SKU.

The 4% guidance was 1% until not all that long ago.  Guidance on calculating adequate sizing based on application and workload would be nice to have.  I've been told that an object storage (RGW) use case can readily get away with less, because L2/L3/etc. are both rarely accessed and the first to be overflowed onto slower storage, and that block (RBD) workloads have different access patterns that are more impacted by overflow of the higher levels.  As RBD pools are increasingly deployed on SSD/NVMe devices, the case for colocating their metadata is strong, and it obviates having to worry about sizing before deployment.
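
For a sense of what the two rules of thumb imply in absolute terms, simple arithmetic against an example HDD size (the 12 TB figure is just an illustration; actual needs depend on the workload as described above):

    # DB partition size under the 1% and 4% rules of thumb.
    # 12 TB is an assumed example OSD capacity, not a recommendation.
    osd_capacity_gb = 12_000
    for pct in (0.01, 0.04):
        print(f"{pct:.0%} of {osd_capacity_gb} GB = {osd_capacity_gb * pct:.0f} GB DB partition")

Going from the old 1% to the current 4% guidance quadruples the NVMe capacity needed per spinner, which is exactly where the unit-economics squeeze above comes from.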


(1) One of many reasons to seriously consider not having a separate replication network

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com