> Been reading “Learning Ceph - Second Edition”. An outstanding book, I must say ;)

> So can I get by with using a single SATA SSD (size?) per server for RocksDB / WAL if I'm using Bluestore?

Depends on the rest of your setup and use-case, but I think this would be a bottleneck. Some thoughts:

* You wrote that your servers have 1x 240GB SATA SSD that holds the OS, and 8x 2TB SATA OSD drives.
** Sharing the OS with journal/metadata could lead to contention between the two.
** Since the OS has been doing who-knows-what with that drive, check the lifetime used/remaining with `smartctl -a` (see the sketch further down).
** If the drives have been significantly consumed, their remaining lifetime under the load Ceph will present will be limited.
** SSDs selected for OS/boot drives often have relatively low durability (DWPD) and may suffer a performance cliff when given a steady load. Look up the specs on your model.
** 8 OSDs sharing a single SSD for metadata is a very large failure domain. If/when you lose that SSD, you lose all 8 OSDs and the host itself. You would want to set the subtree limit to “host”, and not fill the OSDs past, say, 60% so that you’d have room to backfill in case of a failure not caught by the subtree limit (config sketch further down).
** 8 HDD OSDs sharing a single SATA SSD for metadata will be bottlenecked unless your workload is substantially reads.
* Single SATA HDD on the mons:
** When it fails, you lose the mon.
** I have personally seen issues due to HDDs not handling peak demands, resulting in an outage.

The gear you list is fairly old and underpowered, but sure, you could use it *as a PoC*. For a production deployment you’d want different hardware.

> - Is putting the journal on a partition of the SATA drives a real I/O killer? (this is how my Proxmox boxes are set up)

With Filestore and HDDs, absolutely. Even worse if you were to use EC. There may be some coalescing of ops, but you’re still going to get a *lot* of long seeks, and spinners can only do a certain number of IOPS. I think in the book I described personal experience with such a setup that even tickled a design flaw on the part of a certain HDD vendor. Eventually I was permitted to get journal devices (this was pre-BlueStore GA), which were PCIe NVMe. Write performance doubled. Then we hit a race condition / timing issue in nvme.ko, but I digress...

When using SATA *SSD*s for OSDs you have no seeks, of course, and colocating the journals/metadata is more viable.

> - If YES to the above, then is a SATA SSD acceptable for journal device, or should I definitely consider PCIe SSD? (I'd have to limit to one per server, which I know isn't optimal, but price prevents otherwise…)

Optanes for these systems would be overkill. If you plan to have the PoC cluster run any appreciable load for any length of time, I might suggest instead adding 2x SATA SSDs per server, so you could map 4x OSDs to each (ceph-volume sketch further down). These would not need to be large: the upstream party line would have you allocate 80GB on each, though depending on your use-case you might well do fine with less; 2x 240GB class or even 2x 120GB class should suffice for PoC service. For production I would advise “enterprise” class drives with at least 1 DWPD durability; recently we’ve seen a certain vendor weasel their durability by computing it incorrectly.

Depending on what you expect out of your PoC, and especially assuming you use BlueStore, you might get away with colocation, but do not expect performance that can be extrapolated to a production deployment.
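Since I keep saying “check the wear/specs”: a minimal sketch of the SMART check, assuming /dev/sda is the boot SSD. Attribute names vary by vendor (Intel tends to report Media_Wearout_Indicator, Samsung Wear_Leveling_Count), so adjust the grep to taste.

    # Hypothetical device name; run against each SSD you intend to press into Ceph service.
    smartctl -a /dev/sda | grep -Ei 'wear|percent|lifetime|lbas'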
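Re the subtree limit and the 60% figure, a sketch of what I have in mind; the option and command below assume Luminous or later.

    # ceph.conf on the mons: if an entire host goes down (e.g. the shared
    # metadata SSD takes its 8 OSDs and the OS with it), don't automatically
    # mark that host's OSDs out and kick off a mass backfill.
    [mon]
    mon_osd_down_out_subtree_limit = host

And to get warned well before the OSDs are too full to absorb a backfill:

    ceph osd set-nearfull-ratio 0.60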
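Re mapping 4x OSDs to each added SATA SSD, a sketch with made-up device names: /dev/sdb through /dev/sde are four of the HDDs, /dev/sdj is one of the added SSDs, pre-partitioned into sdj1-sdj4.

    # BlueStore OSDs with RocksDB on the shared SSD; when only --block.db is
    # given, the WAL lands on the same device, which is what you want here.
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdj1
    ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/sdj2
    ceph-volume lvm create --bluestore --data /dev/sdd --block.db /dev/sdj3
    ceph-volume lvm create --bluestore --data /dev/sde --block.db /dev/sdj4

Repeat against the second SSD for the other four OSDs.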
With the NICs you have, you could keep it simple and skip a replication/back-end network altogether, or you could bond the LOM ports and split them. Whatever’s simplest with the network infrastructure you have. For production you’d most likely want LACP-bonded NICs, but if the network tech is modern, skipping the replication network may be very feasible. But I’m getting ahead of your context …

HTH -- aad
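P.S. For the “skip the replication network” option, the whole configuration is just a single public network in ceph.conf; the subnet here is of course made up.

    [global]
    public_network = 192.0.2.0/24
    # No cluster_network line: replication and backfill traffic simply
    # share the public network, which is fine at this scale.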