Hello,

On Tue, 24 Apr 2018 11:39:33 +0200 Florian Florensa wrote:

> 2018-04-24 3:24 GMT+02:00 Christian Balzer <chibi@xxxxxxx>:
> > Hello,
> >
>
> Hi Christian, and thanks for your detailed answer.
>
> > On Mon, 23 Apr 2018 17:43:03 +0200 Florian Florensa wrote:
> >
> >> Hello everyone,
> >>
> >> I am in the process of designing a Ceph cluster, that will contain
> >> only SSD OSDs, and I was wondering how should I size my cpu.
> > Several threads about this around here, but first things first.
> > Any specifics about the storage needs, i.e. do you think you need the SSDs
> > for bandwidth or for IOPS reasons primarily?
> > Lots of smallish writes or large reads/writes?
> >
> >> The cluster will only be used for block storage.
> >> The OSDs will be Samsung PM863 (2Tb or 4Tb, this will be determined
> >
> > I assume PM863a, the non "a" model seems to be gone.
> > And that's a 1.3 DWPD drive, with a collocated journal or lots of small
> > writes and a collocated WAL/DB it will be half of that.
> > So run the numbers and make sure this is actually a good fit in the
> > endurance area.
> > Of course depending on your needs, journals or WAL/DB on higher endurance
> > NVMes might be a much better fit anyway.
> >
>
> Well, if it does makes sense and maybe a few NVMEs for WAL/DB would
> make sense, how many SSDs should i put on a single nvme ?
> And how should i size them, AFAIK its ~1.6% of drive capacity for
> WAL+DB using default values.
>
There've been numerous sizing threads here and the conclusion was an
unsurprising "it depends". That said, they will give you a good basis on
where to start.
Size will also depend on what kind of NVMe you can/will deploy: if you go
with a smallish one that has n times the endurance of the SSDs behind it,
you will need to be more precise; if you can't get the endurance and
compensate for it with a larger NVMe, you can obviously go all out on the
WAL/DB size.
Given the number of hosts you plan to deploy initially, a ratio of up to
1:5 (SSDs per NVMe) seems sensible.
Of course the speed of the NVMe factors in here as well; it's not as
predictable as with filestore journals, but it still matters, esp. with
regard to IOPS.

Again, this depends on your use case, but a rated 1.3 DWPD that could
effectively be as low as 0.65 DWPD strikes me as rather low.
The SM variant might be a better fit if you go for a design w/o NVMes.

> >> when we will set the total volumetry in stone), and it will be in 2U
> >> 24SSDs servers
> > How many servers are you thinking about?
> > Because the fact that you're willing to double the SSD size but not the
> > number of servers suggests that you're thinking about a small number of
> > servers.
> > And while dense servers will save you space and money, more and smaller
> > servers are generally a better fit for Ceph, not least when considering
> > failure domains (a host typically).
> >
>
> The cluster should start somewhere between 20-30 OSDs nodes, and 3monitors.
> And it should grow in the forseeable future to up to 50 ODSs nodes,
> while keeping 3monitors, but that would be in a while (like 2-3years).
> Of course this number of node depends on the usability of storage and iops.
> The goal is to replace the SANs for Hypervisors and some diskless baremetal
> servers.
>
20-30 OSD nodes with 24 SSDs each; no wonder you didn't blink at the price
tag of the Xeon 2699v4.
Anyway, that's a decent number and it will mitigate the impact of a node
loss nicely.
If IOPS are crucial, the aforementioned fast CPUs on top of fast storage
are as well.
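To put some very rough numbers on the WAL/DB sizing and endurance points
above, here is a back-of-the-envelope sketch (Python; the 1.6%, 1:5 and
DWPD figures are simply the assumptions from this thread, not vendor data):

# Back-of-the-envelope WAL/DB and endurance numbers; all inputs are
# assumptions taken from this thread, adjust to your actual drives.
osd_size_tb   = 4.0      # drive capacity under consideration (2 or 4 TB)
waldb_ratio   = 0.016    # ~1.6% of drive capacity for WAL+DB at defaults
ssds_per_nvme = 5        # the 1:5 ratio mentioned above

waldb_per_osd_gb = osd_size_tb * 1000 * waldb_ratio       # ~64 GB per OSD
nvme_min_gb      = waldb_per_osd_gb * ssds_per_nvme       # ~320 GB minimum

rated_dwpd      = 1.3                   # PM863a rating
effective_dwpd  = rated_dwpd / 2        # roughly halved by collocated WAL/DB
daily_writes_tb = effective_dwpd * osd_size_tb            # ~2.6 TB/day/OSD

print(f"WAL/DB per OSD         : ~{waldb_per_osd_gb:.0f} GB")
print(f"NVMe size at 1:5       : >= {nvme_min_gb:.0f} GB")
print(f"Sustainable writes/OSD : ~{daily_writes_tb:.1f} TB/day")

Whether ~2.6 TB of writes per day and OSD is enough obviously depends on
what the hypervisors actually do, which is exactly the "run the numbers"
part above.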
> >> Those server will probably be either Supermicro 2029U-E1CR4T or
> >> Supermicro 2028R-E1CR48L.
> >> I’ve read quite a lot of documentation regarding hardware choices, and
> >> I can’t find a ‘guideline’ for OSDs on SSD with colocated journal.
> > If this is a new cluster, that would be collocated WAL/DB and Bluestore.
> > Never mind my misgivings about Bluestore, at this point in time you
> > probably don't want to deploy a new cluster with filestore, unless you
> > have very specific needs and know what you're doing.
> >
>
> Yup the goal was to go for bluestore, as in an RBD workload it seems to
> be the better option (avoiding the write amplification and its induced latency)
>
Ah, but the filestore journal usually does _improve_ latency when it is on
SSD. That's why with bluestore the WAL/DB gets used for journaling/coalescing
of small writes as well; otherwise it would be slower than the same setup
with filestore. And that is why you need to keep this little detail in mind
when looking at endurance (DWPD) figures.
For larger writes bluestore does indeed win out with single writes.

> >> I was pointing for either dual ‘Xeon gold 6146’ or dual ‘Xeon 2699v4’
> >> for the cpus, depending on the chassis.
> > The first one is a much better fit in terms of the "a fast core for each
> > OSD" philosophy needed for low latency and high IOPS. The 2nd is just
> > overkill, 24 real cores will do and for extreme cases I'm sure I can still
> > whip a fio setting that will saturate the 44 real cores of the 2nd setup.
> > Of course dual CPU configurations like this come with a potential latency
> > penalty for NUMA misses.
> >
>
> Should'nt i be able to 'pin' OSD daemon to specific CPU to avoid NUMA
> zone crossing ?
>
Never mind that this is a righteous pain: with many if not all dual-socket
setups you _need_ to cross zones at some point to get to the HW, be it
network or storage. The Linux kernel already does a pretty decent job of
keeping things from moving around needlessly.

> > Unfortunately Supermicro didn't release my suggested Epyc based Ceph
> > storage node (yet?).
> > I was mentioning a single socket 1U (or 2U double) with 10 2.5 bays, with
> > up to 2 NVMe in those bays.
> > But even dual CPU Epyc based systems have a clear speed advantage when it
> > comes to NUMA misses due to the socket interconnect (Infinity Fabric).
> >
> > Do consider this alternative setup:
> > https://www.supermicro.com.tw/Aplus/system/1U/1123/AS-1123US-TR4.cfm
> > With either 8 SSDs and 2 NVMes or 10 SSDs and either
> > 2x Epyc 7251 (adequate core ratio and speed, cheap) or
> > 2x Epyc 7351 (massive overkill, but still 1/4 of the Intel price tag).
> >
> > The unreleased AS-2123US-TN24R25 with 2x Epyc 7351 might be a good fit as
> > well.
> >
>
> I was also considering Epyc, but i was considering using EC on my RBD pools to
> maximize the available capacity. Anyone has experiences using EC on RBD with or
> without Epyc cpus ?

I wouldn't expect any significant difference from Intel here, and the
encoding part is something that should benefit from more (and faster) cores.
However, I'd suggest you do some testing and fact-finding with regard to how
EC impacts IOPS in general and how much write amplification it causes (very
important for your endurance calculations).
Having never played with EC, I can't help you here.

> Also if i am able to go for single cpu chassis, that would decrease
> the electrical footprint
> of each node, thus allowing me to put more of them per rack.
>
Yup, but if you go for 24 SSDs per chassis that is unlikely, despite
something like an Epyc 7551P being good enough for it.
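Coming back to the EC question for a moment: the raw-space side of it is at
least easy to put numbers on. A small sketch with example profiles (the k+m
values are purely illustrative, not a recommendation):

# Raw bytes hitting the OSDs per byte of client data, best case
# (full-stripe writes for EC, plain replication for comparison).
def ec_overhead(k, m):
    return (k + m) / k

def replica_overhead(size):
    return size

print("EC 4+2        :", ec_overhead(4, 2), "x")    # 1.5x
print("EC 8+3        :", ec_overhead(8, 3), "x")    # ~1.38x
print("replica size 3:", replica_overhead(3), "x")  # 3.0x

# Note: partial overwrites on an EC pool trigger a read-modify-write of the
# affected stripe, so the effective write amplification (and IOPS cost) for
# small RBD writes can be considerably higher than the best case above.

The partial-overwrite behaviour is the part worth benchmarking, since that
is where both the IOPS hit and the extra endurance cost will come from.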
> >> For the network part, I was thinking of using two Dual port connectx4
> >> Lx from mellanox per servers.
> >>
> > Going to what kind of network/switches?
> >
>
> I was thinking of having up to 4*25Gb ethernet for each node going to
> a pair of switch
> to be able to withstand the 'loss' of a switch and a network card.
>
With VLAG, I suppose? Anyway, that's sound enough.

> >> If anyone has some ideas/thoughts/pointers, I would be glad to hear them.
> >>
> > RAM, you'll need a lot of it, even more with Bluestore given the current
> > caching.
> > I'd say 1GB per TB storage as usual and 1-2GB extra per OSD.
> >
> From the latests documentation i read, i was under the impression that
> it was 16Gb
> for the 'OS' and 2Gb per OSD daemon, so for 24SSD it would have been
> 16GB + 48GB,
> rounded up to be confortable to 128GB, so with 24* 4Tb it would have
> been around the
> same value as your calculation, but i wonder which one is more suitable ?
>
If in doubt, trust the documentation, unless it has been proven wrong or
incomplete. At 128GB you're in a good spot either way.

Christian

> >> Regards,
> >>
> >> Florian

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com