2018-04-24 3:24 GMT+02:00 Christian Balzer <chibi@xxxxxxx>:
> Hello,
>
Hi Christian, and thanks for your detailed answer.

> On Mon, 23 Apr 2018 17:43:03 +0200 Florian Florensa wrote:
>
>> Hello everyone,
>>
>> I am in the process of designing a Ceph cluster, that will contain
>> only SSD OSDs, and I was wondering how should I size my cpu.
>
> Several threads about this around here, but first things first.
> Any specifics about the storage needs, i.e. do you think you need the
> SSDs for bandwidth or for IOPS reasons primarily?
> Lots of smallish writes or large reads/writes?
>
>> The cluster will only be used for block storage.
>> The OSDs will be Samsung PM863 (2Tb or 4Tb, this will be determined
>
> I assume PM863a, the non "a" model seems to be gone.
> And that's a 1.3 DWPD drive, with a collocated journal or lots of small
> writes and a collocated WAL/DB it will be half of that.
> So run the numbers and make sure this is actually a good fit in the
> endurance area.
> Of course depending on your needs, journals or WAL/DB on higher
> endurance NVMes might be a much better fit anyway.
>
Well, if it does make sense to put the WAL/DB on a few NVMe drives, how
many SSDs should I put behind a single NVMe, and how should I size it?
As far as I know, the defaults work out to roughly 1.6% of the drive
capacity for WAL+DB (rough numbers below).
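For what it is worth, this is the back-of-the-envelope sketch I had in
mind. The ~1.6% figure is the one quoted above, and the 6-OSDs-per-NVMe
split and ~1 GB WAL are just my assumptions, not official guidance:

    # Rough WAL/DB sizing sketch (assumed values, not a rule):
    osd_capacity_gb = 4000          # one 4 TB SSD OSD
    db_fraction = 0.016             # ~1.6% of drive capacity for the DB
    wal_gb = 1                      # assuming the WAL stays around 1 GB
    osds_per_nvme = 6               # e.g. 24 SSDs spread over 4 NVMe drives

    db_gb = osd_capacity_gb * db_fraction           # ~64 GB per OSD
    nvme_gb = osds_per_nvme * (db_gb + wal_gb)      # ~390 GB per NVMe
    print(f"DB per OSD ~{db_gb:.0f} GB, per NVMe ~{nvme_gb:.0f} GB")

With 4 TB OSDs that is already ~390 GB of NVMe for every 6 SSDs, which is
exactly why I am asking how people split and size these in practice.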
>> when we will set the total volumetry in stone), and it will be in 2U
>> 24SSDs servers
>
> How many servers are you thinking about?
> Because the fact that you're willing to double the SSD size but not the
> number of servers suggests that you're thinking about a small number of
> servers.
> And while dense servers will save you space and money, more and smaller
> servers are generally a better fit for Ceph, not least when considering
> failure domains (a host typically).
>
The cluster should start with somewhere between 20 and 30 OSD nodes plus
3 monitors, and should grow in the foreseeable future to up to 50 OSD
nodes while keeping 3 monitors, but that is a while away (2-3 years).
Of course the node count depends on the usable capacity and IOPS we end
up with. The goal is to replace the SANs behind our hypervisors and some
diskless bare-metal servers.

>> Those server will probably be either Supermicro 2029U-E1CR4T or
>> Supermicro 2028R-E1CR48L.
>> I’ve read quite a lot of documentation regarding hardware choices, and
>> I can’t find a ‘guideline’ for OSDs on SSD with colocated journal.
>
> If this is a new cluster, that would be collocated WAL/DB and Bluestore.
> Never mind my misgivings about Bluestore, at this point in time you
> probably don't want to deploy a new cluster with filestore, unless you
> have very specific needs and know what you're doing.
>
Yes, the plan is to go with BlueStore; for an RBD workload it seems to be
the better option, avoiding filestore's write amplification and the
latency it induces.

>> I was pointing for either dual ‘Xeon gold 6146’ or dual ‘Xeon 2699v4’
>> for the cpus, depending on the chassis.
>
> The first one is a much better fit in terms of the "a fast core for each
> OSD" philosophy needed for low latency and high IOPS. The 2nd is just
> overkill, 24 real cores will do and for extreme cases I'm sure I can
> still whip a fio setting that will saturate the 44 real cores of the 2nd
> setup.
> Of course dual CPU configurations like this come with a potential
> latency penalty for NUMA misses.
>
Shouldn't I be able to pin each OSD daemon to a specific CPU to avoid
crossing NUMA zones?

> Unfortunately Supermicro didn't release my suggested Epyc based Ceph
> storage node (yet?).
> I was mentioning a single socket 1U (or 2U double) with 10 2.5" bays,
> with up to 2 NVMe in those bays.
> But even dual CPU Epyc based systems have a clear speed advantage when
> it comes to NUMA misses due to the socket interconnect (Infinity
> Fabric).
>
> Do consider this alternative setup:
> https://www.supermicro.com.tw/Aplus/system/1U/1123/AS-1123US-TR4.cfm
> With either 8 SSDs and 2 NVMes or 10 SSDs and either
> 2x Epyc 7251 (adequate core ratio and speed, cheap) or
> 2x Epyc 7351 (massive overkill, but still 1/4 of the Intel price tag).
>
> The unreleased AS-2123US-TN24R25 with 2x Epyc 7351 might be a good fit
> as well.
>
I was also considering Epyc, but with EC on my RBD pools to maximize the
usable capacity. Does anyone have experience running EC-backed RBD, with
or without Epyc CPUs?
Also, if I can go with single-CPU chassis, that would reduce the power
footprint of each node and let me fit more of them per rack.

>> For the network part, I was thinking of using two Dual port connectx4
>> Lx from mellanox per servers.
>>
> Going to what kind of network/switches?
>
I was thinking of up to 4x 25Gb Ethernet per node, going to a pair of
switches, so that a node can withstand the loss of a switch or of a
network card.

>> If anyone has some ideas/thoughts/pointers, I would be glad to hear
>> them.
>>
> RAM, you'll need a lot of it, even more with Bluestore given the current
> caching.
> I'd say 1GB per TB storage as usual and 1-2GB extra per OSD.
>
From the latest documentation I read, I was under the impression it was
16GB for the OS plus 2GB per OSD daemon, so for 24 SSDs that would be
16GB + 48GB, rounded up to a comfortable 128GB. With 24x 4TB drives that
ends up in the same ballpark as your calculation, but I wonder which of
the two rules is the better guide (quick comparison at the end of this
mail).

>> Regards,
>>
>> Florian
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Rakuten Communications
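P.S. The quick RAM comparison I mentioned above, just as a sketch using
the numbers from this thread (1GB per TB plus 1-2GB per OSD on one side,
the 16GB + 2GB-per-OSD figure I remembered on the other):

    # RAM sizing: rule of thumb vs the docs figure, for 24x 4TB OSDs
    osds, osd_tb = 24, 4
    low  = osds * osd_tb + osds * 1     # 1 GB/TB + 1 GB per OSD -> 120 GB
    high = osds * osd_tb + osds * 2     # 1 GB/TB + 2 GB per OSD -> 144 GB
    docs = 16 + osds * 2                # 16 GB base + 2 GB per OSD -> 64 GB
    print(low, high, docs)              # 120 144 64

So the per-TB rule lands around 120-144GB with 4TB drives, while the
docs-style figure alone would only be 64GB; the 128GB I was planning sits
comfortably in between.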