Re: Questions regarding hardware design of an SSD only cluster

2018-04-24 3:24 GMT+02:00 Christian Balzer <chibi@xxxxxxx>:
> Hello,
>

Hi Christian, and thanks for your detailed answer.

> On Mon, 23 Apr 2018 17:43:03 +0200 Florian Florensa wrote:
>
>> Hello everyone,
>>
>> I am in the process of designing a Ceph cluster, that will contain
>> only SSD OSDs, and I was wondering how should I size my cpu.
> Several threads about this around here, but first things first.
> Any specifics about the storage needs, i.e. do you think you need the SSDs
> for bandwidth or for IOPS reasons primarily?
> Lots of smallish writes or large reads/writes?
>
>> The cluster will only be used for block storage.
>> The OSDs will be Samsung PM863 (2Tb or 4Tb, this will be determined
>
> I assume PM863a, the non "a" model seems to be gone.
> And that's a 1.3 DWPD drive, with a collocated journal or lots of small
> writes and a collocated WAL/DB it will be half of that.
> So run the numbers and make sure this is actually a good fit in the
> endurance area.
> Of course depending on your needs, journals or WAL/DB on higher endurance
> NVMes might be a much better fit anyway.
>

Well, if it makes sense, a few NVMes for WAL/DB could be worth it. How
many SSDs should I put behind a single NVMe? And how should I size
them? AFAIK it's ~1.6% of drive capacity for WAL+DB with default values.
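
Back-of-the-envelope, what that 1.6% figure means per node (pure
arithmetic, not a Ceph tool; the 4TB SSDs and the 1.6TB NVMe are just
assumed round numbers):

    # Assumptions: 4 TB SSD OSDs, ~1.6 % of OSD capacity for WAL+DB,
    # and a hypothetical 1.6 TB NVMe shared as the WAL/DB device.
    OSD_CAPACITY_TB = 4
    WAL_DB_FRACTION = 0.016           # ~1.6 % of drive capacity
    NVME_CAPACITY_TB = 1.6            # assumed shared WAL/DB device

    wal_db_per_osd_gb = OSD_CAPACITY_TB * 1000 * WAL_DB_FRACTION
    osds_per_nvme = int(NVME_CAPACITY_TB * 1000 // wal_db_per_osd_gb)

    print(f"WAL+DB per OSD: ~{wal_db_per_osd_gb:.0f} GB")   # ~64 GB
    print(f"DBs fitting on the NVMe: {osds_per_nvme}")      # 25

So capacity-wise a single NVMe could host a lot of DBs, but I suppose
the failure domain (one dead NVMe taking out every OSD behind it) is
what should really cap the ratio, hence the question.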

>> when we will set the total volumetry in stone), and it will be in 2U
>> 24SSDs servers
> How many servers are you thinking about?
> Because the fact that you're willing to double the SSD size but not the
> number of servers suggests that you're thinking about a small number of
> servers.
> And while dense servers will save you space and money, more and smaller
> servers are generally a better fit for Ceph, not least when considering
> failure domains (a host typically).
>
The cluster should start somewhere between 20 and 30 OSD nodes, plus 3
monitors. It should grow in the foreseeable future to up to 50 OSD
nodes, still with 3 monitors, but that would be a while off (2-3 years).
Of course the node count depends on the usable storage and IOPS we get.
The goal is to replace the SANs for hypervisors and some diskless
bare-metal servers.

>> Those server will probably be either Supermicro 2029U-E1CR4T or
>> Supermicro 2028R-E1CR48L.
>> I’ve read quite a lot of documentation regarding hardware choices, and
>> I can’t find a ‘guideline’ for OSDs on SSD with colocated journal.
> If this is a new cluster, that would be collocated WAL/DB and Bluestore.
> Never mind my misgivings about Bluestore, at this point in time you
> probably don't want to deploy a new cluster with filestore, unless you
> have very specific needs and know what you're doing.
>

Yup, the goal was to go with Bluestore, as it seems to be the better
option for an RBD workload (avoiding the write amplification and the
latency it induces).

>> I was pointing for either dual ‘Xeon gold 6146’ or dual ‘Xeon 2699v4’
>> for the cpus, depending on the chassis.
> The first one is a much better fit in terms of the "a fast core for each
> OSD" philosophy needed for low latency and high IOPS. The 2nd is just
> overkill, 24 real cores will do and for extreme cases I'm sure I can still
> whip a fio setting that will saturate the 44 real cores of the 2nd setup.
> Of course dual CPU configurations like this come with a potential latency
> penalty for NUMA misses.
>

Shouldn't I be able to 'pin' each OSD daemon to a specific CPU to avoid
crossing NUMA zones?
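
Something along these lines is what I had in mind, a minimal sketch
only; in practice I assume one would do it with numactl or a systemd
CPUAffinity= drop-in per ceph-osd service rather than by hand:

    # Sketch: restrict a running process to the cores of one NUMA node
    # (Linux-only; the OSD pid below is hypothetical).
    import os

    def node_cpus(node: int) -> set:
        """Parse /sys cpulist (e.g. '0-11,24-35') into a set of CPU ids."""
        cpus = set()
        with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
            for part in f.read().strip().split(","):
                lo, _, hi = part.partition("-")
                cpus.update(range(int(lo), int(hi or lo) + 1))
        return cpus

    def pin_pid_to_node(pid: int, node: int) -> None:
        """Pin an already-running process (e.g. one ceph-osd) to one node."""
        os.sched_setaffinity(pid, node_cpus(node))

    # e.g. pin_pid_to_node(12345, 0)   # 12345 being a hypothetical OSD pid

The question is whether that is worth the trouble compared to simply
going single-socket.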

> Unfortunately Supermicro didn't release my suggested Epyc based Ceph
> storage node (yet?).
> I was mentioning a single socket 1U (or 2U double) with 10 2.5 bays, with
> up to 2 NVMe in those bays.
> But even dual CPU Epyc based systems have a clear speed advantage when it
> comes to NUMA misses due to the socket interconnect (Infinity Fabric).
>
> Do consider this alternative setup:
> https://www.supermicro.com.tw/Aplus/system/1U/1123/AS-1123US-TR4.cfm
> With either 8 SSDs and 2 NVMes or 10 SSDs and either
> 2x Epyc 7251 (adequate core ratio and speed, cheap) or
> 2x Epyc 7351 (massive overkill, but still 1/4 of the Intel price tag).
>
> The unreleased AS-2123US-TN24R25 with 2x Epyc 7351 might be a good fit as
> well.
>

I was also considering Epyc, and I plan to use EC on my RBD pools to
maximize the available capacity (rough numbers below). Does anyone have
experience with EC on RBD, with or without Epyc CPUs?
Also, if I can go with a single-CPU chassis, that would reduce the
electrical footprint of each node and let me fit more of them per rack.
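
On the capacity argument, roughly (illustrative numbers only, assuming
24 nodes of 24 x 4TB and an EC 4+2 profile I have not settled on):

    raw_tb = 24 * 24 * 4                     # 2304 TB raw across the cluster

    usable_replica3 = raw_tb / 3             # ~768 TB with size=3 replication
    usable_ec_4_2   = raw_tb * 4 / (4 + 2)   # ~1536 TB with EC k=4, m=2

    print(f"replica 3: {usable_replica3:.0f} TB usable")
    print(f"EC 4+2   : {usable_ec_4_2:.0f} TB usable")

Twice the usable capacity is hard to ignore, but I understand EC costs
more CPU and adds latency on small writes, which is exactly why I am
asking about real-world experience.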

>> For the network part, I was thinking of using two Dual port connectx4
>> Lx from mellanox per servers.
>>
> Going to what kind of network/switches?
>

I was thinking of up to 4x25GbE per node, going to a pair of switches,
so we can withstand the loss of a switch or a network card.
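
Quick sanity check on whether that is enough bandwidth per node
(assuming ~500 MB/s per SATA SSD as a round number, and ignoring
replication traffic sharing the same links):

    ssd_bw_gb_s = 24 * 0.5         # ~12 GB/s aggregate from 24 SATA SSDs
    nic_bw_gb_s = 4 * 25 / 8       # 12.5 GB/s aggregate from 4 x 25GbE links

    print(ssd_bw_gb_s, nic_bw_gb_s)   # 12.0 vs 12.5 -- roughly balanced

So for pure streaming it is about balanced; for the IOPS-heavy RBD side
I expect the links to have headroom and latency to matter more than
throughput.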

>> If anyone has some ideas/thoughts/pointers, I would be glad to hear them.
>>
> RAM, you'll need a lot of it, even more with Bluestore given the current
> caching.
> I'd say 1GB per TB storage as usual and 1-2GB extra per OSD.
>
From the latest documentation I read, I was under the impression that it
was 16GB for the OS and 2GB per OSD daemon, so for 24 SSDs that would be
16GB + 48GB, rounded up to a comfortable 128GB. With 24 x 4TB drives
that ends up around the same value as your calculation, but I wonder
which rule is more suitable?
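
Putting both rules side by side for one 24 x 4TB node (plain
arithmetic):

    osds, tb_per_osd = 24, 4

    # Rule from the docs I read: 16 GB base + 2 GB per OSD daemon
    docs_rule_gb = 16 + 2 * osds                    # 64 GB

    # Your rule: 1 GB per TB of storage + 1-2 GB extra per OSD
    your_rule_lo = osds * tb_per_osd + 1 * osds     # 120 GB
    your_rule_hi = osds * tb_per_osd + 2 * osds     # 144 GB

    print(docs_rule_gb, your_rule_lo, your_rule_hi)  # 64 120 144

Rounding the documentation rule up to 128GB happens to land between the
two, so I suppose 128GB per node is a reasonable floor either way.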

>> Regards,
>>
>> Florian
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



