Hello,

On 05/19/2016 03:36 AM, Christian Balzer wrote:
>
> Hello again,
>
> On Wed, 18 May 2016 15:32:50 +0200 Dietmar Rieder wrote:
>
>> Hello Christian,
>>
>>> Hello,
>>>
>>> On Wed, 18 May 2016 13:57:59 +0200 Dietmar Rieder wrote:
>>>
>>>> Dear Ceph users,
>>>>
>>>> I've a question regarding the memory recommendations for an OSD node.
>>>>
>>>> The official Ceph hardware recommendations say that an OSD node should
>>>> have 1GB RAM / TB of OSD capacity [1].
>>>>
>>>> The "Reference Architecture" whitepaper from Red Hat & Supermicro says
>>>> that "typically" 2GB of memory per OSD on an OSD node is used. [2]
>>>>
>>> This question has been asked and answered here countless times.
>>>
>>> Maybe something a bit more detailed ought to be placed in the first
>>> location, or simply a reference to the 2nd one.
>>> But then again, that would detract from the RH added value.
>>
>> Thanks for replying, nonetheless.
>> I checked the list before, but I failed to find a definitive answer; maybe
>> I was not looking hard enough. Anyway, thanks!
>>
> They tend to be hidden sometimes in other threads, but there really is a lot.

It seems so; I'll have to dig deeper into the available discussions...

>
>>>
>>>> According to the recommendation in [1] an OSD node with 24x 8TB OSD
>>>> disks is "underpowered" when it is equipped with 128GB of RAM.
>>>> However, following the "recommendation" in [2], 128GB should be plenty.
>>>>
>>> It's fine per se, the OSD processes will not consume all of that even
>>> in extreme situations.
>>
>> OK, if I understood this correctly, then 128GB should also be enough
>> during rebalancing or backfilling.
>>
> Definitely, but realize that during this time of high memory consumption
> caused by backfilling your system is also under strain from objects moving
> in and out, so as per the high-density thread you will want all your dentry
> and other important SLAB objects to stay in RAM.
>
> That's a lot of objects potentially with 8TB, so when choosing DIMMs pick
> ones that leave you with the option to go to 256GB later if need be.

Good point, I'll keep this in mind.

>
> Also you'll probably have loads of fun playing with CRUSH weights to keep
> the utilization of these 8TB OSDs within 100GB of each other.

I'm afraid that finding the "optimal" settings will demand a lot of
testing/playing (see the quick sketch below).

>
>>>
>>> Very large OSDs and high density storage nodes have other issues and
>>> challenges, tuning and memory wise.
>>> There are several threads about these recently, including today.
>>
>> Thanks, I'll study these...
>>
>>>> I'm wondering which of the two is good enough for a Ceph cluster with
>>>> 10 nodes using EC (6+3).
>>>>
>>> I would spend more time pondering the CPU power of these machines
>>> (EC needs more) and what cache tier to get.
>>
>> We are planning to equip the OSD nodes with 2x 2650v4 CPUs (24 cores @
>> 2.2GHz), that is 1 core/OSD. For the cache tier each OSD node gets two
>> 800GB NVMes. We hope this setup will give reasonable performance with
>> EC.
>>
> So you have actually 26 OSDs per node then.
> I'd say the CPUs are fine, but EC and the NVMes will eat a fair share of
> it.

You're right, it is 26 OSDs, but I still assume that with these CPUs we
will not be completely underpowered.

> That's why I prefer to have dedicated cache tier nodes with fewer but
> faster cores, unless the cluster is going to be very large.
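Coming back to the memory rule of thumb and the CRUSH weights for a moment,
here is the quick sketch I mentioned above: a back-of-the-envelope check of
the two recommendations plus the knobs I expect to experiment with on our
24x 8TB nodes. It is only a sketch; the sysctl value, the OSD id and the
weight are placeholders of mine, not tested or recommended values.

# RAM rules of thumb for one node with 24x 8TB OSDs:
#   ceph.com docs [1]:       1 GB per TB  -> 24 * 8 GB = 192 GB
#   RH/Supermicro paper [2]: 2 GB per OSD -> 24 * 2 GB =  48 GB
# Our 128 GB sits between the two; per the discussion above it should do,
# with DIMMs chosen so we can still grow to 256 GB later.

# Keep dentries/inodes (SLAB) cached instead of reclaiming them early
# during backfill; 10 is only a starting point to test with.
sysctl -w vm.vfs_cache_pressure=10

# Watch per-OSD fill levels and nudge outliers via their CRUSH weight
# (osd.12 and the weight 7.0 are made-up examples).
ceph osd df tree
ceph osd crush reweight osd.12 7.0

# Alternatively let Ceph lower the override reweight of the fullest OSDs
# (threshold is in percent of the average utilization).
ceph osd reweight-by-utilization 110

Since reweight-by-utilization touches the override reweight rather than the
CRUSH weight, I'd probably try the manual reweighting first.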
> With Hammer an 800GB DC S3610 SSD based OSD can easily saturate an
> "E5-2623 v3" core @3.3GHz (nearly 2 cores to be precise) and Jewel has
> optimizations that will both make it faster by itself AND enable it to
> use more CPU resources as well.
>

That's probably the best solution, but it will not fit our budget and
rackspace limits for the first setup. However, when expanding later on it
will definitely be something to consider, also depending on the performance
we obtain with this first setup.

> The NVMes (DC P3700 one presumes?) just for cache tiering, no SSD
> journals for the OSDs?

For now we have an offer for HPE 800GB NVMe MU (mixed use), 880MB/s write,
2600MB/s read, 3 DW/D. So they are as fast as the DC P3700; we will probably
also check other options.

> What are your network plans then, as in is your node storage bandwidth a
> good match for your network bandwidth?
>

For the network we will have 2x 10GBit bonded for the cluster-internal
traffic and 2x 10GBit bonded towards the clients, plus 1GBit for
administration.

>>> That is, if performance is a requirement in your use case.
>>
>> Always, who wouldn't care about performance? :-)
>>
> "Good enough" sometimes really is good enough.
>
> Since you're going for 8TB OSDs, EC and 10 nodes it feels that for you
> space is important, so something like archival, not RBD images for high
> performance VMs.
>
> What is your use case?

You're right, space is most important. Our use case is not serving RBD for
VMs. We will mainly store genomic data on CephFS volumes and access it from
a computing cluster for analysis. This computing cluster is not very large
right now (it will grow); it consists of 6 nodes and 288 cores.

Dietmar
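P.S.: For completeness, this is roughly how I currently picture the EC pool
and the NVMe cache tier in front of it for the CephFS data. It is only a
sketch with made-up pool names, PG counts and thresholds; the cache pool
would additionally need its own CRUSH rule so that it only lands on the
NVMe OSDs, and on Hammer/Jewel the failure-domain option is spelled
ruleset-failure-domain.

# EC profile matching the planned 6+3 layout, one shard per host
ceph osd erasure-code-profile set ec63 k=6 m=3 ruleset-failure-domain=host

# Backing EC pool for the CephFS data (PG counts are placeholders)
ceph osd pool create cephfs_data 2048 2048 erasure ec63

# Replicated pool on the NVMes as a writeback cache tier in front of it
# (needs a CRUSH rule restricted to the NVMe OSDs, omitted here)
ceph osd pool create cephfs_cache 512 512 replicated
ceph osd tier add cephfs_data cephfs_cache
ceph osd tier cache-mode cephfs_cache writeback
ceph osd tier set-overlay cephfs_data cephfs_cache

# Basic hit-set and sizing knobs for the cache pool (values are guesses)
ceph osd pool set cephfs_cache hit_set_type bloom
ceph osd pool set cephfs_cache target_max_bytes 4000000000000
ceph osd pool set cephfs_cache cache_target_dirty_ratio 0.4
ceph osd pool set cephfs_cache cache_target_full_ratio 0.8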