Re: Some more numbers - CPU/Memory suggestions for OSDs and Monitors


 



On Wed, 22 Apr 2015 13:50:21 -0500 Mark Nelson wrote:

> 
> 
> On 04/22/2015 01:39 PM, Francois Lafont wrote:
> > Hi,
> >
> > Christian Balzer wrote:
> >
> >>> thanks for the feedback regarding the network questions. Currently I
> >>> try to solve the question of how much memory, cores and GHz for OSD
> >>> nodes and Monitors.
> >>>
> >>> My research so far:
> >>>
> >>> OSD nodes: 2 GB RAM, 2 GHz, 1 Core (?) per OSD
> >>>
> >> RAM is enough, but more helps (page cache on the storage node makes
> >> the reads of hot objects quite fast and prevents concurrent access to
> >> the disks).
> >
> > Personally, I have seen a different rule for the RAM: "1GB for each 1TB
> > of OSD daemons". That is what I understand from this doc:
> >
> >      http://ceph.com/docs/master/start/hardware-recommendations/#minimum-hardware-recommendations
> >
> > So, for instance, with (it's just a stupid example):
> >
> > - 4 OSD daemons of 6TB and
> > - 5 OSD daemons of 1TB
> >
> > The needed RAM would be:
> >
> >      1GB x (4 x 6) + 1GB x (5 x 1) = 29GB for the RAM
> >
> > Is it correct? Because if I follow the "2GB RAM per OSD" rule, I just
> > need:
> >
> >      2GB x 9 = 18GB.
> 
> I'm not sure who came up with the 1GB for each 1TB of OSD daemons rule, 
> but frankly I don't think it scales well at the extremes.  You can't get 
> by with 256MB of RAM for OSDs backed by 256GB SSDs, nor do you need 6GB 
> of RAM per OSD for 6TB spinning disks.
> 
> 2-4GB of RAM per OSD is reasonable depending on how much page cache you 
> need.  I wouldn't stray outside of that range myself.
> 
What Mark said.

To back this up with some real-world data: on my crappy test cluster I have
4 nodes with 4 OSDs (HDD only, 400GB but very little used), 4GB RAM and 4
cores each.
The OSD processes are around 500MB, but when I stress this cluster,
especially with backfill operations, not only do they grow, the lack of
sufficient memory and page cache also causes things to start swapping,
which of course makes everything even worse.

The largest OSD processes I've seen are around 1GB, for a 25TB OSD (yes,
that's right) which has about 2TB of data on it.
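
If you want to watch this on your own nodes, a quick-and-dirty script like
the one below (my own sketch, nothing official, assumes a Linux /proc
layout) totals the resident memory of the ceph-osd processes; run it during
a backfill and you can see the growth I'm talking about.

#!/usr/bin/env python
# Sum the resident memory (VmRSS) of all ceph-osd processes on this node.
# Quick sketch, Linux /proc only.
import os

total_kb = 0
for pid in filter(str.isdigit, os.listdir('/proc')):
    try:
        with open('/proc/%s/comm' % pid) as f:
            if f.read().strip() != 'ceph-osd':
                continue
        with open('/proc/%s/status' % pid) as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    total_kb += int(line.split()[1])  # value is in kB
    except IOError:
        pass  # process exited between listdir() and open()

print('ceph-osd total RSS: %.2f GB' % (total_kb / 1024.0 / 1024.0))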

All of this depends on your needs and budget as well as your configuration.
If you were using SSDs exclusively, the benefits of a larger page cache
wouldn't be as pronounced, but then again, if you can afford an SSD cluster
you probably have money for more memory as well. ^o^
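
Going back to Francois' example drive mix, the arithmetic for the competing
rules looks like this (just a back-of-the-envelope sketch of the numbers
already quoted above, nothing magic about it):

# RAM estimates for 4 x 6TB OSDs plus 5 x 1TB OSDs under the rules
# discussed in this thread. Illustrative only.
osd_sizes_tb = [6, 6, 6, 6, 1, 1, 1, 1, 1]   # one entry per OSD

ram_1gb_per_tb  = 1 * sum(osd_sizes_tb)      # "1GB per TB of OSD"       -> 29 GB
ram_2gb_per_osd = 2 * len(osd_sizes_tb)      # "2GB per OSD"             -> 18 GB
ram_4gb_per_osd = 4 * len(osd_sizes_tb)      # top of Mark's 2-4GB range -> 36 GB

print("1GB/TB rule : %d GB" % ram_1gb_per_tb)
print("2GB/OSD rule: %d GB" % ram_2gb_per_osd)
print("4GB/OSD rule: %d GB" % ram_4gb_per_osd)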

> >
> > Which rule is correct?
> >
> >> 1GHz or so per pure HDD-based OSD, at least 2GHz for HDD OSDs
> >> with SSD journals, and as much as you can afford for entirely SSD-based
> >> OSDs.
> >
> > Are there any links about the "at least 2GHz per OSD with SSD journal"
> > figure? I have never seen it anywhere except on this mailing list. For
> > instance, Inktank's "HARDWARE CONFIGURATION GUIDE" just indicates
> > "one GHz per OSD" (https://ceph.com/category/resources/).
> >
I've given a concrete example of small-write IOPS overwhelming eight
3.1GHz cores several times on this list; others have shown similar things
with SSD-based clusters.
 
> > Why should SSD journals increase the needed CPU?
> 
Because they make things _faster_ and now your cluster is capable of more
IOPS. 
But only if there is enough CPU power to actually handle all those
transactions.

Ceph has to execute an immense amount of code for a single I/O; with
HDD-only clusters the disks are slow enough for pretty much any CPU to keep
up. In the test cluster mentioned above, CPU is never the bottleneck.

> What it really comes down to is that your CPU needs to be fast enough to 
> process your workload.  Small IOs tend to be more CPU intensive than 
> large IOs.  Some processors have higher IPC than others so it's all just 
> kind of a vague guessing game.  With modern Intel XEON processors, 1GHz 
> of 1 core per OSD is a good general estimate.  If you are doing lots of small IO 
> with SSD backed OSDs you may need more.  If you are doing high 
> performance erasure coding you may need more.  If you have slow disks 
> with journals on disk, 3x replication, and a mostly read workload, you 
> may be able to get away with less.
> 
> As always, the recommendations above are just recommendations.  It's 
> best if you can test yourself.
> 
Well said.
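
To put the rules of thumb from this thread into one place, the kind of
per-node estimate I do looks roughly like the sketch below; the per-OSD GHz
figures are just the guidelines mentioned above (the all-SSD figure in
particular is my own guess, not a measurement):

# Per-node CPU budget from the rough per-OSD guidelines in this thread.
# These are rules of thumb, not substitutes for testing your own workload.
GHZ_PER_OSD = {
    'hdd':             1.0,   # pure HDD OSD
    'hdd+ssd_journal': 2.0,   # HDD OSD with SSD journal
    'ssd':             4.0,   # all-SSD OSD; really "as much as you can afford"
}

def node_cpu_ghz(num_osds, backend):
    """Total CPU to budget for one OSD node, in GHz summed over cores."""
    return num_osds * GHZ_PER_OSD[backend]

# Example: 12 HDD OSDs with SSD journals, on 2.5GHz cores
need_ghz = node_cpu_ghz(12, 'hdd+ssd_journal')   # 24 GHz
print("budget ~%.0f GHz, i.e. about %d cores at 2.5GHz"
      % (need_ghz, round(need_ghz / 2.5)))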

With one production cluster here the load is such that most of the time the
CPU cores are clocked down to half speed and the OSD processes are only
using 10% of that capacity. But during certain operations (like deleting
thousands of files in a VM backed by an RBD image) things get really busy.

You have to ask yourself (and test, if possible) what your cluster is
capable of (storage subsystem, network) and then match your CPUs so that
they don't become the bottleneck, or at least not to the point where all
the money you've spent on SSDs, controllers and fast network gear goes to
waste (all cores 100% busy, but the drives only 10-30% utilized).
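
In code, the sanity check I mean is nothing more than this (every number
below is a placeholder, and the IOPS-per-GHz figure in particular is
something you have to measure on your own hardware with fio or rados bench,
not take from me):

# Will the CPUs or the drives run out of steam first? All numbers are
# placeholders; iops_per_ghz especially must be measured, not assumed.
num_drives     = 12
iops_per_drive = 15000        # e.g. a decent SSD
cpu_ghz_total  = 8 * 3.1      # cores * clock
iops_per_ghz   = 800          # hypothetical; benchmark your own setup

drive_iops = num_drives * iops_per_drive
cpu_iops   = cpu_ghz_total * iops_per_ghz

if cpu_iops < drive_iops:
    print("CPU is the likely bottleneck: ~%d IOPS vs ~%d from the drives"
          % (cpu_iops, drive_iops))
else:
    print("Drives are the likely bottleneck: ~%d IOPS vs ~%d from the CPUs"
          % (drive_iops, cpu_iops))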

Christian
> 
> >


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



