Re: dense storage nodes

Hello,

On Wed, 18 May 2016 15:54:59 +1000 Blair Bethwaite wrote:

> Hi all,
> 
> What are the densest node configs out there, and what are your
> experiences with them and tuning required to make them work? If we can
> gather enough info here then I'll volunteer to propose some upstream
> docs covering this.
>
I haven't done anything denser than 24 OSDs per node, but I did spend a lot of
time reading about this here and talking to people who have, as well as
pondering what I would do if the need arose in a future project.

I personally feel there is a point where scaling out (as in more OSDs) turns
from being beneficial (more IOPS/bandwidth) into diminishing returns at best.

Depending on your needs and resources (think lukewarm storage, or something
with a cache tier in front of it) you may fare better (and cheaper) with a
cluster that has a few dozen RAID-backed OSDs (RAID10 for IOPS, RAID60 for
space and cost) than with hundreds of individual OSDs and the CPU/RAM
resources they require.

For example, I don't think you could get sufficient CPU power (or the budget
for it ^o^) to run 60 OSDs in a chassis like this, especially with NVMe
journals:
https://www.supermicro.com.tw/products/system/4U/6048/SSG-6048R-E1CR60N.cfm
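
As a back-of-the-envelope sanity check (this assumes the commonly cited rule
of thumb of roughly one core or ~1GHz per HDD-backed filestore OSD, more once
the journals stop being the bottleneck):
---
# 60 OSDs x ~1 core per OSD  = ~60 cores for the OSD daemons alone,
# before scrubbing, recovery, XFS and the OS get their share.
# A typical dual-socket board of this vintage has fewer real cores than that,
# hence my doubts about CPU (and budget) for 60 OSDs per chassis.
---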

That said, let's get to what I've done tuning-wise.

> At Monash we currently have some 32-OSD nodes (running RHEL7), though
> 8 of those OSDs are not storing or doing much yet (in a quiet EC'd RGW
> pool), the other 24 OSDs are serving RBD and at perhaps 65% full on
> average - these are 4TB drives.
> 
> Aside from the already documented pid_max increases that are typically
> necessary just to start all OSDs, we've also had to up
> nf_conntrack_max. 

Running a local, stateful FW?

As Wido said, probably something to be avoided.
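
If you do have to keep conntrack around, it pays to watch how close you get
to the limit before connections start getting dropped. A minimal sketch
(standard netfilter sysctls; the value and the file name below are just
examples):
---
# How full is the conntrack table right now?
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# Raise the ceiling and persist it (example value):
echo 'net.netfilter.nf_conntrack_max = 1048576' > /etc/sysctl.d/conntrack.conf
sysctl -p /etc/sysctl.d/conntrack.conf
---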

> We've hit issues (twice now) that seem (have not
> figured out exactly how to confirm this yet) to be related to kernel
> dentry slab cache exhaustion - symptoms were a major slow down in
> performance and slow requests all over the place on writes, watching
> OSD iostat would show a single drive hitting 90+% util for ~15s with a
> bunch of small reads and no writes. These issues were worked around by
> tuning up filestore split and merge thresholds, though if we'd known
> about this earlier we'd probably have just bumped up the default
> object size so that we simply had fewer objects 

It would be interesting to have a matrix of sorts showing how "expensive" an
object is, as we've seen examples that they indeed are (expensive, that is).
However, bumping up the object size strikes me as risky performance-wise, even
more so when working with cache tiers.
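
For reference, the split/merge knobs Blair mentions live under [osd] in
ceph.conf; something like the below (values are purely illustrative, not a
recommendation, and they only affect directories as they get created or split
going forward):
---
[osd]
# postpone directory splitting so each PG directory holds more objects
filestore merge threshold = 40
filestore split multiple = 8
---
Also keep in mind that for RBD the object size is baked into the image at
creation time (the image "order", 22 = 4MB default), so "bumping it up" is
only an option for new images, not existing ones.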

> (and/or rounded up the
> PG count to the next power of 2). We also set vfs_cache_pressure to 1,
> though this didn't really seem to do much at the time. I've also seen
> recommendations about setting min_free_kbytes to something higher
> (currently 90112 on our hardware) but have not verified this.
> 
The latter will help in a number of situations, especially if your network
card wants some buffers and can't get them.
Mellanox Infiniband HCAs very much so (I use them everywhere), but it's been
known to help in other cases, too.
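
If you want to verify whether min_free_kbytes actually matters on your
hardware, the telltale signs are page allocation failures in the kernel log
and a fragmented buddy allocator; a quick look (plain procfs/dmesg, nothing
Ceph-specific):
---
# any higher-order allocation failures logged? (NIC ring buffers want these)
dmesg | grep -i "page allocation failure"
# free pages per order and zone; empty right-hand columns mean fragmentation
cat /proc/buddyinfo
---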

Here's my /etc/sysctl.d/tuning.conf for my storage nodes:
---
# Don't swap on these boxes
vm.swappiness = 1
vm.vfs_cache_pressure = 1
# Breathing space
vm.min_free_kbytes = 524288
# Room for a large cluster (mine isn't, consider 1048576)
kernel.pid_max = 65536
# And the connections that come with it (overkill for my cluster as well)
net.ipv4.ip_local_port_range = 11000 61000
#
# From here on it's cargo-culting, but well understood cargo-culting
# and gleaned from other Ceph users or similar use cases.
#
# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
net.core.rmem_max = 56623104
net.core.wmem_max = 56623104
net.core.rmem_default = 56623104
net.core.wmem_default = 56623104
net.core.optmem_max = 40960
# increase Linux autotuning TCP buffer limits 
# min, default, and max number of bytes to use
# (only change the 3rd value, and make it 16 MB or more)
net.ipv4.tcp_rmem = 65536 87380 56623104
net.ipv4.tcp_wmem = 65536 65536 56623104
# Make room for more TIME_WAIT sockets due to more clients,
# and allow them to be reused if we run out of sockets
# Also increase the max packet backlog
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 10
# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0
---
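
These get picked up at boot; to apply them to a running node without a
reboot, something along the lines of:
---
# load just this file
sysctl -p /etc/sysctl.d/tuning.conf
# or re-read everything under /etc/sysctl.d/ (and friends)
sysctl --system
---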

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


