Re: dense storage nodes

Hi Blair,

We use 36-OSD nodes with journals co-located on the HDDs, in a cluster that is roughly 90% object storage.
The storage nodes with 4 TB SAS drives have 128 GB RAM and 40 cores (with HT); the nodes with 6 TB SAS drives have 256 GB RAM and 48 cores.
We use 2x10 Gb bonded links for the client network and another 2x10 Gb bonded for the replication traffic. The drives are 7.2K RPM, 12 Gb/s SAS, connected to LSI 9300-8i 12 Gb/s HBAs.

We increased the read-ahead on the drives to 8192, and we use 64 MB for rgw_obj_stripe_size because of our specific workload.
We have MTU 9000 set on the interfaces and use "bond-xmit_hash_policy layer3+4" to better distribute the traffic across the physical links.
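For anyone wanting to replicate this, the settings map to roughly the following. These are configuration sketches only, not a runnable script: the device and interface names (/dev/sdX, bond0/bond1) are placeholders, and whether your distro takes the hash policy as "bond-xmit-hash-policy" in the interfaces file or as the bonding module's xmit_hash_policy option will vary.

```shell
# Per-drive read-ahead; blockdev --setra takes 512-byte sectors,
# so 8192 sectors = 4 MB of read-ahead. /dev/sdX is a placeholder.
blockdev --setra 8192 /dev/sdX

# Jumbo frames on both bonds (client and replication networks)
ip link set dev bond0 mtu 9000
ip link set dev bond1 mtu 9000
```

And the stripe size in ceph.conf (the option takes bytes, so 64 MB = 67108864):

```ini
# ceph.conf fragment - 64 MB RGW object stripe size
[client]
rgw_obj_stripe_size = 67108864
```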

The cluster is now at 3.2 PB raw (50% used) and performs really well, with no resources being strained.

Cheers,
George



On Wed, May 18, 2016 at 1:54 AM, Blair Bethwaite <blair.bethwaite@xxxxxxxxx> wrote:
Hi all,

What are the densest node configs out there, and what are your
experiences with them and tuning required to make them work? If we can
gather enough info here then I'll volunteer to propose some upstream
docs covering this.

At Monash we currently have some 32-OSD nodes (running RHEL7). 8 of
those OSDs are not storing or doing much yet (they sit in a quiet EC'd
RGW pool); the other 24 OSDs are serving RBD and are perhaps 65% full
on average - these are 4 TB drives.

Aside from the already documented pid_max increases that are typically
necessary just to start all the OSDs, we've also had to raise
nf_conntrack_max. We've twice now hit issues that seem to be related
to kernel dentry slab cache exhaustion (we have not figured out
exactly how to confirm this yet). The symptoms were a major slowdown
in performance and slow requests all over the place on writes;
watching OSD iostat would show a single drive hitting 90+% util for
~15 s with a bunch of small reads and no writes. We worked around
these issues by tuning up the filestore split and merge thresholds,
though if we'd known about this earlier we'd probably have just bumped
up the default object size so that we simply had fewer objects (and/or
rounded up the PG count to the next power of 2). We also set
vfs_cache_pressure to 1, though this didn't really seem to do much at
the time. I've also seen recommendations about setting min_free_kbytes
to something higher (currently 90112 on our hardware) but have not
verified this.
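For what it's worth, the tunables mentioned above would look something like this in practice. The values shown are illustrative, not the ones Monash used - the right numbers depend on your OSD count, workload, and memory:

```ini
# /etc/sysctl.d/90-ceph-dense.conf (example values)
kernel.pid_max = 4194303
net.netfilter.nf_conntrack_max = 262144
vm.vfs_cache_pressure = 1
vm.min_free_kbytes = 262144
```

```ini
# ceph.conf fragment - filestore split/merge thresholds (example values)
[osd]
filestore_merge_threshold = 40
filestore_split_multiple = 8
```

Note that with filestore, a PG's subdirectories split once they exceed roughly filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects, so raising these thresholds delays the splitting storms that can cause exactly the kind of slow-request incidents described above.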

--
Cheers,
~Blairo
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

