Re: dense storage nodes

> On 18 May 2016 at 7:54, Blair Bethwaite <blair.bethwaite@xxxxxxxxx> wrote:
> 
> 
> Hi all,
> 
> What are the densest node configs out there, and what are your
> experiences with them and tuning required to make them work? If we can
> gather enough info here then I'll volunteer to propose some upstream
> docs covering this.
> 
> At Monash we currently have some 32-OSD nodes (running RHEL7), though
> 8 of those OSDs are not storing or doing much yet (in a quiet EC'd RGW
> pool), the other 24 OSDs are serving RBD and at perhaps 65% full on
> average - these are 4TB drives.
> 

I worked on a cluster with 256 OSDs per node (~2500 OSDs in total), and that didn't work out as hoped.

I got into this project when the hardware had already been ordered; it wouldn't have been my choice.

> Aside from the already documented pid_max increases that are typically
> necessary just to start all OSDs, we've also had to up
> nf_conntrack_max. We've hit issues (twice now) that seem (have not

Why enable connection tracking at all? It only slows down Ceph traffic.
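
If you can't unload conntrack entirely, one way to at least bypass it for Ceph traffic is a NOTRACK rule in the raw table. The port ranges below are the Ceph defaults (6789 for the mon, 6800-7300 for OSDs) and are purely illustrative:

    # skip connection tracking for Ceph traffic
    iptables -t raw -A PREROUTING -p tcp --dport 6789 -j NOTRACK
    iptables -t raw -A PREROUTING -p tcp --dport 6800:7300 -j NOTRACK
    iptables -t raw -A OUTPUT -p tcp --sport 6789 -j NOTRACK
    iptables -t raw -A OUTPUT -p tcp --sport 6800:7300 -j NOTRACK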

> figured out exactly how to confirm this yet) to be related to kernel
> dentry slab cache exhaustion - symptoms were a major slow down in
> performance and slow requests all over the place on writes, watching
> OSD iostat would show a single drive hitting 90+% util for ~15s with a
> bunch of small reads and no writes. These issues were worked around by
> tuning up filestore split and merge thresholds, though if we'd known
> about this earlier we'd probably have just bumped up the default
> object size so that we simply had fewer objects (and/or rounded up the
> PG count to the next power of 2). We also set vfs_cache_pressure to 1,
> though this didn't really seem to do much at the time. I've also seen
> recommendations about setting min_free_kbytes to something higher
> (currently 90112 on our hardware) but have not verified this.
> 
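For reference, the tunings described above map roughly onto the following sysctls and ceph.conf options. The values are placeholders to show where each knob lives, not recommendations; size them for your own hardware:

    # /etc/sysctl.d/90-ceph-dense.conf (example values only)
    kernel.pid_max = 4194303
    net.netfilter.nf_conntrack_max = 1048576    # only relevant if conntrack stays loaded
    vm.vfs_cache_pressure = 1                   # keep dentries/inodes cached longer
    vm.min_free_kbytes = 262144                 # unverified, per the above

    # ceph.conf, [osd] section: delay filestore directory splitting
    filestore merge threshold = 40
    filestore split multiple = 8
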

I eventually ended up doing NUMA pinning of OSDs and increasing pid_max, but those were most of the changes I made.
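
A minimal sketch of what such pinning can look like (the OSD id and NUMA node below are just examples, not the actual setup): wrap the daemon with numactl, or put the equivalent into a systemd drop-in for ceph-osd@.service:

    # pin OSD 12's CPU scheduling and memory allocation to NUMA node 0
    numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -f --cluster ceph --id 12 --setuser ceph --setgroup ceph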

The network didn't really need that much attention to make this work.

Wido

> -- 
> Cheers,
> ~Blairo
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


