Re: dense storage nodes

Benjeman Meekhof <bmeekhof@xxxxxxxxx> · Wed, 18 May 2016 10:36:33 -0400

We're in the process of tuning a cluster that currently consists of 3
dense nodes, with more to be added.  The storage nodes have this spec:
- Dell R730xd 2 x Xeon E5-2650 v3 @ 2.30GHz (20 phys cores)
- 384 GB RAM
- 60 x 8TB HGST HUH728080AL5204 in MD3060e enclosure attached via 2 x
LSI 9207-8e SAS 6Gbps
- XFS filesystem on OSD data devs
- 4 x 400GB NVMe arranged into 2 mdraid devices for journals (30 per
raid-1 device)
- 2 x 25Gb Mellanox ConnectX-4 Lx dual port (4 x 25Gb)
- Jewel release
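
As a rough sanity check on the spec above, the per-OSD ratios can be
worked out directly. This is only a sketch: it assumes one OSD per data
drive and that each raid-1 journal device offers the capacity of a single
400GB NVMe. The variable names are illustrative.

```python
# Per-OSD resource ratios for the node spec listed above.
# Assumptions (not stated in the original post): one OSD per data drive,
# and each raid-1 journal device has the usable capacity of one 400 GB NVMe.
osds_per_node = 60
drive_tb = 8
ram_gb = 384
phys_cores = 20
journal_devices = 2
nvme_gb = 400

raw_tb_per_node = osds_per_node * drive_tb
ram_gb_per_osd = ram_gb / osds_per_node
cores_per_osd = phys_cores / osds_per_node
journals_per_device = osds_per_node // journal_devices
# Upper bound on journal partition size if a device is split evenly.
max_journal_gb = nvme_gb / journals_per_device

print(f"raw capacity per node : {raw_tb_per_node} TB")
print(f"RAM per OSD           : {ram_gb_per_osd:.1f} GB")
print(f"physical cores per OSD: {cores_per_osd:.2f}")
print(f"journals per device   : {journals_per_device}")
print(f"max journal size      : {max_journal_gb:.1f} GB")
```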

I don't have much tuning advice to add; I'm reading this thread to pick
some up and to learn what problems to expect.  We've done a little network
tuning based on Mellanox recommendations but nothing specific to Ceph
(in fact we just use the scripts that come with the Mellanox driver
packages).  We haven't hit any major issues so far in trying out RBD
and RGW, but we haven't taxed anything yet.

thanks,
Ben

On Wed, May 18, 2016 at 9:14 AM, Brian Felton <bjfelton@xxxxxxxxx> wrote:
> At my current gig, we are running five (soon to be six) pure object storage
> clusters in production with the following specs:
>
>  - 9 nodes
>  - 32 cores, 256 GB RAM per node
>  - 72 6 TB SAS spinners per node (648 total per cluster)
>  - 7,2 erasure coded pool for RGW buckets
>  - ZFS as the filesystem on the OSDs with collocated journals
>  - Hammer release with rgw patches
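
For readers sizing a similar cluster, the raw-versus-usable arithmetic for
the spec above can be sketched as follows. It assumes "7,2" means k=7 data
chunks and m=2 coding chunks, and it ignores filesystem overhead and
full-ratio headroom. The function name is made up for illustration.

```python
def ec_usable_fraction(k: int, m: int) -> float:
    """Fraction of raw capacity that holds user data in a k+m EC pool."""
    return k / (k + m)


# Figures from the spec above: 9 nodes x 72 drives x 6 TB.
nodes = 9
drives_per_node = 72
drive_tb = 6

total_drives = nodes * drives_per_node
raw_tb = total_drives * drive_tb
# Assumption: "7,2" is k=7, m=2.
usable_tb = raw_tb * ec_usable_fraction(7, 2)

print(f"drives per cluster: {total_drives}")
print(f"raw capacity      : {raw_tb} TB")
print(f"usable (approx.)  : {usable_tb:.0f} TB")
```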
>
> We are currently storing a few hundred TB of data across several hundred
> MObjects.
>
> We have hit the following issues:
>
>  - Filestore merge splits occur at ~40 MObjects with default settings.  This
> is a really, really bad couple of days while things settle.
>  - Realizing that, with erasure coding, scrubs have the same impact as deep
> scrubs
>  - Scrubs causing a slew of blocked/slow requests and stale pgs
>  - A handful of RGW issues
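
On the first item: FileStore splits a PG subdirectory once it holds more
than filestore_split_multiple * abs(filestore_merge_threshold) * 16 files.
The sketch below works out that limit and roughly where cluster-wide
splits would cluster. It assumes objects hash evenly across PGs and that
each split fans out into 16 child directories. The pg_num in the example
is hypothetical and not taken from this thread.

```python
def split_threshold(merge_threshold: int = 10, split_multiple: int = 2) -> int:
    """Max files in one FileStore subdirectory before it splits.

    Defaults are the stock filestore_merge_threshold (10) and
    filestore_split_multiple (2).
    """
    return split_multiple * abs(merge_threshold) * 16


def cluster_objects_at_split(pg_num: int, level: int,
                             merge_threshold: int = 10,
                             split_multiple: int = 2) -> int:
    """Approximate cluster-wide object count at the level-N split.

    Assumes a perfectly even spread of objects over PGs and a 16-way
    fan-out per split level; real clusters will only be near this.
    """
    per_pg = split_threshold(merge_threshold, split_multiple) * 16 ** level
    return pg_num * per_pg


if __name__ == "__main__":
    hypothetical_pg_num = 8192  # HYPOTHETICAL, not from the thread
    print("files per subdirectory before split:", split_threshold())
    for level in range(3):
        count = cluster_objects_at_split(hypothetical_pg_num, level)
        print(f"level {level} splits near {count / 1e6:.1f} M objects")
```

Raising either setting moves every split level out by the same factor,
which is why tuning the thresholds up front defers the painful period.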
>
> As utilization has grown, the performance impact of scrubbing has become
> much more noticeable, to the point that we've had to hand-roll software to
> manage the scrubs and keep them at a very reduced rate.  SSD journals are
> your friends, folks.  Don't skimp.  We are in the process of retrofitting
> these clusters with SSD journals to help speed things up.  We are also
> evaluating BlueStore in Jewel to see how it compares to LevelDB as well as
> different node configurations (less dense and with SSD journals).
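
A hand-rolled scrub throttle of the kind described above might look
roughly like this. It is not Brian's software, only a sketch of the idea:
pick the few PGs whose last deep scrub is oldest and issue them one small
batch at a time. The helper names and sample data are hypothetical, and
the only real CLI it relies on is `ceph pg deep-scrub <pgid>`. It assumes
automatic scrubbing has been restricted by other means.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def pick_pgs_to_scrub(last_scrub: Dict[str, datetime], now: datetime,
                      min_age: timedelta, batch: int) -> List[str]:
    """Return up to `batch` PG ids, oldest deep scrub first.

    Only PGs whose last deep scrub is at least `min_age` old are eligible.
    """
    due = [(stamp, pgid) for pgid, stamp in last_scrub.items()
           if now - stamp >= min_age]
    due.sort()
    return [pgid for _, pgid in due[:batch]]


def scrub_command(pgid: str) -> List[str]:
    """Build the argv for deep-scrubbing one PG."""
    return ["ceph", "pg", "deep-scrub", pgid]


# Made-up sample data; a real tool would read the stamps from the cluster.
now = datetime(2016, 5, 18)
stamps = {
    "11.0": now - timedelta(days=30),
    "11.1": now - timedelta(days=2),
    "11.2": now - timedelta(days=45),
}
chosen = pick_pgs_to_scrub(stamps, now, timedelta(days=14), batch=1)
for pgid in chosen:
    # Print instead of executing, so the sketch is safe to run anywhere.
    print(" ".join(scrub_command(pgid)))
```

A real scheduler would run each command (for example via subprocess) and
wait for the batch to finish before starting the next one.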
>
> Brian
>
> On Wed, May 18, 2016 at 12:54 AM, Blair Bethwaite
> <blair.bethwaite@xxxxxxxxx> wrote:
>>
>> Hi all,
>>
>> What are the densest node configs out there, and what are your
>> experiences with them and tuning required to make them work? If we can
>> gather enough info here then I'll volunteer to propose some upstream
>> docs covering this.
>>
>> At Monash we currently have some 32-OSD nodes (running RHEL7). 8 of
>> those OSDs are not storing or doing much yet (they are in a quiet EC'd
>> RGW pool); the other 24 OSDs, on 4TB drives, are serving RBD and are
>> perhaps 65% full on average.
>>
>> Aside from the already documented pid_max increases that are typically
>> necessary just to start all OSDs, we've also had to up
>> nf_conntrack_max. We've hit issues (twice now) that seem to be related
>> to kernel dentry slab cache exhaustion, though we have not yet figured
>> out how to confirm this. Symptoms were a major slowdown in performance
>> and slow requests all over the place on writes; watching OSD iostat
>> would show a single drive hitting 90+% util for ~15s with a bunch of
>> small reads and no writes. These issues were worked around by
>> tuning up filestore split and merge thresholds, though if we'd known
>> about this earlier we'd probably have just bumped up the default
>> object size so that we simply had fewer objects (and/or rounded up the
>> PG count to the next power of 2). We also set vfs_cache_pressure to 1,
>> though this didn't really seem to do much at the time. I've also seen
>> recommendations about setting min_free_kbytes to something higher
>> (currently 90112 on our hardware) but have not verified this.
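
To compare the kernel settings named above across nodes, a small read-only
check such as the following may help. It is a sketch that assumes a Linux
host exposing sysctls under /proc/sys. The key net.netfilter.nf_conntrack_max
only appears when the conntrack module is loaded, so a missing key is
reported rather than treated as an error. The helper names are illustrative.

```python
from pathlib import Path
from typing import Optional

# The kernel settings mentioned in the paragraph above.
SYSCTLS = [
    "kernel.pid_max",
    "net.netfilter.nf_conntrack_max",
    "vm.vfs_cache_pressure",
    "vm.min_free_kbytes",
]


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys") -> Optional[int]:
    """Return the integer value of a sysctl, or None if unavailable."""
    try:
        return int(sysctl_path(name, root).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    for name in SYSCTLS:
        value = read_sysctl(name)
        shown = "not available" if value is None else value
        print(f"{name:32} {shown}")
```

The script only reads values; changes would still be made with sysctl or
a file under /etc/sysctl.d in the usual way.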
>>
>> --
>> Cheers,
>> ~Blairo
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com