Re: Ceph cluster on AMD based system.

On Tue, 5 Mar 2019 10:39:14 -0600 Mark Nelson wrote:

> On 3/5/19 10:20 AM, Darius Kasparavičius wrote:
> > Thank you for your response.
> >
> > I was planning to use a 100GbE or 45GbE bond for this cluster. It was
> > acceptable for our use case to lose sequential/larger I/O speed for
> > it.  Dual socket would be an option, but I do not want to touch NUMA,
> > cgroups and the rest of those settings. Most of the time it is just
> > easier to add a higher-clocked CPU or more cores. The plan is currently
> > for 2x OSDs per NVMe device, but if testing shows that it's better to
> > use one, we will stick with one. Which RocksDB settings would you
> > recommend tweaking? I haven't had the chance to test them yet. Most of
> > the clusters I have access to are using LevelDB and are still running
> > filestore.  
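
(On the 2x OSDs per NVMe point: recent ceph-volume releases can carve the
drives up for you, so the split is cheap to set up and test. A minimal
sketch only, assuming a reasonably current ceph-volume and /dev/nvme0n1
and /dev/nvme1n1 as example device names:

  # preview the proposed layout without touching the disks
  ceph-volume lvm batch --report --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1

  # create two OSDs per device (the logical volumes are created for you)
  ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1

Going back to one OSD per device later means destroying and redeploying
those OSDs, so it is worth settling this in testing before the cluster
holds real data.)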
> 
> 
> Yeah, NUMA makes everything more complicated.  I'd just consider jumping 
> up to the 7601 then if IOPS is a concern, knowing that you might still 
> be CPU bound (though it's also possible you could hit some other 
> bottleneck before it becomes an issue).  Given that the cores aren't 
> clocked super high, it's possible that you might see a benefit to 2x 
> OSDs/device.
> 
With EPYC CPUs and their rather studly interconnect, NUMA feels like less
of an issue than with previous generations.
Of course pinning would still be beneficial.

That said, avoiding it altogether if you can (afford it) is of course the
easiest thing to do.
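
If pinning does turn out to be worth it, one low-effort way to do it (a
sketch only, assuming the stock ceph-osd systemd unit and two NUMA nodes;
OSD 12 is just an example ID) is a drop-in that wraps the daemon in
numactl:

  # systemctl edit ceph-osd@12
  [Service]
  ExecStart=
  ExecStart=/usr/bin/numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph

The empty ExecStart= clears the packaged command line before the wrapped
one replaces it; ideally each OSD ends up pinned to the node its NVMe
device and NIC hang off.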

Christian

> 
> RocksDB is tough.  Right now we are heavily tuned to favor reducing 
> write amplification, but we eat CPU to do it.  That can help performance 
> when write throughput is a bottleneck and also reduces wear on the drive 
> (which is always good, but especially with low-write-endurance drives).  
> Reducing the size of the WAL buffers will (probably) reduce CPU usage 
> and also reduce the amount of memory used by the OSD, but we've observed 
> higher write amplification on our test nodes.  I suspect that might be a 
> worthwhile trade-off for NVDIMMs or Optane, but I'm not sure it's a good 
> idea for typical NVMe drives.
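
For anyone who wants to experiment with that, the knob lives in
bluestore_rocksdb_options. A rough sketch only (the default string varies
by release, and the 64MB write_buffer_size below is purely illustrative,
not a recommendation):

  [osd]
  # smaller memtables/WAL buffers: less CPU and RAM per OSD, but
  # potentially more write amplification, as described above
  bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=67108864,writable_file_max_buffer_size=0,compaction_readahead_size=2097152

Note that setting this option replaces the entire default string rather
than merging with it, so start from the defaults of the release you are
actually running.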
> 
> 
> Mark
> 
> 
> >
> > On Tue, Mar 5, 2019 at 5:35 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:  
> >> Hi,
> >>
> >>
> >> I've got a Ryzen 7 1700 box that I regularly run tests on along with the
> >> upstream community performance test nodes that have Intel Xeon E5-2650v3
> >> processors in them.  The Ryzen is 3.0GHz/3.7GHz turbo while the Xeons
> >> are 2.3GHz/3.0GHz.  The Xeons are quite a bit faster clock/clock in the
> >> tests I've done with Ceph. Typically I see a single OSD using fewer
> >> cores on the Xeon processors vs Ryzen to hit similar performance numbers
> >> despite being clocked lower (though I haven't verified the turbo
> >> frequencies of both under load).  On the other hand, the Ryzen processor
> >> is significantly cheaper per core.  If you only looked at cores you'd
> >> think something like Ryzen would be the way to go, but there are other
> >> things to consider.  The number of PCIE lanes, memory configuration,
> >> cache configuration, and CPU interconnect (in multi-socket
> >> configurations) all start becoming really important if you are targeting
> >> multiple NVMe drives like what you are talking about below.  The EPYC
> >> processors give you more of all of that, but also cost a lot more than
> >> Ryzen.  Ultimately the CPU is only a small part of the price for nodes
> >> like this, so I wouldn't skimp if your goal is to maximize IOPS.
> >>
> >>
> >> With 10 NVMe drives per node, I'm guessing that a single EPYC 7451 is
> >> going to be CPU bound for small IO workloads (2.4c/4.8t per OSD), but
> >> will be network bound for large IO workloads unless you are sticking
> >> 2x100GbE in.  You might want to consider jumping up to the 7601.  That
> >> would get you closer to where you want to be for 10 NVMe drives
> >> (3.2c/6.4t per OSD).  Another option might be dual 7351s in this chassis:
> >>
> >> https://www.supermicro.com/Aplus/system/1U/1123/AS-1123US-TN10RT.cfm
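
(Those per-OSD figures presumably just divide each part's core/thread
count across the 10 OSDs: the 7451 is 24c/48t, hence 2.4c/4.8t per OSD;
the 7601 is 32c/64t, hence 3.2c/6.4t; dual 7351s also come to 32c/64t in
total.)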
> >>
> >>
> >> Figure that with sufficient client parallelism/load you'll get about
> >> 3000-6000 read IOPS/core and about 1500-3000 write IOPS/core (before
> >> replication), with OSDs typically topping out at about 6-8 cores each.
> >> Doubling up OSDs on each NVMe drive might improve or hurt performance
> >> depending on what the limitations are (typically it seems to help most
> >> when the kv sync thread is the primary bottleneck in BlueStore, which
> >> most likely happens with tons of slow cores and very fast NVMe drives).
> >> Those are all very rough, hand-wavy numbers and depend on a huge
> >> variety of factors, so take them with a grain of salt.  Doing things
> >> like disabling authentication, disabling logging, forcing
> >> high-performance P/C states, tweaking the RocksDB WAL and compaction
> >> settings, adjusting the number of OSD shards/threads, and tuning the
> >> system NUMA configuration might get you higher performance/core,
> >> though it's all pretty hard to predict without outright testing it.
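
To make a few of those knobs concrete, here is a rough benchmarking
sketch; the values are examples only, and turning off cephx and logging
is obviously only sane on a throwaway test cluster:

  [global]
  # benchmark-only: drop cephx and most logging overhead
  auth_cluster_required = none
  auth_service_required = none
  auth_client_required = none
  debug_osd = 0/0
  debug_bluestore = 0/0
  debug_rocksdb = 0/0
  debug_ms = 0/0

  [osd]
  # op queue sharding (defaults differ per release and device class)
  osd_op_num_shards = 8
  osd_op_num_threads_per_shard = 2

and on each node, to keep the cores in high-performance P/C states:

  cpupower frequency-set -g performance
  cpupower idle-set -D 0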
> >>
> >>
> >> Though you didn't ask about it, probably the most important thing you
> >> can spend money on with NVMe drives is getting high write endurance
> >> (DWPD) if you expect even a moderately high write workload.
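
On the endurance point, it is also cheap to keep an eye on the wear you
actually accumulate once the cluster is live, e.g. with nvme-cli (device
name is an example):

  nvme smart-log /dev/nvme0 | egrep 'percentage_used|data_units_written'

percentage_used is the drive's own estimate of consumed endurance, which
makes it easy to check whether the DWPD rating you paid for matches the
real write load.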
> >>
> >>
> >> Mark
> >>
> >>
> >> On 3/5/19 3:49 AM, Darius Kasparavičius wrote:  
> >>> Hello,
> >>>
> >>>
> >>> I was thinking of using an AMD-based system for my new NVMe-based
> >>> cluster. In particular I'm looking at
> >>> https://www.supermicro.com/Aplus/system/1U/1113/AS-1113S-WN10RT.cfm
> >>> and https://www.amd.com/en/products/cpu/amd-epyc-7451 CPUs. Has
> >>> anyone tried running it on this particular hardware?
> >>>
> >>> The general idea is 6 nodes with 10 NVMe drives and 2 OSDs per NVMe drive.


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



