Re: Yet another hardware planning question ...

Hello,

On Thu, 13 Oct 2016 15:46:03 +0000 Patrik Martinsson wrote:

> On Thu, 2016-10-13 at 10:29 -0500, Brady Deetz wrote:
> > 6 SSDs per NVMe journal might leave your journal in contention. Can
> > you provide the specific models you will be using?
> 
> Well, according to Dell, the card is called "Dell 1.6TB, NVMe, Mixed
> Use Express Flash, PM1725", but the specs for the card are listed here:
> http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-PowerEdge-Express-Flash-NVMe-Mixed-Use-PCIe-SSD.pdf
>
That's a re-branded (barely, same model number) Samsung.
Neither that link nor the equivalent Samsung data sheet is what I would
consider professional, given their "up to" speeds.
Actual speed is usually a function of the design and the flash modules
used, which typically results in smaller drives being slower (less
parallelism).

Extrapolating from the 3.2TB model, we can assume that these cannot write
more than about 2GB/s.

If your 40Gb/s network is single-ported or active/standby (you didn't
mention), then this is fine, as 2 of these journal NVMes would be a
perfect match.
If it's dual-ported with MC-LAG, then you're wasting half of the potential
bandwidth.
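
A quick sketch of that match, assuming the ~2GB/s per-NVMe write speed
extrapolated above:

# Back-of-the-envelope: network vs. journal write bandwidth per node.
# The 2 GB/s per-NVMe figure is an extrapolation, not a measured number.
nvme_write_gb_s = 2.0
nvmes_per_node = 2

single_port_gb_s = 40 / 8.0        # one 40Gb/s link ~= 5 GB/s
mclag_gb_s = 2 * 40 / 8.0          # dual-ported MC-LAG ~= 10 GB/s
journal_gb_s = nvmes_per_node * nvme_write_gb_s   # ~= 4 GB/s

print("single port: %.1f GB/s net vs %.1f GB/s of journals"
      % (single_port_gb_s, journal_gb_s))
print("MC-LAG     : %.1f GB/s net vs %.1f GB/s of journals"
      % (mclag_gb_s, journal_gb_s))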

Also, these NVMes have a nice, feel-good 5 DWPD rating, for future
reference.

> Forgive me for my poor English here, but when you say "leave your
> journal in contention", what exactly do you mean by that ?
> 
He means that the combined write bandwidth of your SSDs will be larger
than that of your journal NVMes, limiting the top bandwidth your nodes
can write at to that of the journals.

In your case we're also missing the pertinent details about the SSDs.

An educated guess (size, 12Gb/s link, Samsung) makes them one of these:
http://www.samsung.com/semiconductor/products/flash-storage/enterprise-ssd/MZILS1T9HCHP?ia=832
http://www.samsung.com/semiconductor/global/file/media/PM853T.pdf

So at 750MB/s sequential writes, 3 of these can already deliver more than
one of your NVMe journals can absorb.
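
In numbers, for one journal NVMe sitting in front of 6 of these SSDs
(again using the assumed 750MB/s and ~2GB/s figures):

# Contention check for one journal NVMe fronting 6 OSD SSDs.
ssd_write_mb_s = 750      # assumed data-sheet sequential write per SSD
ssds_per_journal = 6
nvme_write_mb_s = 2000    # assumed sustained write of the journal NVMe

osd_side_mb_s = ssds_per_journal * ssd_write_mb_s   # 4500 MB/s
print("OSD SSDs: %d MB/s vs journal NVMe: %d MB/s"
      % (osd_side_mb_s, nvme_write_mb_s))
print("node is capped at %d MB/s by the journal"
      % min(osd_side_mb_s, nvme_write_mb_s))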

However, the 1 DWPD rating of these SSDs (the PDF is more detailed and
gives us a scary 0.3 DWPD for small I/Os) would definitely stop me from
considering them.
Unless you can quantify your write volume with certainty and it's below
the level these SSDs can support, go for something safer, at least 3 DWPD.

Quick estimate:
24 SSDs (72 total, divided by the replication factor of 3) * 1.92TB * 0.3
DWPD (worst case) = 13.8TB/day
That's ignoring further overhead and write amplification by the FS
(journals) and Ceph itself.
So if your cluster sees less than 10TB of writes per day, you may at
least assume it won't kill those SSDs within months.

Your journal NVMes are incidentally a decent match endurance-wise at a
(much more predictable) 16TB/day.
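
For the record, here's that endurance math as a small sketch (the 0.3 and
5 DWPD figures come from the data sheets above; treating replication as a
straight divide-by-3 is my simplification):

nodes = 6
osd_ssds_per_node = 12
osd_ssd_tb = 1.92
osd_ssd_dwpd = 0.3        # worst case for small I/O per the PDF
replication = 3

total_ssds = nodes * osd_ssds_per_node          # 72 OSD SSDs
# every client write lands on 3 OSDs, so divide raw endurance by 3
client_tb_day = total_ssds / replication * osd_ssd_tb * osd_ssd_dwpd
print("OSD SSDs sustain ~%.1f TB/day of client writes" % client_tb_day)

nvme_tb, nvme_dwpd = 1.6, 5
print("each journal NVMe is rated for %.0f TB/day" % (nvme_tb * nvme_dwpd))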


The above is of course all about bandwidth (sequential writes), which is
important in certain use cases and during backfill/recovery actions.

Since your use case suggests more of a DB, smallish-transaction scenario,
that "waste" of bandwidth may be totally acceptable.
All my clusters certainly favor lower latency over higher bandwidth when
having to choose between the two.

It comes back to use case and write volume: those journal NVMes will help
keep latency low (for your DBs), so if that is paramount, go with that.

They do feel a bit wasted (1.6TB, of which you'll use 1-200MB at most),
though.
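
Even sizing the journal partitions generously with the usual FileStore
rule of thumb (twice the expected throughput times the max sync interval;
the per-OSD throughput below is just a placeholder), only a tiny slice of
each 1.6TB device is ever set aside:

# Ceph FileStore rule of thumb:
# journal size = 2 * expected throughput * filestore_max_sync_interval
osd_throughput_mb_s = 330          # placeholder: ~2 GB/s NVMe split 6 ways
filestore_max_sync_interval_s = 5  # Ceph default

journal_mb_per_osd = 2 * osd_throughput_mb_s * filestore_max_sync_interval_s
print("~%d MB of journal per OSD" % journal_mb_per_osd)
print("~%.1f GB used on a 1.6TB NVMe holding 6 journals"
      % (6 * journal_mb_per_osd / 1024.0))
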
Consider alternative designs where you have special pools for high
performance needs on NVMes and use 3+DWPD SSDs (journals inline) for the
rest.

Also, with SSDs I'd use the E5-2697A v4 CPU instead (faster base clock
and Turbo).

Christian

> Best regards, 
> Patrik Martinsson
> Sweden
> 
> 
> > On Oct 13, 2016 10:23 AM, "Patrik Martinsson"
> > <patrik.martinsson@xxxxxxxxxxxxx> wrote:
> > > Hello everyone, 
> > > 
> > > We are in the process of buying hardware for our first Ceph
> > > cluster. We will start with some testing and do some performance
> > > measurements to see that we are on the right track, and once we are
> > > satisfied with our setup we'll continue to grow it as time goes on.
> > > 
> > > Now, I'm just seeking some thoughts on our future hardware. I know
> > > there are a lot of these kinds of questions out there, so please
> > > forgive me for posting another one.
> > > 
> > > Details,
> > > - Cluster will be in the same datacenter, multiple racks as we grow
> > > - Typical workload (this is incredibly vague, forgive me again) would
> > > be an OpenStack environment hosting 150~200 VMs; we'll have quite a
> > > few databases for Jira/Confluence/etc., some workload coming from
> > > Stash/Bamboo agents, puppet master/foreman, and other typical "core
> > > infra stuff".
> > > 
> > > Given these prerequisites, going all-SSD (and NVMe for journals)
> > > may seem like overkill(?), but we feel we can afford it and it will
> > > be a benefit for us in the future.
> > > 
> > > Planned hardware, 
> > > 
> > > Six nodes to begin with (which would give us a cluster size of
> > > ~46TB with a default replica of three, although probably a bit
> > > bigger since the VMs would be backed by an erasure-coded pool) will
> > > look something like,
> > >  - 1x  Intel E5-2695 v4 2.1GHz, 45M Cache, 18 Cores
> > >  - 2x  Dell 64 GB RDIMM 2400MT
> > >  - 12x Dell 1.92TB Mix Use MLC 12Gbps (separate OS disks)
> > >  - 2x  Dell 1.6TB NVMe Mixed usage (6 OSDs per NVMe)
> > > 
> > > Network between all nodes within a rack will be 40Gbit (and 200Gbit
> > > between racks), backed by Juniper QFX5200-32C switches.
> > > 
> > > Rather than asking the question,
> > > - "Does this seem reasonable for our workload?",
> > > 
> > > I want to ask,
> > > - "Is there any reason *not* to have a setup like this? Are there
> > > any obvious bottlenecks or flaws that we are missing, or could this
> > > very well work as a good start (with the ability to grow by adding
> > > more servers)?"
> > > 
> > > When it comes to workload-wise issues, I think we'll just have to
> > > see and grow as we learn.
> > > 
> > > We'll be grateful for any input, thoughts, ideas, suggestions, you
> > > name it.
> > > 
> > > Best regards, 
> > > Patrik Martinsson,
> > > Sweden
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/