Re: Yet another hardware planning question ...

Hello,

On Thu, 20 Oct 2016 07:56:55 +0000 Patrik Martinsson wrote:

> Hi Christian, 
> 
> Thanks for your very detailed and thorough explanation, very much
> appreciated. 
> 
You're welcome.

> We have definitely thought of a design where we have dedicated nvme-
> pools for 'high-performance' as you say. 
>
Having the whole cluster blazingly fast is of course a nice goal, but how
to achieve this without breaking the bank is always the trickier part.

In my (very specific) use case I was able to go from a totally overloaded
cluster to a very bored one just by adding a small cache-tier. That works
so well because the clients here are well known and under our control:
they're all identical and are happy as long as they can scribble away tiny
amounts of data within 5ms, a perfect fit for this.
 
> At the same time I *thought* that having the journal offloaded to
> another device *always* was the best solution 
>  - if you use mainly spinners, have the journals on ssd's
>  - if you mainly use ssd's, have journals on nvme's 
>
Quite so. If you have a large/unlimited budget, go for it.
 
> But that's not always the case I guess, and thanks for pointing that
> out. 
>
Again, matching things up in terms of speed (network vs journal vs OSD),
endurance and size is both involved and gets costly quickly.

Christian

> Best regards, 
> Patrik Martinsson 
> Sweden
> 
> 
> On fre, 2016-10-14 at 09:59 +0900, Christian Balzer wrote:
> > Hello,
> > 
> > On Thu, 13 Oct 2016 15:46:03 +0000 Patrik Martinsson wrote:
> > 
> > > 
> > > On tor, 2016-10-13 at 10:29 -0500, Brady Deetz wrote:
> > > > 
> > > > Six SSDs per NVMe journal might leave your journal in contention.
> > > > Can you provide the specific models you will be using?
> > > 
> > > Well, according to Dell, the card is called "Dell 1.6TB, NVMe, Mixed
> > > Use Express Flash, PM1725", but the specs for the card are listed here:
> > > http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-PowerEdge-Express-Flash-NVMe-Mixed-Use-PCIe-SSD.pdf
> > > 
> > That's a re-branded (barely, same model number) Samsung.
> > Both that link and the equivalent Samsung link are not what I would
> > consider professional, with their "up to" speeds. That is usually a
> > function of the design and flash modules used, typically resulting in
> > smaller drives being slower (less parallelism).
> > 
> > Extrapolating from the 3.2TB model, we can assume that these can not
> > write more than about 2GB/s.
> > 
> > If your 40Gb/s network is single-ported or active/standby (you didn't
> > mention), then this is fine, as 2 of these journal NVMes would be a
> > perfect match.
> > If it's dual-ported with MC-LAG, then you're wasting half of the
> > potential bandwidth.
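> > 
> > A rough back-of-the-envelope of that match in Python (the ~2GB/s per
> > NVMe is the extrapolated figure from above, not a measurement):
> > 
> >   # Journal bandwidth vs. network bandwidth per node (assumed figures).
> >   NVME_SEQ_WRITE_GB_S = 2.0        # extrapolated ~2GB/s per 1.6TB PM1725
> >   JOURNALS_PER_NODE = 2
> >   journal_bw = JOURNALS_PER_NODE * NVME_SEQ_WRITE_GB_S   # 4.0 GB/s
> > 
> >   for ports, label in [(1, "single-ported/active-standby"),
> >                        (2, "dual-ported MC-LAG")]:
> >       network_bw = ports * 40 / 8  # 40Gb/s per port -> GB/s
> >       print(label, network_bw, "GB/s network vs", journal_bw, "GB/s journals")
> >   # 1 port:  5.0 GB/s vs 4.0 GB/s -> roughly matched
> >   # 2 ports: 10.0 GB/s vs 4.0 GB/s -> the journals become the bottleneck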
> > 
> > Also these NVMes have a nice, feel good 5 DWPD, for future
> > reference. 
> > 
> > > 
> > > Forgive me for my poor English here, but when you say "leave your
> > > journal in contention", what exactly do you mean by that ?
> > > 
> > He means that the combined bandwidth of your SSDs will be larger than
> > that of your journal NVMes, limiting the top bandwidth your nodes can
> > write at to that of the journals.
> > 
> > In your case we're also missing any pertinent details about the SSDs.
> > 
> > An educated guess (size, 12Gb/s link, Samsung) makes them these:
> > http://www.samsung.com/semiconductor/products/flash-storage/enterprise-ssd/MZILS1T9HCHP?ia=832
> > http://www.samsung.com/semiconductor/global/file/media/PM853T.pdf
> > 
> > So at 750MB/s sequential writes, 3 of these can already handle more
> > than your NVMe.
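> > 
> > As a quick sketch, assuming the 750MB/s data-sheet figure and the
> > ~2GB/s journal estimate from above:
> > 
> >   # Aggregate SSD write bandwidth behind one journal NVMe (assumed figures).
> >   SSD_SEQ_WRITE_MB_S = 750       # data-sheet sequential write per SSD
> >   SSDS_PER_JOURNAL = 6
> >   NVME_SEQ_WRITE_MB_S = 2000     # ~2GB/s journal estimate from above
> > 
> >   print(3 * SSD_SEQ_WRITE_MB_S)                 # 2250 MB/s, already above the journal
> >   print(SSDS_PER_JOURNAL * SSD_SEQ_WRITE_MB_S)  # 4500 MB/s behind a 2000 MB/s journal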
> > 
> > However, the 1 DWPD of these SSDs (the PDF is more detailed and gives
> > us a scary 0.3 DWPD for small I/Os) would definitely stop me from
> > considering them.
> > Unless you can quantify your write volume with certainty and it's below
> > the level these SSDs can support, go for something safer, at least
> > 3 DWPD.
> > 
> > Quick estimate:
> > 24 SSDs (replication of 3) * 1.92TB * 0.3 (worst case) = 13.8TB/day
> > That's ignoring further overhead and write amplification by the FS
> > (journals) and Ceph itself.
> > So if your cluster sees less than 10TB writes/day, you may at least
> > assume it won't kill those SSDs within months.
> > 
> > Your journal NVMes are incidentally a decent match endurance-wise, at a
> > (much more predictable) 16TB/day.
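> > 
> > The same estimate as plain arithmetic, so you can plug in your own
> > numbers (the DWPD values are the data-sheet figures quoted above):
> > 
> >   # Worst-case daily write budget of the data SSDs (cluster-wide,
> >   # 72 SSDs at replication 3 -> 24 "effective" SSDs at 0.3 DWPD).
> >   ssd_tb_per_day = (72 / 3) * 1.92 * 0.3   # ~13.8 TB/day of client writes
> > 
> >   # Journal NVMes at 5 DWPD (the pair in front of one node's 12 OSDs).
> >   nvme_tb_per_day = 2 * 1.6 * 5            # 16 TB/day
> > 
> >   print(round(ssd_tb_per_day, 1), nvme_tb_per_day)   # 13.8 16.0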
> > 
> > 
> > The above is of course all about bandwidth (sequential writes), which
> > is important in certain use cases and during backfill/recovery actions.
> > 
> > Since your use case suggests more of a DB, smallish-data-transactions
> > scenario, that "waste" of bandwidth may be totally acceptable.
> > All my clusters certainly favor lower latency over higher bandwidth
> > when having to choose between the two.
> > 
> > It comes back to use case and write volume: those journal NVMes will
> > help keep latency low (for your DBs), so if that is paramount, go with
> > that.
> > 
> > They do feel a bit wasted (1.6TB, of which you'll use 1-200MB at most),
> > though.
> > Consider alternative designs where you have special pools for
> > high-performance needs on NVMes and use 3+ DWPD SSDs (journals inline)
> > for the rest.
> > 
> > Also, with SSDs I'd use the E5-2697A v4 CPU instead (faster base and
> > Turbo clocks).
> > 
> > Christian
> > 
> > > 
> > > Best regards, 
> > > Patrik Martinsson
> > > Sweden
> > > 
> > > 
> > > > 
> > > > On Oct 13, 2016 10:23 AM, "Patrik Martinsson"
> > > > <patrik.martinsson@xxxxxxxxxxxxx> wrote:
> > > > > 
> > > > > Hello everyone, 
> > > > > 
> > > > > We are in the process of buying hardware for our first Ceph
> > > > > cluster. We will start with some testing and do some performance
> > > > > measurements to see that we are on the right track, and once we
> > > > > are satisfied with our setup we'll continue to grow it over time.
> > > > > 
> > > > > Now, I'm just seeking some thoughts on our future hardware. I know
> > > > > there are a lot of these kinds of questions out there, so please
> > > > > forgive me for posting another one.
> > > > > 
> > > > > Details,
> > > > > - Cluster will be in the same datacenter, multiple racks as we grow
> > > > > - Typical workload (this is incredibly vague, forgive me again)
> > > > > would be an OpenStack environment hosting 150-200 VMs; we'll have
> > > > > quite a few databases for Jira/Confluence/etc., some workload
> > > > > coming from Stash/Bamboo agents, puppet master/foreman, and other
> > > > > typical "core infra stuff".
> > > > > 
> > > > > Given these prerequisites, going all SSDs (and NVMe for journals)
> > > > > may seem like overkill(?), but we feel we can afford it and it
> > > > > will be a benefit for us in the future.
> > > > > 
> > > > > Planned hardware, 
> > > > > 
> > > > > Six nodes to begin with, which would give us a cluster size of
> > > > > ~46TB with a default replica of three (although probably a bit
> > > > > bigger, since the VMs would be backed by an erasure-coded pool;
> > > > > rough math sketched after the list), will look something like,
> > > > >  - 1x  Intel E5-2695 v4 2.1GHz, 45M Cache, 18 Cores
> > > > >  - 2x  Dell 64 GB RDIMM 2400MT
> > > > >  - 12x Dell 1.92TB Mix Use MLC 12Gbps (separate OS disks) 
> > > > >  - 2x  Dell 1.6TB NVMe Mixed usage (6 osd's per NVME)
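> > > > > 
> > > > > (A rough sketch of where the ~46TB comes from; replica-3 usable
> > > > > capacity only, ignoring the erasure-coded pool and any overhead:)
> > > > > 
> > > > >   NODES, SSDS_PER_NODE, SSD_TB, REPLICAS = 6, 12, 1.92, 3
> > > > >   raw_tb = NODES * SSDS_PER_NODE * SSD_TB    # 138.24 TB raw
> > > > >   usable_tb = raw_tb / REPLICAS              # ~46 TB usable
> > > > >   print(raw_tb, round(usable_tb, 1))         # 138.24 46.1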
> > > > > 
> > > > > Network between all nodes within a rack will be 40Gbit (and
> > > > > 200Gbit between racks), backed by Juniper QFX5200-32C switches.
> > > > > 
> > > > > Rather than asking the question,
> > > > > - "Does this seem reasonable for our workload?",
> > > > > 
> > > > > I want to ask,
> > > > > - "Is there any reason *not* to have a setup like this? Are there
> > > > > any obvious bottlenecks or flaws that we are missing, or could
> > > > > this very well work as a good start (with the ability to grow by
> > > > > adding more servers)?"
> > > > > 
> > > > > When it comes to workload-related issues, I think we'll just have
> > > > > to see and grow as we learn.
> > > > > 
> > > > > We'd be grateful for any input, thoughts, ideas, suggestions, you
> > > > > name it.
> > > > > 
> > > > > Best regards, 
> > > > > Patrik Martinsson,
> > > > > Sweden
> > > > > _______________________________________________
> > > > > ceph-users mailing list
> > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > 
> > 
> > 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



