Hi Christian,

Thanks for your very detailed and thorough explanation, very much appreciated.

We have definitely thought of a design where we have dedicated NVMe pools for 'high performance', as you say. At the same time I *thought* that having the journal offloaded to another device was *always* the best solution - if you mainly use spinners, have the journals on SSDs; if you mainly use SSDs, have the journals on NVMes. But that's not always the case I guess, and thanks for pointing that out.

Best regards,
Patrik Martinsson
Sweden

On Fri, 2016-10-14 at 09:59 +0900, Christian Balzer wrote:
> Hello,
>
> On Thu, 13 Oct 2016 15:46:03 +0000 Patrik Martinsson wrote:
>
> > On Thu, 2016-10-13 at 10:29 -0500, Brady Deetz wrote:
> > >
> > > 6 SSDs per NVMe journal might leave your journal in contention. Can you
> > > provide the specific models you will be using?
> >
> > Well, according to Dell, the card is called "Dell 1.6TB, NVMe, Mixed
> > Use Express Flash, PM1725", but the specs for the card are listed here:
> > http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-PowerEdge-Express-Flash-NVMe-Mixed-Use-PCIe-SSD.pdf
>
> That's a re-branded (not much, same model number) Samsung.
> Both that link and the equivalent Samsung link are not what I would
> consider professional, with their "up to" speeds.
> Because that is usually a function of the design and flash modules used,
> typically resulting in smaller drives being slower (less parallelism).
>
> Extrapolating from the 3.2TB model, we can assume that these can not write
> more than 2GB/s.
>
> If your 40Gb/s network is single ported or active/standby (you didn't
> mention), then this is fine, as 2 of these journal NVMes would be a
> perfect match.
> If it's dual-ported with MC-LAG, then you're wasting half of the potential
> bandwidth.
>
> Also these NVMes have a nice, feel-good 5 DWPD, for future reference.
>
> > Forgive me for my poor English here, but when you say "leave your
> > journal in contention", what exactly do you mean by that?
>
> He means that the combined bandwidth of your SSDs will be larger than
> that of your journal NVMes, limiting the top bandwidth your nodes can
> write at to that of the journals.
>
> In your case we're missing any pertinent details about the SSDs as well.
>
> An educated guess (size, 12Gb/s link, Samsung) makes them these:
> http://www.samsung.com/semiconductor/products/flash-storage/enterprise-ssd/MZILS1T9HCHP?ia=832
> http://www.samsung.com/semiconductor/global/file/media/PM853T.pdf
>
> So 750MB/s sequential writes; 3 of these can already handle more than your
> NVMe.
>
> However, the 1 DWPD (the PDF is more detailed and gives us a scary 0.3 DWPD
> for small I/Os) of these SSDs would definitely stop me from considering
> them.
> Unless you can quantify your write volume with certainty and it's below
> the level these SSDs can support, go for something safer, at least 3 DWPD.
>
> Quick estimate:
> 24 SSDs (replication of 3) * 1.92TB * 0.3 (worst case) = 13.8TB/day
> That's ignoring further overhead and write amplification by the FS
> (journals) and Ceph itself.
> So if your cluster sees less than 10TB of writes/day, you may at least
> assume it won't kill those SSDs within months.
>
> Your journal NVMes are incidentally a decent match endurance-wise at a
> (much more predictable) 16TB/day.
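For readers following the arithmetic above, here is a small Python sketch that reproduces Christian's back-of-envelope numbers. All inputs are the thread's own assumptions (6 SSDs of ~750MB/s behind a ~2GB/s journal NVMe, 24 effective SSDs at replica 3 with a worst-case 0.3 DWPD, and what appears to be 2 x 1.6TB x 5 DWPD journal NVMes per node for the 16TB/day figure); treat it as an illustration of the estimate, not a sizing tool.

#!/usr/bin/env python3
# Rough sketch of the bandwidth and endurance estimates discussed above.
# All figures are the thread's assumptions, not measured values.

# --- Journal contention (sequential writes) ---
ssds_per_nvme = 6
ssd_seq_write_mb_s = 750       # per-SSD sequential write from the linked Samsung spec
nvme_seq_write_mb_s = 2000     # ~2GB/s extrapolated for the 1.6TB PM1725

combined_ssd_mb_s = ssds_per_nvme * ssd_seq_write_mb_s
print(f"SSDs behind one journal: {combined_ssd_mb_s} MB/s vs NVMe {nvme_seq_write_mb_s} MB/s")
# 4500 MB/s of SSDs funnelled through a ~2000 MB/s journal -> contention.

# --- Endurance (writes per day) ---
effective_ssds = 24            # 72 data SSDs / replication factor 3
ssd_tb = 1.92
ssd_dwpd_worst = 0.3           # worst-case small-I/O figure from the PDF
print(f"Cluster SSD endurance budget: ~{effective_ssds * ssd_tb * ssd_dwpd_worst:.1f} TB/day")  # ~13.8

nvmes_per_node = 2             # assumed to be the basis of the 16TB/day figure
nvme_tb = 1.6
nvme_dwpd = 5
print(f"Journal NVMe endurance per node: {nvmes_per_node * nvme_tb * nvme_dwpd:.0f} TB/day")  # 16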
> The above is of course all about bandwidth (sequential writes), which are
> important in certain use cases and during backfill/recovery actions.
>
> Since your use case suggests more of a DB, smallish-data-transactions
> scenario, that "waste" of bandwidth may be totally acceptable.
> All my clusters certainly favor lower latency over higher bandwidth when
> having to choose between either.
>
> It comes back to use case and write volume; those journal NVMes will help
> with keeping latency low (for your DBs), so if that is paramount, go with
> that.
>
> They do feel a bit wasted (1.6TB, of which you'll use 1-200MB at most),
> though.
> Consider alternative designs where you have special pools for
> high-performance needs on NVMes and use 3+ DWPD SSDs (journals inline) for
> the rest.
>
> Also I'd use the E5-2697A v4 CPU instead with SSDs (faster baseline and
> Turbo).
>
> Christian
>
> > Best regards,
> > Patrik Martinsson
> > Sweden
> >
> > > On Oct 13, 2016 10:23 AM, "Patrik Martinsson"
> > > <patrik.martinsson@xxxxxxxxxxxxx> wrote:
> > > >
> > > > Hello everyone,
> > > >
> > > > We are in the process of buying hardware for our first Ceph cluster.
> > > > We will start with some testing and do some performance measurements
> > > > to see that we are on the right track, and once we are satisfied with
> > > > our setup we'll continue to grow it as time goes on.
> > > >
> > > > Now, I'm just seeking some thoughts on our future hardware. I know
> > > > there are a lot of these kinds of questions out there, so please
> > > > forgive me for posting another one.
> > > >
> > > > Details,
> > > > - Cluster will be in the same datacenter, multiple racks as we grow
> > > > - Typical workload (this is incredibly vague, forgive me again) would
> > > >   be an OpenStack environment hosting 150~200 VMs; we'll have quite a
> > > >   few databases for Jira/Confluence/etc., some workload coming from
> > > >   Stash/Bamboo agents, puppet master/foreman, and other typical "core
> > > >   infra stuff".
> > > >
> > > > Given the prerequisites just given, going all SSDs (and NVMe for
> > > > journals) may seem like overkill(?), but we feel like we can afford
> > > > it and it will be a benefit for us in the future.
> > > >
> > > > Planned hardware,
> > > >
> > > > Six nodes to begin with (which would give us a cluster size of ~46TB
> > > > with a default replica of three, although probably a bit bigger since
> > > > the VMs would be backed by an erasure coded pool) will look something
> > > > like,
> > > > - 1x Intel E5-2695 v4 2.1GHz, 45M Cache, 18 Cores
> > > > - 2x Dell 64GB RDIMM 2400MT
> > > > - 12x Dell 1.92TB Mixed Use MLC 12Gbps (separate OS disks)
> > > > - 2x Dell 1.6TB NVMe Mixed Use (6 OSDs per NVMe)
> > > >
> > > > Network between all nodes within a rack will be 40Gbit (and 200Gbit
> > > > between racks), backed by Juniper QFX5200-32C.
> > > >
> > > > Rather than asking the question,
> > > > - "Does this seem reasonable for our workload?",
> > > >
> > > > I want to ask,
> > > > - "Is there any reason *not* to have a setup like this? Are there any
> > > >   obvious bottlenecks or flaws that we are missing, or could this very
> > > >   well work as a good start (with the ability to grow by adding more
> > > >   servers)?"
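As a side note on the ~46TB figure in the quoted plan, here is a minimal sketch of how it seems to be derived. It assumes all 12 of the 1.92TB SSDs per node are data OSDs (the OS lives on separate disks) and a replica count of 3; erasure-coded pools would change the ratio.

#!/usr/bin/env python3
# Illustrative usable-capacity estimate for the planned six-node cluster.
nodes = 6
data_ssds_per_node = 12   # the 1.92TB drives; OS lives on separate disks
ssd_tb = 1.92
replica = 3

raw_tb = nodes * data_ssds_per_node * ssd_tb
print(f"Raw capacity: {raw_tb:.2f} TB")                     # 138.24 TB
print(f"Usable at replica {replica}: ~{raw_tb / replica:.1f} TB")  # ~46.1 TB, matching the ~46TB above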
> > > > When it comes to workload-wise issues, I think we'll just have to
> > > > see and grow as we learn.
> > > >
> > > > We'll be grateful for any input, thoughts, ideas, suggestions, you
> > > > name it.
> > > >
> > > > Best regards,
> > > > Patrik Martinsson,
> > > > Sweden

--
Kind regards,
Patrik Martinsson
0707 - 27 64 96

System Administrator Linux
Genuine Happiness

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com