Hi Christian,

Thanks for your very detailed and thorough explanation, very much appreciated.

We have definitely thought of a design where we have dedicated NVMe pools for 'high performance', as you say. At the same time I *thought* that having the journal offloaded to another device was *always* the best solution - if you mainly use spinners, have the journals on SSDs; if you mainly use SSDs, have the journals on NVMes. But that's not always the case I guess, and thanks for pointing that out.

Best regards,
Patrik Martinsson
Sweden

On Fri, 2016-10-14 at 09:59 +0900, Christian Balzer wrote:
> Hello,
>
> On Thu, 13 Oct 2016 15:46:03 +0000 Patrik Martinsson wrote:
>
> > On Thu, 2016-10-13 at 10:29 -0500, Brady Deetz wrote:
> > >
> > > 6 SSDs per NVMe journal might leave your journal in contention. Can you
> > > provide the specific models you will be using?
> >
> > Well, according to Dell, the card is called "Dell 1.6TB, NVMe, Mixed
> > Use Express Flash, PM1725", but the specs for the card are listed here:
> > http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-PowerEdge-Express-Flash-NVMe-Mixed-Use-PCIe-SSD.pdf
>
> That's a re-branded (not much, same model number) Samsung.
> Both that link and the equivalent Samsung link are not what I would
> consider professional, with their "up to" speeds.
> Because that is usually a function of the design and flash modules used,
> typically resulting in smaller drives being slower (less parallelism).
>
> Extrapolating from the 3.2TB model, we can assume that these can not write
> more than 2GB/s.
>
> If your 40Gb/s network is single ported or active/standby (you didn't
> mention), then this is fine, as 2 of these journal NVMes would be a
> perfect match.
> If it's dual-ported with MC-LAG, then you're wasting half of the potential
> bandwidth.
>
> Also these NVMes have a nice, feel-good 5 DWPD, for future reference.
>
> > Forgive me for my poor English here, but when you say "leave your
> > journal in contention", what exactly do you mean by that?
>
> He means that the combined bandwidth of your SSDs will be larger than
> that of your journal NVMes, limiting the top bandwidth your nodes can
> write at to that of the journals.
>
> In your case we're missing any pertinent details about the SSDs as well.
>
> An educated guess (size, 12Gb/s link, Samsung) makes them these:
> http://www.samsung.com/semiconductor/products/flash-storage/enterprise-ssd/MZILS1T9HCHP?ia=832
> http://www.samsung.com/semiconductor/global/file/media/PM853T.pdf
>
> So 750MB/s sequential writes; 3 of these can already handle more than your
> NVMe.
>
> However, the 1 DWPD (the PDF is more detailed and gives us a scary 0.3 DWPD
> for small I/Os) of these SSDs would definitely stop me from considering
> them.
> Unless you can quantify your write volume with certainty and it's below
> the level these SSDs can support, go for something safer, at least 3 DWPD.
>
> Quick estimate:
> 24 SSDs (replication of 3) * 1.92TB * 0.3 (worst case) = 13.8TB/day
> That's ignoring further overhead and write amplification by the FS
> (journals) and Ceph itself.
> So if your cluster sees less than 10TB of writes/day, you may at least
> assume it won't kill those SSDs within months.
>
> Your journal NVMes are incidentally a decent match endurance-wise at a
> (much more predictable) 16TB/day.
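For readers following the arithmetic above, here is a small Python sketch that reproduces Christian's back-of-envelope numbers. All inputs are the thread's own assumptions (6 SSDs of ~750MB/s behind a ~2GB/s journal NVMe, 24 effective SSDs at replica 3 with a worst-case 0.3 DWPD, and what appears to be 2 x 1.6TB x 5 DWPD journal NVMes per node for the 16TB/day figure); treat it as an illustration of the estimate, not a sizing tool.

#!/usr/bin/env python3
# Rough sketch of the bandwidth and endurance estimates discussed above.
# All figures are the thread's assumptions, not measured values.

# --- Journal contention (sequential writes) ---
ssds_per_nvme = 6
ssd_seq_write_mb_s = 750       # per-SSD sequential write from the linked Samsung spec
nvme_seq_write_mb_s = 2000     # ~2GB/s extrapolated for the 1.6TB PM1725

combined_ssd_mb_s = ssds_per_nvme * ssd_seq_write_mb_s
print(f"SSDs behind one journal: {combined_ssd_mb_s} MB/s vs NVMe {nvme_seq_write_mb_s} MB/s")
# 4500 MB/s of SSDs funnelled through a ~2000 MB/s journal -> contention.

# --- Endurance (writes per day) ---
effective_ssds = 24            # 72 data SSDs / replication factor 3
ssd_tb = 1.92
ssd_dwpd_worst = 0.3           # worst-case small-I/O figure from the PDF
print(f"Cluster SSD endurance budget: ~{effective_ssds * ssd_tb * ssd_dwpd_worst:.1f} TB/day")  # ~13.8

nvmes_per_node = 2             # assumed to be the basis of the 16TB/day figure
nvme_tb = 1.6
nvme_dwpd = 5
print(f"Journal NVMe endurance per node: {nvmes_per_node * nvme_tb * nvme_dwpd:.0f} TB/day")  # 16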
> The above is of course all about bandwidth (sequential writes), which are
> important in certain use cases and during backfill/recovery actions.
>
> Since your use case suggests more of a DB, smallish-data-transactions
> scenario, that "waste" of bandwidth may be totally acceptable.
> All my clusters certainly favor lower latency over higher bandwidth when
> having to choose between either.
>
> It comes back to use case and write volume; those journal NVMes will help
> with keeping latency low (for your DBs), so if that is paramount, go with
> that.
>
> They do feel a bit wasted (1.6TB, of which you'll use 1-200MB at most),
> though.
> Consider alternative designs where you have special pools for
> high-performance needs on NVMes and use 3+ DWPD SSDs (journals inline) for
> the rest.
>
> Also I'd use the E5-2697A v4 CPU instead with SSDs (faster baseline and
> Turbo).
>
> Christian
>
> > Best regards,
> > Patrik Martinsson
> > Sweden
> >
> > > On Oct 13, 2016 10:23 AM, "Patrik Martinsson"
> > > <patrik.martinsson@xxxxxxxxxxxxx> wrote:
> > > >
> > > > Hello everyone,
> > > >
> > > > We are in the process of buying hardware for our first Ceph cluster.
> > > > We will start with some testing and do some performance measurements
> > > > to see that we are on the right track, and once we are satisfied with
> > > > our setup we'll continue to grow it as time goes on.
> > > >
> > > > Now, I'm just seeking some thoughts on our future hardware. I know
> > > > there are a lot of these kinds of questions out there, so please
> > > > forgive me for posting another one.
> > > >
> > > > Details,
> > > > - Cluster will be in the same datacenter, multiple racks as we grow
> > > > - Typical workload (this is incredibly vague, forgive me again) would
> > > >   be an OpenStack environment hosting 150~200 VMs; we'll have quite a
> > > >   few databases for Jira/Confluence/etc., some workload coming from
> > > >   Stash/Bamboo agents, puppet master/foreman, and other typical "core
> > > >   infra stuff".
> > > >
> > > > Given the prerequisites just given, going all SSDs (and NVMe for
> > > > journals) may seem like overkill(?), but we feel like we can afford
> > > > it and it will be a benefit for us in the future.
> > > >
> > > > Planned hardware,
> > > >
> > > > Six nodes to begin with (which would give us a cluster size of ~46TB
> > > > with a default replica of three, although probably a bit bigger since
> > > > the VMs would be backed by an erasure coded pool) will look something
> > > > like,
> > > > - 1x Intel E5-2695 v4 2.1GHz, 45M Cache, 18 Cores
> > > > - 2x Dell 64GB RDIMM 2400MT
> > > > - 12x Dell 1.92TB Mixed Use MLC 12Gbps (separate OS disks)
> > > > - 2x Dell 1.6TB NVMe Mixed Use (6 OSDs per NVMe)
> > > >
> > > > Network between all nodes within a rack will be 40Gbit (and 200Gbit
> > > > between racks), backed by Juniper QFX5200-32C.
> > > >
> > > > Rather than asking the question,
> > > > - "Does this seem reasonable for our workload?",
> > > >
> > > > I want to ask,
> > > > - "Is there any reason *not* to have a setup like this? Are there any
> > > >   obvious bottlenecks or flaws that we are missing, or could this very
> > > >   well work as a good start (with the ability to grow by adding more
> > > >   servers)?"
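As a side note on the ~46TB figure in the quoted plan, here is a minimal sketch of how it seems to be derived. It assumes all 12 of the 1.92TB SSDs per node are data OSDs (the OS lives on separate disks) and a replica count of 3; erasure-coded pools would change the ratio.

#!/usr/bin/env python3
# Illustrative usable-capacity estimate for the planned six-node cluster.
nodes = 6
data_ssds_per_node = 12   # the 1.92TB drives; OS lives on separate disks
ssd_tb = 1.92
replica = 3

raw_tb = nodes * data_ssds_per_node * ssd_tb
print(f"Raw capacity: {raw_tb:.2f} TB")                     # 138.24 TB
print(f"Usable at replica {replica}: ~{raw_tb / replica:.1f} TB")  # ~46.1 TB, matching the ~46TB above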
> > > > When it comes to workload-wise issues, I think we'll just have to
> > > > see and grow as we learn.
> > > >
> > > > We'll be grateful for any input, thoughts, ideas, suggestions, you
> > > > name it.
> > > >
> > > > Best regards,
> > > > Patrik Martinsson,
> > > > Sweden

--
Kind regards,
Patrik Martinsson
0707 - 27 64 96

System Administrator Linux
Genuine Happiness

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com