Re: Questions about using existing HW for PoC cluster

Thanks for contributing your knowledge to that book, Anthony - really enjoying it :)

I didn't mean to use the OS SSD for Ceph - I would buy a second SSD per server for that. I will take a look at SATA SSD prices; hopefully the smaller ones (<500GB) will be at an acceptable price so that I can buy one (or even two) for each server. I'd love to run two for the OS (md mirror) and two more for Ceph, but that would probably add up to more money than I'd want to ask for. I was going to check SMART on the existing SSDs; since they are Intel SSDs, there's also an Intel tool ( https://www.intel.com/content/www/us/en/support/articles/000006289/memory-and-storage.html ) that I was going to use. In any case, I will probably re-use the existing OS SSD for the new OS and add 1-2 new SSDs for Ceph per OSD server.
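
For reference, the quick wear check I have in mind is just something along these lines (device path is an example; attribute names vary by model, e.g. Intel drives expose Media_Wearout_Indicator):

    # Dump SMART data and pull out the wear / write-volume attributes
    smartctl -a /dev/sda | grep -i -E 'wearout|wear_level|lbas_written|host_writes'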

I also think a new SSD per mon would be doable; would 500GB - 1TB be OK?

Usually for a storage system I'd be using some sort of Intel DC drives, but I may go with Samsung 8xx Pros for this to keep the price lower.

I mean to use CephFS on this PoC. The initial use would be to back up an existing ZFS server with ~43TB of data (I may have to limit the backed-up data depending on how much capacity I can get out of the OSD servers) and then share it out via NFS as a read-only copy. That would give me some read/write I/O numbers and let me test different aspects of Ceph before I go pitching it as a primary data storage technology (it will be our org's first foray into SDS, and I want it to succeed.)
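
The read-only NFS piece I'm picturing is nothing fancy, roughly like this on a gateway box (mon address, client name, paths and subnet are all placeholders; nfs-ganesha with the Ceph FSAL would be the other route):

    # Mount CephFS with the kernel client on the NFS gateway host
    mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs -o name=backup,secretfile=/etc/ceph/backup.secret

    # /etc/exports on the gateway: re-export read-only to the client subnet
    #   /mnt/cephfs  192.168.1.0/24(ro,sync,no_subtree_check)

    # Reload the export table
    exportfs -ra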

No way I'd go primary production storage with this motley collection of "pre-loved" equipment :) If it all seems to work well, I think I could get a reasonable budget for new production-grade gear.

From what I've read so far in the book and in the prior list posts, I'd probably do a 2x10G bond to the common 10G switch that serves the cluster this would be a part of. Do the mon servers need 10G NICs too? If so, I may have to scrounge some 10Gbase-T NICs from other servers to give to them (they only have dual 1G NICs on the mobo.)
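
For the bond itself I'm thinking plain 802.3ad, something like the below just to test with (interface names and address are placeholders; I'd persist it via netplan/ifupdown on whatever OS ends up on these boxes):

    # Create an LACP bond from the two 10G ports and bring it up
    ip link add bond0 type bond mode 802.3ad miimon 100 xmit_hash_policy layer3+4
    ip link set enp3s0f0 down && ip link set enp3s0f0 master bond0
    ip link set enp3s0f1 down && ip link set enp3s0f1 master bond0
    ip link set bond0 up
    ip addr add 10.0.0.11/24 dev bond0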

Thanks again!
Will

-----Original Message-----
From: Anthony D'Atri [mailto:aad@xxxxxxxxxxxxxx] 
Sent: Sunday, January 27, 2019 6:32 PM
To: Will Dennis
Cc: ceph-users
Subject: Re:  Questions about using existing HW for PoC cluster

> Been reading “Learning Ceph - Second Edition”

An outstanding book, I must say ;)

> So can I get by with using a single SATA SSD (size?) per server for RocksDB / WAL if I'm using Bluestore?

Depends on the rest of your setup and use-case, but I think this would be a bottleneck.  Some thoughts:

* You wrote that your servers have 1x 240GB SATA SSD that has the OS, and 8x 2TB SATA OSD drives.

** Sharing the OS drive with journal/metadata could lead to contention between the two
** Since the OS has been doing who-knows-what with that drive, check the lifetime used/remaining with `smartctl -a`.
** If they’ve been significantly consumed, their lifetime with the load Ceph will present will be limited.
** SSDs selected for OS/boot drives often have relatively low durability (DWPD) and may hit a performance cliff when given a steady load.  Look up the specs on your model.
** 8 OSDs sharing a single SSD for metadata is a very large failure domain.  If/when you lose that SSD, you lose all 8 OSDs and the host itself.  You would want to set the subtree limit to “host”, and not fill the OSDs past, say, 60% so that you’d have room to backfill in case of a failure not caught by the subtree limit (see the config sketch after this list).
** 8 HDD OSDs sharing a single SATA SSD for metadata will be bottlenecked unless your workload is substantially reads.

* Single SATA HDD on the mons

** When it fails, you lose the mon
** I have personally seen issues due to HDDs not handling peak demands, resulting in an outage
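
To make the subtree-limit / utilization advice above concrete, the knobs I mean are roughly these (values are illustrative):

    # ceph.conf, [mon] section: don't auto-mark OSDs out when an entire host drops
    #   [mon]
    #   mon osd down out subtree limit = host

    # And keep a conservative warning threshold so you notice long before
    # you've eaten the headroom you'd need to backfill a failed SSD or host
    ceph osd set-nearfull-ratio 0.66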

The gear you list is fairly old and underpowered, but sure you could use it *as a PoC*.  For a production deployment you’d want different hardware.

> - Is putting the journal on a partition of the SATA drives a real I/O killer? (this is how my Proxmox boxes are set up)

With Filestore and HDDs, absolutely.  Even worse if you were to use EC.  There may be some coalescing of ops, but you’re still going to get a *lot* of long seeks, and spinners can only do a certain number of IOPs.  I think in the book I described personal experience with such a setup that even tickled a design flaw on the part of a certain HDD vendor.  Eventually I was permitted to get journal devices (this was pre-BlueStore GA), which were PCIe NVMe.  Write performance doubled.  Then we hit a race condition / timing issue in nvme.ko, but I digress...

When using SATA *SSD*s for OSDs, you have no seeks of course, and colocating the journals/metadata is more viable.

> - If YES to the above, then is a SATA SSD acceptable for journal device, or should I definitely consider PCIe SSD? (I'd have to limit to one per server, which I know isn't optimal, but price prevents otherwise…)

Optanes for these systems would be overkill.  If you would plan to have the PoC cluster run any appreciable load for any length of time, I might suggest instead adding 2x SATA SSDs per, so you could map 4x OSDs to each.  These would not need to be large:  upstream party line would have you allocate 80GB on each, though depending on your use-case you might well do fine with less, 2x 240GB class or even 2x 120GB class should suffice for PoC service.  For production I would advise “enterprise” class drives with at least 1 DWPD durability — recently we’ve seen a certain vendor weasel their durability by computing it incorrectly.
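
For reference, carving those SSDs up per-OSD with ceph-volume looks roughly like this (device and LV names are placeholders; one such DB LV per HDD, ~80GB each as above):

    # One HDD OSD with its RocksDB/WAL on a pre-created LV on the SATA SSD.
    # The WAL lives with the DB when --block.wal isn't specified.
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db ssd_vg/db_sdb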

Depending on what you expect out of your PoC, and especially assuming you use BlueStore, you might get away with colocation, but do not expect performance that can be extrapolated for a production deployment.  

With the NICs you have, you could keep it simple and skip a replication/back end network altogether, or you could bond the LOM ports and split them.  Whatever’s simplest with the network infrastructure you have.  For production you’d most likely want LACP bonded NICs, but if the network tech is modern, skipping the replication network may be very feasible.  But I’m getting ahead of your context …
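
If you do skip the replication network, the ceph.conf side of that is simply not defining one; a flat setup is just (addresses are placeholders):

    [global]
    public network = 10.0.0.0/24
    # cluster network = 10.0.1.0/24   # only if/when you split replication traffic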

HTH
— aad




