Re: Building a petabyte cluster from scratch

Hi Fabien,

ZFS on top of RBD really makes me shudder. ZFS expects to have individual disk devices that it can manage. With this config it thinks it has them, but Ceph is masking the real data behind it.

As has been said before, why not just serve Samba directly from CephFS and remove that layer of complexity in the middle?

You haven't mentioned any NVMe in the servers for RocksDB and the WAL. This can make a significant performance difference.

The memory you have in the servers will work for the number of drives you plan to deploy, but if you increase to the full complement of 24 drives you are likely to be tight on memory.
This will be a problem on servers where you want to run other services, if you go down the route of deploying them on the OSD nodes.
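
A rough back-of-the-envelope sketch of that, assuming the BlueStore default osd_memory_target of about 4 GiB per OSD (adjust for your own settings):

    # OSD memory estimate; 4 GiB/OSD is the assumed BlueStore default target
    osd_memory_target = 4      # GiB per OSD (tune to your config)
    total_ram = 128            # GB of RAM per server
    for drives in (16, 24):
        osd_ram = drives * osd_memory_target
        print(f"{drives} OSDs: ~{osd_ram} GiB for OSDs, "
              f"~{total_ram - osd_ram} GiB left for the OS and other daemons")

With 24 drives that leaves only around 32 GiB for the OS and any colocated mons/mgrs, which is why it gets tight.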

I can't seem to find 4212 processors, but I would go with the 4215 if you are bound to Intel, as you get a higher clock speed although with fewer cores. You could also deploy a single AMD processor per server.

The network is well over-specified for what you are doing. As you are looking at deploying switches along with it, why not some 25G switches with 40G or 100G uplink ports to connect to the core?
For this use case I would look at 2 x 25G ports per server. That would give you far more performance, and 10/25G cards are pretty much the standard these days.
Cabling up 4 QSFP cables for each of 14 servers is a back-of-rack cabling nightmare: 56 thick cables just for networking.

10TB drives may not be the most cost-efficient point to be at, but you may need the drive count for performance.

I think your usable space calculations are off.

14 servers x 16 drives x 10TB is 2240TB, as you have said.
Allowing for the 8+3 EC parity overhead (8/11 of raw), you end up at 1.63PB.

So, a couple of things here.

You can't fill a cluster to 100%, and do you really mean PB or PiB?

There is roughly a 7% disk space overhead when converting from disk manufacturers' TB to operating system TiB.

Filling a system past 85% will start giving you alerts as well.

So your 1.63PB x 0.93 x 0.85 gets you to roughly 1287TiB of usable space, quite a bit less than the 1.63PB you have mentioned. How much space do you need?
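
Putting the same arithmetic in one place (a sketch using the factors above; substitute your own ratios):

    # Usable-space estimate using the factors discussed above (illustrative)
    raw_tb    = 14 * 16 * 10         # servers * drives * TB per drive = 2240 TB
    ec_usable = raw_tb * 8 / 11      # 8+3 EC keeps 8 of every 11 chunks
    tib       = ec_usable * 0.93     # ~7% vendor-TB to TiB conversion from above
    practical = tib * 0.85           # stay under the ~85% nearfull warning
    print(round(ec_usable), round(tib), round(practical))
    # -> 1629 1515 1288  (roughly the ~1287 TiB figure above)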

I would also consider what happens if you have a node failure, and whether you should reserve one node's worth of capacity to allow for things like disk or node failures. As the cluster gets bigger, the likelihood of that sort of failure increases.
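
As a rough illustration of what holding back one node's worth of capacity costs (recovery behaviour depends on your CRUSH rules, so treat this as a sketch):

    # Reserving one node out of 14 for failure headroom (illustrative only)
    nodes      = 14
    usable_tib = 1288                 # from the estimate above
    reserve    = usable_tib / nodes   # roughly one node's share of capacity
    print(round(reserve), round(usable_tib - reserve))
    # -> 92 1196  (TiB held back vs. TiB you can actually plan around)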

On 03/12/2019, 23:44, "Fabien Sirjean" <fsirjean@xxxxxxxxxxxx> wrote:

    Hi Ceph users !
    
    After years of using Ceph, we plan to soon build a new cluster, bigger than what
    we've done in the past. As the project is still at the planning stage, I'd like to
    have your thoughts on our planned design: any feedback is welcome :)
    
    
    ## Requirements
    
     * ~1 PB usable space for file storage, extensible in the future
     * The files are mostly "hot" data, no cold storage
     * Purpose: storage for big files, mostly used from Windows workstations (10G access)
     * The more performance, the better :)
    
    
    ## Global design
    
     * 8+3 Erasure Coded pool
     * ZFS on RBD, exposed via samba shares (cluster with failover)

    
    
    ## Hardware
    
     * 1 rack (multi-site would be better, of course...)
    
     * OSD nodes : 14 x supermicro servers
       * 24 usable bays in 2U rackspace
       * 16 x 10 TB nearline SAS HDD (8 bays for future needs)
       * 2 x Xeon Silver 4212 (12C/24T)
       * 128 GB RAM
       * 4 x 40G QSFP+
    
     * Networking : 2 x Cisco N3K 3132Q or 3164Q
       * 2 x 40G per server for ceph network (LACP/VPC for HA)
       * 2 x 40G per server for public network (LACP/VPC for HA)
       * QSFP+ DAC cables
    
    
    ## Sizing
    
    If we've done the maths well, we expect to have:
    
     * 2.24 PB of raw storage, extensible to 3.36 PB by adding HDD
     * 1.63 PB expected usable space with 8+3 EC, extensible to 2.44 PB
     * ~1 PB of usable space if we want to keep OSD use under 66% to allow
       losing nodes without problems, extensible to 1.6 PB (same condition)
    
    
    ## Reflections
    
     * We're used to running mon and mgr daemons on a few of our OSD nodes, without
       any issue so far: is this a bad idea for a big cluster?
    
     * We thought about using cache tiering on an SSD pool, but a large part of the PB is
       used on a daily basis, so we expect the cache would not be very effective and
       would be really expensive?
    
     * Could a 2x10G network be enough?

     * ZFS on Ceph? Any thoughts?

     * What about CephFS? We'd like to use RBD diff for backups, but it looks
       impossible to use snapshot diffs with CephFS?
    
    
    Thanks for reading, and sharing your experiences!
    
    F.
    

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



