> After years of using Ceph, we plan to soon build a new cluster, bigger than
> what we've done in the past. As the project is still at the planning stage,
> I'd like to have your thoughts on our design: any feedback is welcome :)
>
> ## Requirements
>
> * ~1 PB usable space for file storage, extensible in the future
> * The files are mostly "hot" data, no cold storage
> * Purpose: storage for big files, essentially used on Windows workstations
>   (10G access)
> * Performance is better :)
>
> ## Global design
>
> * 8+3 erasure-coded pool
> * ZFS on RBD, exposed via Samba shares (cluster with failover)
>
> ## Hardware
>
> * 1 rack (multi-site would be better, of course...)
> * OSD nodes: 14 x Supermicro servers
>   * 24 usable bays in 2U of rackspace
>   * 16 x 10 TB nearline SAS HDDs (8 bays left free for future needs)
>   * 2 x Xeon Silver 4212 (12C/24T)
>   * 128 GB RAM
>   * 4 x 40G QSFP+
> * Networking: 2 x Cisco N3K 3132Q or 3164Q
>   * 2 x 40G per server for the Ceph cluster network (LACP/vPC for HA)
>   * 2 x 40G per server for the public network (LACP/vPC for HA)
>   * QSFP+ DAC cables
>
> ## Sizing
>
> If we've done the maths right, we expect to have:
>
> * 2.24 PB of raw storage, extensible to 3.36 PB by adding HDDs
> * 1.63 PB of expected usable space with 8+3 EC, extensible to 2.44 PB
> * ~1 PB of usable space if we keep OSD utilisation under 66% so that we can
>   lose nodes without problems, extensible to 1.6 PB (same condition)
>
> ## Reflections
>
> * We're used to running mon and mgr daemons on a few of our OSD nodes,
>   without any issue so far: is this a bad idea for a big cluster?
> * We thought about cache tiering on an SSD pool, but a large part of the PB
>   is used on a daily basis, so we expect the cache to be not very effective
>   and really expensive?
> * Could a 2x10G network be enough?

I would say yes: those slow disks will not deliver more anyway.

This is going to be a relatively "slow" setup with a limited amount of
read-caching. With 16 drives per node and 128 GB of memory, there will only be
a few GB of read cache per OSD, meaning that nearly all reads and writes will
hit the slow drives underneath. And they do so in a "double slow" fashion: a
write hits 8+3 OSDs and has to wait for the acks to come back through the
primary, and likewise a read has to gather its 8 data chunks before anything
is returned to the client. Depending on the workload this may just work for
you, but it is definitely not fast.

Suggestions for improvements:

* Hardware RAID with a battery-backed write cache: lets the OSD ack writes
  before they hit spinning rust.
* More memory for OSD-level read caching.
* 3x replication instead of EC
  (we have all of the above in a "similar" setup: ~1 PB, 10 OSD hosts).
* An SSD tiering pool (haven't been there, but would like to test it out).

--
Jesper
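
As a quick sanity check on the sizing figures quoted above, here is a minimal
back-of-the-envelope sketch. It only uses the node/drive counts from the
original post and assumes decimal terabytes; the `usable_pb` helper is an
illustrative name, not anything from the thread:

```python
# Back-of-the-envelope capacity check for the proposed cluster.
# Figures from the post: 14 nodes, 16 (later 24) x 10 TB drives each,
# an 8+3 erasure-coded pool, and a 66% fill ceiling to tolerate node loss.

def usable_pb(nodes, drives_per_node, drive_tb, k=8, m=3, max_fill=0.66):
    raw_pb = nodes * drives_per_node * drive_tb / 1000   # decimal TB -> PB
    ec_pb = raw_pb * k / (k + m)                         # EC efficiency is k/(k+m)
    return raw_pb, ec_pb, ec_pb * max_fill

for drives in (16, 24):
    raw, ec, safe = usable_pb(14, drives, 10)
    print(f"{drives} drives/node: raw {raw:.2f} PB, "
          f"8+3 EC {ec:.2f} PB, at 66% fill {safe:.2f} PB")

# Expected output, matching the figures in the post:
#   16 drives/node: raw 2.24 PB, 8+3 EC 1.63 PB, at 66% fill 1.08 PB
#   24 drives/node: raw 3.36 PB, 8+3 EC 2.44 PB, at 66% fill 1.61 PB
```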
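
And a rough illustration of the "double slow" point in the reply: the 11/8
write amplification follows directly from the 8+3 profile, while the per-drive
throughput is an assumed ballpark for nearline SAS drives, not a figure from
the thread:

```python
# Rough illustration of the EC write path described in the reply.
# Assumption (not from the thread): ~150 MB/s sustained per nearline SAS HDD
# for large sequential I/O; far less for small or random I/O.

K, M = 8, 3                      # erasure-coding profile from the post
write_amp = (K + M) / K          # every client byte becomes 11/8 bytes on disk
print(f"EC {K}+{M} write amplification: {write_amp:.3f}x")

hdd_mb_s = 150                   # assumed per-drive sequential throughput
drives = 14 * 16                 # 14 nodes x 16 HDDs
backend_mb_s = drives * hdd_mb_s
client_mb_s = backend_mb_s / write_amp
print(f"Theoretical aggregate client write ceiling: ~{client_mb_s / 1000:.1f} GB/s")

# A single stream is still bounded by the slowest of the 11 OSDs that must
# ack each write, which is why the setup can feel slow despite the large
# aggregate number above.
```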