Re: Building a petabyte cluster from scratch


 



> 
> ## Requirements
> 
> * ~1 PB usable space for file storage, extensible in the future
> * The files are mostly "hot" data, no cold storage
> * Purpose: storage for big files, mostly used from Windows workstations (10G access)
> * More performance is always better :)
> 
> 
> ## Global design
> 
> * 8+3 Erasure Coded pool

EC performance for RBD is going to be mediocre at best, esp. on spinners.
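For what it's worth, here is roughly what that layout implies on the CLI (pool/image names and PG counts are placeholders, and crush-failure-domain=host with 8+3 assumes at least 11 hosts).  RBD can't keep its image metadata on an EC pool, so you also need a small replicated pool plus EC overwrites:

    ceph osd erasure-code-profile set ec83 k=8 m=3 crush-failure-domain=host
    ceph osd pool create ec83-data 2048 2048 erasure ec83
    ceph osd pool set ec83-data allow_ec_overwrites true   # required for RBD on EC
    ceph osd pool create rbd-meta 64 64 replicated
    rbd pool init rbd-meta
    rbd create rbd-meta/zfs01 --size 100T --data-pool ec83-data

Every sub-stripe write becomes a read-modify-write across 11 chunks, which is where the latency pain comes from on HDDs.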

> * ZFS on RBD, exposed via samba shares (cluster with failover)

Why ZFS? Mind you, I like ZFS, but layering it on top of RBD adds overhead and complexity.

>   * 128 GB RAM

Nowhere near enough.  You’re going to want 256 GB at the very least.
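Rough back-of-the-envelope, assuming something like 24 OSDs per box (the drive count isn't quoted above, so the numbers are illustrative only):

    # BlueStore defaults to a 4 GiB memory target per OSD
    ceph config get osd osd_memory_target
    # 24 OSDs x 4 GiB ~= 96 GiB before the OS, page cache, and recovery/backfill spikes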

> * Networking : 2 x Cisco N3K 3132Q or 3164Q
>   * 2 x 40G per server for ceph network (LACP/VPC for HA)
>   * 2 x 40G per server for public network (LACP/VPC for HA)

Don’t bother with a separate replication (cluster) network.
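In practice that just means setting public_network and never defining cluster_network; replication traffic then rides the same bonded links (the CIDR is a placeholder):

    ceph config set global public_network 10.0.0.0/24
    # deliberately no "ceph config set global cluster_network ..." anywhere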

> * We're used to running mon and mgr daemons on a few of our OSD nodes, without
>   any issue so far: is this a bad idea for a big cluster?

Contention for resources can lead to a vicious circle, and failure/maintenance of a combined mon/mgr/OSD box can be ugly.  Put your mons on something cheap: five of them, or three if you must.
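If you deploy with cephadm, pinning mons and mgrs to a handful of small boxes is just a label plus a placement spec (host names are placeholders):

    for h in mon1 mon2 mon3 mon4 mon5; do ceph orch host label add $h mon; done
    ceph orch apply mon label:mon
    ceph orch apply mgr label:mon   # mgrs can ride along on the same small hosts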

> * We thought about using cache tiering on an SSD pool, but a large part of the PB is
>   used on a daily basis, so we expect the cache to be not very effective and
>   really expensive?

Cache tiering is deprecated at best.  Not a good idea to invest in it.  If you’re going to use SSDs, there are better ways.
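E.g. use CRUSH device classes to pin whatever genuinely needs flash (metadata pools, the replicated RBD metadata pool from an EC setup, etc.) onto SSD OSDs instead of fronting HDDs with a cache tier; rule and pool names below are placeholders:

    ceph osd crush rule create-replicated on-ssd default host ssd
    ceph osd pool set rbd-meta crush_rule on-ssd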

> * Could a 2x10G network be enough?

Yes.

> * ZFS on Ceph? Any thoughts?

ZFS is great, but unless you have a specific need, it sounds like a lot of overhead and complexity.

> * Hardware RAID with battery-backed write cache - will allow OSDs to ack writes before hitting spinning rust.

Disagree.  See my litany from a few months ago.  Use a plain, IT-mode HBA.  Take the $$ you save and put it toward building your cluster out of SSDs instead of HDDs.  That way you don’t have to mess with the management hassles of maintaining and allocating external WAL+DB partitions too.
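With an all-flash, plain-HBA layout the OSD provisioning collapses to one collocated-everything command; the HDD + separate DB/WAL variant you get to skip is shown commented out (device paths are placeholders):

    ceph-volume lvm batch --bluestore /dev/sd[b-y]
    # the juggling act you avoid:
    # ceph-volume lvm batch --bluestore /dev/sd[b-y] --db-devices /dev/nvme0n1 /dev/nvme1n1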

> 3x replication instead of EC

This.  The performance of EC RBD volumes will likely disappoint you, especially on spinners.  Having suffered 3R RBD on LFF spinners, I predict that you would also be unhappy unless your use-case is only archival / backups or some other cold, latency-tolerant workload.
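If you do go 3R, the pool setup is pleasantly boring (names, PG counts, and image size are placeholders):

    ceph osd pool create rbd-3r 2048 2048 replicated
    ceph osd pool set rbd-3r size 3      # 3 is the default anyway
    rbd pool init rbd-3r
    rbd create rbd-3r/zfs01 --size 100T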














_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



