NAS on RBD

Hello,

On Tue, 9 Sep 2014 17:05:03 +1000 Blair Bethwaite wrote:

> Hi folks,
> 
> In lieu of a production-ready CephFS, I'm wondering what others in the user
> community are doing for file-serving out of Ceph clusters (if at all)?
> 
> We're just about to build a pretty large cluster - 2PB for file-based
> NAS and another 0.5PB rgw. For the rgw component we plan to dip our
> toes in and use an EC backing pool with a ~25TB (usable) 10K SAS + SSD
> cache tier.
> 
> For the file storage we're looking at mounting RBDs (out of a standard
> 3-replica pool for now) on a collection of presentation nodes, which
> will use ZFS to stripe together those RBD vdevs into a zpool which we
> can then carve datasets out of for access from NFS & CIFS clients.
> Those presentation servers will have some PCIe SSD in them for ZIL and
> L2ARC devices, and clients will be split across them depending on what
> ID domain they are coming from. Presentation server availability
> issues will be handled by mounting the relevant zpool on a spare
> server, so it won't be HA from a client perspective, but I can't see a
> way of getting that with an RBD backend.
> 
> Wondering what the collective wisdom has to offer on such a setup...
> 
I have nearly no experience with ZFS, but I'm wondering why you'd pool
things at that level when Ceph is already supplying a redundant and
resizeable block device.
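
For example (purely a sketch; the image name, size and zpool name are
made up), instead of striping several RBD vdevs you could grow a single
image and let the pool expand on top of it:

    # grow the image to 200TB (rbd resize takes megabytes by default)
    rbd resize rbd/nas0 --size 209715200
    # on the presentation node, once the kernel sees the new size
    zpool set autoexpand=on tank
    zpool online -e tank rbd0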

Wanting to use ZFS because of checksumming, which is sorely missing in
Ceph, I can understand. 

Using a CoW filesystem on top of RBD might not be a great idea either:
since the RBD image is sparsely allocated, performance is likely to be
poor until all "blocks" have actually been allocated. Maybe somebody with
experience in that area can pipe up.

Something that ties into the previous point: kernel-based RBD currently
does not support TRIM, so even if you were to use something other than
ZFS, you'd never be able to get that space back.
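
A quick way to check this on a mapped device is to look at its queue
attributes in sysfs (sketch; /dev/rbd0 assumed):

    # 0 means the device does not advertise discard support at all,
    # so fstrim and friends have nothing to work with
    cat /sys/block/rbd0/queue/discard_max_bytes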

There are HA NFS cluster examples based on Pacemaker (and usually backed
by DRBD or a SAN) on the net, and I think I've seen people here doing
things based on that, too.
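
For reference, a minimal Pacemaker/pcs resource group for such an NFS
head could look roughly like this (sketch only; the device, export path
and IP are placeholders):

    pcs resource create nas_fs ocf:heartbeat:Filesystem \
        device=/dev/drbd0 directory=/export fstype=xfs
    pcs resource create nas_nfs ocf:heartbeat:nfsserver \
        nfs_shared_infodir=/export/nfsinfo
    pcs resource create nas_ip ocf:heartbeat:IPaddr2 \
        ip=192.0.2.10 cidr_netmask=24
    pcs resource group add nas_group nas_fs nas_nfs nas_ip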

I would start with that and coerce the Ceph developers into getting that
TRIM support into the kernel; they have been thinking about it for 2 years
or so.

Another scenario might be running the NFS heads on VMs, thus using librbd
and having TRIM support (with the correct disk device type), and again
using Pacemaker to quickly fail things over.
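
With QEMU/KVM that could look something like this (sketch; pool, image
and IDs are placeholders). The discard=unmap drive option together with a
virtio-scsi disk type is what lets the guest's TRIM reach librbd:

    qemu-system-x86_64 ... \
        -drive file=rbd:rbd/nfs-head,format=raw,if=none,id=drive0,cache=writeback,discard=unmap \
        -device virtio-scsi-pci,id=scsi0 \
        -device scsi-hd,drive=drive0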

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

