I am planning the following new gluster 3.3.1 deployment; please let me know whether I should rethink any of my plans. If you don't think what I'm planning is a good idea, I will need concrete reasons.

* Dell R720xd servers with two internal OS drives and 12 hot-swap external 3.5 inch bays.
* Fedora 18 alpha, to be upgraded to Fedora 18 when it is released.
* 2TB simple LVM volumes for bricks.
* A combination of 4TB disks (two bricks per drive) and 2TB disks.
* Distributed-Replicated volume, replica 2.
* The initial two servers will have 4TB disks. As we free up existing 2TB SAN drives, additional servers will be added with those drives.
* Brick filesystems will be mounted under /bricks on each server. Subdirectories within those filesystems will be used as the actual brick paths on create/add commands.

Now for the really controversial part of my plans: left-hand brick filesystems (listed first in each replica set) will be XFS; right-hand bricks will be BTRFS. The idea here is that we will have one copy of the volume on a fully battle-tested and reliable filesystem, and another copy stored in a way that lets us take periodic snapshots for last-ditch "oops" recovery. Because of the distributed nature of the filesystem, using those snapshots will not be straightforward, but it will be POSSIBLE.

To answer another question that I'm sure will come up: why no RAID? There are a few reasons:

* We will not be able to fill all drive bays initially. Although Dell lets you add disks and grow the RAID volume, and Linux probably lets you grow the filesystem once that's done, it is a long, drawn-out process with horrible performance penalties while it is happening. By putting bricks on the disks directly, we do not have to deal with this.
* Performance. RAID 5/6 comes with a severe performance penalty during sustained writes -- writing more data than will fit in your RAID controller's cache memory. Also, if a disk fails, performance is greatly impacted during the entire rebuild process, which for a 4TB disk is likely to take a few days.
* Disk space loss. With RAID5 on 4TB disks, we would lose 4TB of usable space for each server pair; with RAID6, that would be 8TB per server pair. For a fully populated server, that means 40TB instead of 48TB. The bean counters are technically clueless, but they understand those numbers.

Is this a reasonable plan? If not, why?

Thanks,
Shawn
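
P.S. For concreteness, here is a rough sketch of the per-brick setup and the volume create I have in mind for the first server pair. The hostnames (server1/server2), device names, volume group name, and volume name below are placeholders, not final choices:

    # On each server: carve 2TB LVs out of a 4TB drive, to be mounted under /bricks
    pvcreate /dev/sdb
    vgcreate vg_bricks /dev/sdb
    lvcreate -L 2T -n brick1 vg_bricks
    lvcreate -L 2T -n brick2 vg_bricks

    # server1 (left-hand copy): XFS, larger inodes to leave room for gluster's xattrs
    mkfs.xfs -i size=512 /dev/vg_bricks/brick1
    # server2 (right-hand copy): BTRFS, so the brick can be snapshotted
    mkfs.btrfs /dev/vg_bricks/brick1

    mkdir -p /bricks/brick1 /bricks/brick2
    mount /dev/vg_bricks/brick1 /bricks/brick1
    mkdir /bricks/brick1/data     # subdirectory used as the actual brick path

    # From server1: with replica 2, each consecutive pair of bricks forms a replica
    # set, so listing server1 first in each pair keeps the XFS copy on the left
    gluster peer probe server2
    gluster volume create testvol replica 2 \
        server1:/bricks/brick1/data server2:/bricks/brick1/data \
        server1:/bricks/brick2/data server2:/bricks/brick2/data
    gluster volume start testvol

The right-hand bricks would then get their periodic "oops" snapshots with btrfs subvolume snapshot on each brick filesystem.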