At 04:00 AM 12/3/2008, Stas Oskin wrote:
>Hi.
>
>Thanks for your detailed answers. I'd like to clarify several points:
>
>2008/12/3 Keith Freedman <freedman at freeformit.com>
>I'm not sure there's an official recommendation.
>I use XFS with much success.
>
>Is XFS suitable for massive writing / occasional reading?

XFS is better suited than EXT3 or ReiserFS for write-heavy environments.
Some useful information is here:
http://www.ibm.com/developerworks/library/l-fs9.html
I'd pay close attention to the "Delayed allocation" section.

>I think the choice of underlying filesystem depends highly on the
>types of data you'll be storing and how you'll be storing the info.
>If it's primarily read data, then a filesystem with journaling
>capabilities may not provide much benefit. If you'll have lots of
>files in few directories, then a filesystem with better large-directory
>metrics would be ideal, etc. Gluster depends on the underlying
>filesystem, and will work no matter what that filesystem is, provided
>it supports extended attributes.
>
>I'm going to store mostly large files (100+ MB), with massive
>writing and only occasional read operations.
>
>I've found XFS works great for most purposes. If you're on Solaris,
>I'd recommend ZFS. It seems people are fond of ReiserFS, but
>you could certainly use EXT3 with extended attributes enabled and
>most likely be just fine.
>
>I'd actually prefer to stay on Linux. How well does XFS compare to
>EXT3 in the environment I described?

They're all Linux filesystems, so that's not the issue.

>As for LVM: again, this really depends what you want to do with the data.
>If you need to use multiple physical devices/partitions to present
>just one to gluster, you can do that and use LVM to manage your
>resizing of the single logical volume.
>
>This was the first idea I thought of, as I'm going to use 4 disks
>per server.
>
>Alternatively, you could use gluster's Unify translator to present
>one effective large/consolidated volume which can be made up of
>multiple devices/partitions.
>
>I think I read somewhere on this mailing list that there is a
>migration from Unify to DHT in GlusterFS (whatever that means) in the
>coming 1.4. If Unify is the legacy approach, what is the relevant
>solution for 1.4 (DHT)?

The approach is the same. I believe the concept is that there's a
translator that groups multiple smaller filesystem pieces into a
single representation. Gluster lets you do this through the
filesystem, whereas LVM lets you do it through the block devices.

Personally, I'd go with LVM, since it's likely easier to manage in the
long run and gives you more flexibility. You can grow your LVM volume
and, if you go with XFS, dynamically resize your filesystem, and you
won't have to make any changes to your gluster config.
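To make that concrete, here's a rough sketch of what I mean by LVM + XFS
(the device names, sizes, and mount point below are just made-up
examples; adjust them for your hardware):

  # one volume group across the 4 disks, one logical volume, XFS on top
  pvcreate /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
  vgcreate gluster_vg /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
  lvcreate -n gluster_lv -L 500G gluster_vg
  mkfs.xfs /dev/gluster_vg/gluster_lv
  mount /dev/gluster_vg/gluster_lv /export/gluster

  # later, if you add a disk, everything grows underneath gluster
  # without touching the gluster config; XFS grows while mounted
  pvcreate /dev/sdf1
  vgextend gluster_vg /dev/sdf1
  lvextend -L +250G /dev/gluster_vg/gluster_lv
  xfs_growfs /export/gluster

Your gluster server volume just keeps exporting /export/gluster and
never notices the resize.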
>In this scenario, you could potentially have multiple underlying
>configurations. You could Unify xfs, reiser, and ext3 filesystems
>into one gluster filesystem.
>
>As for RAID: again, the faster and more appropriately configured the
>underlying system is for your data requirements, the better off you
>will be. If you're going to use gluster's AFR translator, then I'd
>not bother with hardware RAID/mirroring and just use RAID0 stripes.
>However, if you have the money and can afford to do RAID0+1, that's
>always a huge benefit to read performance. Of course, if you're in
>a high-write environment, there's no real added value, so it's not
>worth doing.
>
>Couple of points here:
>1) Thanks to AFR, I actually don't need any fault-tolerant RAID
>(like mirroring), so it's only recommended in high-volume read
>environments, which is not the case here. Is this correct?

You can use AFR as your fault tolerance/mirror. However, be aware that
this means your "mirroring" will be going at network speed. If you have
no need for multiple servers with live replicated data, you'll be much
better off, performance-wise, using hardware mirroring. However, if you
want or need multiple servers serving identical data, then just use AFR
and you can live without hardware mirroring.

I'm not sure how gluster/AFR will perform in a very-large-file,
high-write environment. We'll have to see what the gluster devs say
about it, but what I can say is this: if your AFR servers lose contact
and later have to auto-heal, gluster will have to move the entire large
file. As far as I know, it doesn't have rsync-like capabilities wherein
it would only move the modified bits of the file over the network; I
believe it just copies over the whole thing, so if this happens a lot,
it will bog things down significantly.

>2) Isn't LVM (or GlusterFS's own solution) much better than RAID 0 in
>the sense that if one of the disks goes, the volume still continues to
>work? This is contrary to RAID, where the whole volume goes down?

You're confused about what RAID means. Yes, with RAID0 (striping) there
is no redundancy. RAID1 (mirroring) provides redundancy, and if one
drive fails the volume still functions; you can do this with hardware
or, I believe, with LVM. Then there's RAID0+1 (striping & mirroring),
which provides the performance benefit of striping with the high
availability of mirroring. So whether you use LVM for your RAID or a
hardware RAID controller doesn't change anything: with RAID0 a single
drive failure takes the volume down, with RAID1 you can withstand a
drive failure.

>3) Continuing 2, I think I actually meant JBOD, where you just
>connect all the drives and make them look like a single device, rather
>than striping.

Right, but this presents the same issues as striping, without the
performance benefit of striping. Let's say you have AFR set up and you
have a 4-disk striped or concatenated (JBOD) volume on each of 2
servers. If you have a single drive failure on one server, that entire
filesystem becomes unavailable. When you repair the drive, you
effectively have a blank, empty filesystem; gluster/AFR will notice this
and start auto-healing the entire filesystem (as each directory and file
is accessed), so over time you'll have copied the entire filesystem over
the network. However, if you have a single server and you mirror your
devices in a RAID1/0+1 config, then when you lose a drive your
filesystem is still running: replace the drive and the RAID software
fixes everything.

AFR is most valuable in high-read environments, since you can distribute
the load across multiple servers and specify a local read volume to
ensure a particular client always uses the fastest server (which could
be its own local brick, or a server on the LAN when you're using AFR
across a WAN).
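For reference, a minimal client-side AFR setup over two servers looks
roughly like this in the 1.3.x volfile syntax (hostnames and volume
names below are made up, and the exact options, transport-type in
particular, shift a bit between releases, so check the docs for the
version you deploy):

  volume server1
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.0.1      # first storage server
    option remote-subvolume brick       # volume name exported on that server
  end-volume

  volume server2
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.0.2      # second storage server
    option remote-subvolume brick
  end-volume

  volume mirror
    type cluster/afr
    subvolumes server1 server2          # writes go to both; reads can be steered
  end-volume

I believe the AFR option for pinning reads to a particular subvolume is
called read-subvolume, but double-check the option list for your
version.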
>If you could clarify the recommended approach, it would be great.

So here's a summary: if you do NOT need more than one server serving
the data (i.e., you're not going to replicate the data for DR purposes),
I'd recommend you avoid AFR in gluster and instead configure RAID0+1 on
your server. You'd be better off using a hardware RAID controller with
a large battery-backed cache, but you could use software RAID (like
LVM). If you had said you had a high-read environment, I'd have
suggested 2 servers using AFR over a private high-speed network, since
that reduces your points of failure; but given the high-write,
large-file environment, AFR may become a bottleneck. Again: if you NEED
server redundancy, then AFR is your best option, but if you don't need
it, it will just slow things down.

>This doesn't really answer your question, but hopefully it helps.
>
>Thanks again for your help.
>
>Regards.