Hi,

after a long time of hesitation, I have decided to give glusterfs a try. Perhaps not immediately, but in the next few weeks. What I have as a starting point is:

- some nodes with a spare disk (single partition, xfs-formatted, 750 GB each)
- Debian Etch
- the ability to build a few packages (lmello seems to be silent)

Now I'm looking for recommendations. To begin with, I'd like to put some emphasis on data availability, even at the expense of disk space, with the option to reduce the redundancy later. Note that I'm talking about quite a few TB, and I need to be able to find out which files have been damaged should there be a major hardware failure. (A full backup isn't possible, but I'd at least be able to fill in the missing pieces - I just have to know which ones they are.)

I've been thinking about the following:

- have sets of n machines, repeated m times
- make "mirrors" from the corresponding machines in each set, i.e. AFR over machines 1, (n+1), (2*n+1), ..., ((m-1)*n+1) and so on - giving me a kind of RAID-1 redundancy
- unify all these AFRs - resulting in a RAID-10-like setup
- instead of (block) striping, I'd favour round-robin scheduling, so that each file is written to the next AFR in turn
- if this "single-file rr" could be limited to a filename pattern, that would be a nice feature

(A rough volume-spec sketch of this layout is appended as a P.S. below.)

My questions:

(1) I have installed the "attr" package, and in a writable directory I can do the "setfattr" test described under "Best Practices" in the wiki (the commands I ran are in the P.P.S. below). May I safely assume that extended attributes won't be an issue then, and that I don't need a special mount or mkfs.xfs option?

(2) What about the namespace? This shows up from time to time, and if necessary I'd like to keep it on dedicated servers (with hardware RAID etc.), but still in an AFR setup.
(2a) How many ns volumes can I AFR without hurting performance?
(2b) What are the disk space requirements for a ns volume? Are there rules of thumb to derive them from file counts etc.?

(3) How does the setup outlined above scale to large values of n*m? (I'm thinking along the lines of n=200, m=3, with the option to drop to m=2, n=300.)
(3a) Are there setups in the wild with more than 100 storage/posix volumes, and what's your experience with such a large farm?

(4) What about fuse? Will the fuse module that comes with the latest kernel (2.6.25.4) do for a start?
(4a) Would it be possible to place the patched fuse kernel module under module-assistant's control, so I don't have to build a new package for the package repository each time the kernel gets updated?

and - somewhat unrelated:

(5) I can imagine converting a couple of (RAID-6) storage servers to glusterfs as well. These are already "mirrored" (by hand), and it should be easy to combine them into AFR pairs, then unify the AFRs and export the whole volume read-only (the files are owned by a special user). Are there detailed instructions on how to achieve this without data loss? Some time ago I found some hints on using "cpio" to duplicate the directory structure among the individual storage volumes - is this still necessary?
(5a) How do I add a ns volume in this case?

Oh well, that's quite a lot of questions. I'm not in a hurry yet :) so feel free to answer (part of) them when your time allows.

Thanks in advance,
 Steffen

-- 
Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
Cluster Admin * http://pandora.aei.mpg.de/merlin/ * http://www.aei.mpg.de/
* e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7233,fax:7298}
No Word/PPT mails - http://www.gnu.org/philosophy/no-word-attachments.html
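
P.S. To make the "RAID-10" idea above a bit more concrete, here is a rough, untested volume-spec sketch, with only the first AFR pair written out. All hostnames, volume names and paths are made up, and the exact option names may differ between releases, so please treat this as an illustration of the intended layout, not a working configuration.

On each storage node, the spare disk would be exported roughly like this:

  # server.vol on each storage node
  volume posix
    type storage/posix
    option directory /data/export     # the 750GB xfs partition
  end-volume

  volume server
    type protocol/server
    option transport-type tcp/server
    option auth.ip.posix.allow 10.0.0.*
    subvolumes posix
  end-volume

and on the clients, the bricks would be mirrored in pairs and then unified with the rr scheduler:

  # client.vol (m=2 shown; afr2 .. afr200 would follow the same pattern)
  volume node1
    type protocol/client
    option transport-type tcp/client
    option remote-host node1.example.org
    option remote-subvolume posix
  end-volume

  volume node201
    type protocol/client
    option transport-type tcp/client
    option remote-host node201.example.org
    option remote-subvolume posix
  end-volume

  # "RAID-1" layer: mirror corresponding machines of each set
  volume afr1
    type cluster/afr
    subvolumes node1 node201
  end-volume

  # namespace brick, imported from a dedicated server (ideally AFR'd as well)
  volume ns
    type protocol/client
    option transport-type tcp/client
    option remote-host nshost.example.org
    option remote-subvolume posix
  end-volume

  # "RAID-0"-ish layer: unify the AFRs, one whole file per brick, round-robin
  volume unify0
    type cluster/unify
    option namespace ns
    option scheduler rr
    subvolumes afr1
  end-volume

Whether the rr scheduler can be restricted by filename pattern, as wished for above, I don't know - that part is pure speculation on my side.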
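
P.P.S. Regarding question (1): what I actually ran on the xfs partition was roughly the following (the path is just an example, and I don't claim these are exactly the wiki's commands):

  cd /mnt/spare                                  # the xfs-formatted spare disk
  touch xattr-test
  setfattr -n user.comment -v "works" xattr-test
  getfattr -n user.comment xattr-test
  # glusterfs itself stores its metadata in trusted.* attributes, which need root:
  setfattr -n trusted.comment -v "works" xattr-test    # run as root
  getfattr -n trusted.comment xattr-test               # run as root

Both set/get pairs succeed here, which is why I'm asking whether anything beyond that is required.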