I'm new to Gluster, and have some questions

hsanson at gmail.com (Horacio Sanson) · Fri, 22 Oct 2010 09:55:11 +0900

I am just starting to play with Gluster, but I think I can give you some 
answers from my experience.

On Thursday 21 October 2010 17:09:32 Rudi Ahlers wrote:
> Hi all,
> 
> I'm considering setting up Gluster, and have a few questions if you don't
> mind.
> 
> 
> 1. Which option is better? I already have a few CentOS 5.5. server
> setup. Would it be better to just install GlusterFS, or to install
> Gluster Storage Platform from scratch? How / where can I see a full
> comparison between the 2? Are there any performance / management
> benefits in choosing the one of the other?
> 

The Gluster Storage Platform is built on GlusterFS. The platform is a complete 
OS (Fedora Linux) + GlusterFS + web management in a single package that can be 
installed via USB in a few minutes. It is supposed to simplify installation, 
setup and management of GlusterFS clusters, but I could not get it to work 
properly.

I was unable to add new servers. Every time I pressed the add new server button 
I got an error saying "Could not retrive installer ip address". And since the 
platform is relatively new, there is almost no documentation and there are 
hardly any issue reports about it. Also, servers and volumes added via the 
command line never showed up in the web-based GUI.

So I installed Ubuntu 10.10 LTS and built GlusterFS 3.1 from source, and 
handling the servers, volumes, etc. via the new command line is a breeze.
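
To give an idea of what the command line looks like, here is a rough sketch of adding servers to the pool. The host names and the add_peers helper are made up for the example; `gluster peer probe` and `gluster peer status` are the CLI commands as I understand them in 3.1. Setting GLUSTER="echo gluster" turns it into a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: host names are placeholders, add_peers is a hypothetical helper.
# Override GLUSTER (e.g. GLUSTER="echo gluster") to print instead of execute.
GLUSTER=${GLUSTER:-gluster}

# add_peers HOST...
# Probe each host from the current server so it joins the storage pool,
# then show the resulting peer list. Stops at the first failed probe.
add_peers() {
    for host in "$@"; do
        $GLUSTER peer probe "$host" || return 1
    done
    $GLUSTER peer status
}
```

Run on the first server, this would be called as `add_peers server2 server3`.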

> 2. I need reliability and speed. From what I understand, I could setup
> 2 servers to work similar to software RAID1 (mirroring). Is it also
> correct to assume that I could use 4 servers in a RAID10 / 1+0 type
> setup? But then obviously serverA & serverB will be mirrored, and
> serverC & serverD together? What happens to the data? Does it get
> filled randomly between the 2 sets of servers, or does it get put onto
> serverA & B first, till it's full then move over to C & D?
> 
I only have two servers for testing. What you set up are volumes, and each 
volume can be configured depending on your needs. This is what I understand so 
far:

Distributed volume: Aggregates the storage of several directories (bricks in 
Gluster terms) across several computers. The benefit is that you can 
grow/shrink the volume as you please. The downside is that this offers no 
performance or reliability guarantees, as each file lives on just one brick, 
chosen effectively at random.

Replicated volume: Requires a minimum of 2 bricks on separate servers. All 
files are replicated among the bricks. The number of replicas is configured at 
volume creation. Has all the benefits of a Distributed volume plus resilience 
to failures.

Stripe volume: Requires a minimum of 2 bricks on separate servers. All files 
are split into stripes and these stripes are distributed among the bricks of 
the volume. The number of stripes and their size are configured at volume 
creation. It aggregates capacity like a Distributed volume and can improve read 
performance for large files, as the read is spread over several machines. 
Unlike a Replicated volume, striping by itself adds no redundancy.
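
As a rough sketch, the three volume types, plus the four-server RAID 1+0 style layout from the question, could be created as below. The volume names, host names and brick paths (/export/...) are placeholders, and the helper functions are only there to keep the example tidy. The `volume create` syntax is the 3.1 one as far as I know. With `replica 2`, bricks are paired in the order they are listed, so server1/server2 form one mirror and server3/server4 the other, and files are then distributed across the two pairs rather than filling one pair first.

```shell
#!/bin/sh
# Sketch only: names, hosts and brick paths are placeholders.
# Override GLUSTER (e.g. GLUSTER="echo gluster") to print instead of execute.
GLUSTER=${GLUSTER:-gluster}

# Distributed: files are spread over the bricks, no redundancy.
create_distributed() {
    $GLUSTER volume create dist-vol transport tcp \
        server1:/export/brick0 server2:/export/brick0
}

# Replicated: every file is kept on both bricks.
create_replicated() {
    $GLUSTER volume create repl-vol replica 2 transport tcp \
        server1:/export/brick1 server2:/export/brick1
}

# Striped: each file is cut into stripes spread over both bricks.
create_striped() {
    $GLUSTER volume create stripe-vol stripe 2 transport tcp \
        server1:/export/brick2 server2:/export/brick2
}

# Distributed replicated (RAID 1+0 like): four bricks with replica 2 give
# two mirror pairs, (server1, server2) and (server3, server4).
create_dist_replicated() {
    $GLUSTER volume create dr-vol replica 2 transport tcp \
        server1:/export/brick3 server2:/export/brick3 \
        server3:/export/brick3 server4:/export/brick3
}

# A volume has to be started before clients can mount it.
start_volume() {
    $GLUSTER volume start "$1"
}
```

For example, `create_dist_replicated && start_volume dr-vol` on any server in the pool.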

> 3. Has anyone noticed any considerable differences in using 1x 1GB NIC
> & 2x 1GB NIC's bonded together? Or should I rather use a Quad port NIC
> if / where possible?
> 
> 4. How do clients (i.e. users) connect if I want to give them normal
> FTP / SMB / NFS access? Or do I need to mount the exported Gluster to
> another Linux server first which runs these services already?
> 
Gluster 3.1 has a native NFS v3 implementation, so you can mount any Gluster 
volume as a normal NFS mount. For SMB you need to configure Samba to share the 
volume, and you can easily access the files on any of the bricks via SCP or FTP 
if you have an SSH or FTP server configured. For Linux the recommended way is 
to use the native glusterfs client to mount the volume as a Gluster file system.

> 5. If there's 10 Gluster servers, for example, with a lot of data
> spread out across them. How do the clients connect, exactly? I.e. do
> they all connect to a central server which then just "fetches and
> delivers" the content to the clients, or do the client's connect
> directly to the specific server where their content is? i.e. is the
> network traffic split evenly across the servers, according to where
> the data is stored?
> 
This is also something I would like to know. When connecting clients I use the 
command

   mount -t [nfs|glusterfs]  <ip-address>:<volume-name> /mount/point

where ip-address is the IP of any of the servers that have the volume 
configured. It is not clear to me how the reliability part works here. If I 
disconnect the server with that ip-address I lose access to the files. True, 
the files are still accessible via the other servers, but I need to manually 
point the mount at another server, which is not exactly high availability.
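
One workaround I can think of, at least for the moment of mounting, is to try a list of servers in turn instead of hard-coding one. This is only a sketch: try_mount and the MOUNT_CMD override are made up for the example, and the server names are placeholders. It does not help once an already mounted server goes away.

```shell
#!/bin/sh
# Sketch only: try_mount is a hypothetical helper, not part of Gluster.
# MOUNT_CMD can be overridden for testing; it defaults to the real mount.

# try_mount VOLUME MOUNTPOINT SERVER...
# Try each server in order and stop at the first successful mount.
# Prints the server that worked; returns 1 if none of them did.
try_mount() {
    volume=$1
    mountpoint=$2
    shift 2
    for server in "$@"; do
        if ${MOUNT_CMD:-mount} -t glusterfs "$server:/$volume" "$mountpoint"; then
            echo "$server"
            return 0
        fi
    done
    return 1
}
```

Usage would be something like `try_mount test-volume /mnt/gluster server1 server2 server3`, run as root.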

> tia :)

-- 
regards,
Horacio Sanson