gluster-users-bounces at gluster.org wrote on 01/24/2012 08:04:52 AM:

> On Mon, Jan 23, 2012 at 03:54:45PM -0600, Greg_Swift at aotx.uscourts.gov wrote:
> > It's been talked about a few times on the list in the abstract, but I can
> > give you one lesson learned from our environment.
> >
> > The volume-to-brick ratio is a sliding scale: you can have more of one,
> > but then you need to have less of the other.
>
> This is interesting, because the examples aren't entirely clear in the
> documentation. At
> http://download.gluster.com/pub/gluster/glusterfs/3.2/Documentation/IG/html/sect-Installation_Guide-Installing-Source.html
> it says:
>
> "Note
>
> You need one open port, starting at 38465 and incrementing sequentially for
> each Gluster storage server, and one port, starting at 24009, for each
> brick. This example opens enough ports for 20 storage servers and three
> bricks."
>
> [presumably that means three bricks *per server*?]
>
> with this example:
>
> $ iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24011 -j ACCEPT
> $ iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 111 -j ACCEPT
> $ iptables -A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 111 -j ACCEPT
> $ iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 38465:38485 -j ACCEPT
> $ service iptables save
> $ service iptables restart
>
> So there's one range for bricks, and one range for servers (here they seem
> to have allowed enough for 21 servers).
>
> Now you point out that the number of volumes needs to be considered as well,
> which makes sense if each brick can only belong to one volume.
>
> > 24 bricks per node per volume
> > 100 volumes
> > ---------
> > = 2400 running processes and 2400 ports per node
>
> So that's 2400 bricks per node.
>
> It seems to me there are a couple of ways I could achieve this:
>
> (1) The drive mounted on /mnt/sda1 could be exported as 100 bricks:
>       /mnt/sda1/chunk1
>       /mnt/sda1/chunk2
>       ...
>       /mnt/sda1/chunk100
>     and repeated for each of the other 23 disks in that node.
>
> If I build the first volume comprising all the chunk1's, the second volume
> comprising all the chunk2's, and so on, then I'd have 100 volumes across all
> the disks. Furthermore, I think this would allow each volume to grow as much
> as it wanted, up to the total space available. Is that right?

Yes, you understand it correctly. But take into consideration that 2400 is a
lot. We had a hard time running at 1600 and had to drop all the way back to
200ish. We did not test anything in between. I was not encouraging the 2400
count :)

> (2) I could organise the storage on each server into a single RAID block,
> and then divide it into 2400 partitions, say 2400 LVM logical volumes.
>
> Then the bricks would have to be of an initial fixed size, and each volume
> would not be able to outgrow its allocation without resizing its bricks'
> filesystems (e.g. by growing the LVM volumes). Resizing a volume would be
> slow and painful.
>
> Neither looks convenient to manage, but (2) seems worse.

(2) is definitely worse.

> > More processes/ports means more potential for ports in use, connectivity
> > issues, file use limits (ulimits), etc.
> >
> > That's not the only thing to keep in mind, but it's a poorly documented
> > one that burned me so :)
>
> So if you don't mind me asking, what was your solution? Did you need large
> numbers of volumes in your application?

We have to have a large number of volumes (~200).
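For what it's worth, to make option (1) concrete, the volume layout would look
something like this with the gluster CLI (the hostnames, mount points, and the
plain distribute layout here are made up for illustration, not our actual
config):

$ gluster volume create vol1 server1:/mnt/sda1/chunk1 server1:/mnt/sdb1/chunk1 \
      server2:/mnt/sda1/chunk1 server2:/mnt/sdb1/chunk1
$ gluster volume start vol1
$ gluster volume create vol2 server1:/mnt/sda1/chunk2 server1:/mnt/sdb1/chunk2 \
      server2:/mnt/sda1/chunk2 server2:/mnt/sdb1/chunk2
$ gluster volume start vol2

Every brick you list gets its own glusterfsd process and its own TCP port
counting up from 24009, so at 24 disks x 100 volumes the firewall range from
the doc example would have to grow from 24007:24011 to roughly 24007:26408
per node (24007/24008 for the daemons plus 2400 brick ports), on top of the
per-server 38465 range from the note above.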
Quick run-down to give context. Our nodes would have around 128TB of local
storage from several 32TB raid sets. We started with ext4, so we had a 16TB
maximum file system size. So we broke the storage down into nice even chunks
of 16TB, thus 8 file systems per node. Our first attempt was ~200 volumes all
using the 8 bricks per node (thus 1600 processes/ports), so that we did not
have to concern ourselves as much with disk utilization, by giving every
volume every disk. We had issues, and Gluster recommended reducing our
process/port count. First we dropped down to using only 1 brick per volume
per node, but this left us in a scenario of managing growth, which we were
trying to avoid (it could get very messy very fast with many volumes all
configured differently). So we decided to move to XFS to reduce from 8
partitions down to 2 LVs of 64TB each (consistency was a driving factor,
though sadly that didn't hold). We were lucky to have only a few systems on
the solution at the time, and we moved them over onto the first set of disks,
allowing us to create an LV over one set of partitions. We formatted it with
XFS and started putting the new volumes on it. Unfortunately the ramp-up at
this time made it difficult for us to move the data from the existing ext4
volumes to the XFS volumes. We then ran into some performance issues and
found we had not tuned the XFS enough, which also deterred us from pushing
forward with the move. This left us with 5 bricks per node: 4 smaller ext4
file systems and 1 large XFS file system. We still have to manage disk
utilization, but we are working towards resolving that as well. Our plan is
still to consolidate as much as possible, to reduce both the gluster
process/port overhead and the disk utilization administration overhead.

> Aside: it will be interesting to see how gluster 3.3's object storage API
> handles this (from the documentation, it looks like you can create many
> containers within the same volume)

The subvolume or container concept is something we've discussed with Gluster,
and we are going to look into it. I am very curious as well.

> The other concern I have regarding making individual drives be bricks is how
> to handle drive failures and replacements.
> ..snip..
>
> Is that correct, or have I misunderstood? Is there some other way to fail or
> disable a single brick or drive, whilst still leaving access to its replica
> partner?

We are not using the replica feature, so I am not sure what the answers to
this would be.

-greg
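P.S. In case it helps anyone picture the consolidation step: collapsing a set
of disks into one big XFS brick with LVM looks roughly like the following. The
device names, volume group name, and mkfs/mount options are only illustrative
defaults people commonly start from, not our actual tuning (which, as noted
above, we still haven't finished sorting out).

$ pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
$ vgcreate vg_bricks /dev/sdb /dev/sdc /dev/sdd /dev/sde
$ lvcreate -l 100%FREE -n brick1 vg_bricks
$ mkfs.xfs -i size=512 /dev/vg_bricks/brick1    # larger inodes leave room for gluster's xattrs
$ mkdir -p /export/brick1
$ mount -o inode64 /dev/vg_bricks/brick1 /export/brick1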