gluster-users-bounces at gluster.org wrote on 01/24/2012 08:04:52 AM:

> On Mon, Jan 23, 2012 at 03:54:45PM -0600, Greg_Swift at aotx.uscourts.gov wrote:
> > It's been talked about a few times on the list in the abstract, but I can
> > give you one lesson learned from our environment.
> >
> > The volume-to-brick ratio is a sliding scale: you can have more of one,
> > but then you need to have less of the other.
>
> This is interesting, because the examples aren't entirely clear in the
> documentation. At
> http://download.gluster.com/pub/gluster/glusterfs/3.2/Documentation/IG/html/sect-Installation_Guide-Installing-Source.html
> it says:
>
> "Note
>
> You need one open port, starting at 38465 and incrementing sequentially for
> each Gluster storage server, and one port, starting at 24009, for each
> brick. This example opens enough ports for 20 storage servers and three
> bricks."
>
> [presumably that means three bricks *per server*?]
>
> with this example:
>
> $ iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24011 -j ACCEPT
> $ iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 111 -j ACCEPT
> $ iptables -A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 111 -j ACCEPT
> $ iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 38465:38485 -j ACCEPT
> $ service iptables save
> $ service iptables restart
>
> So there's one range for bricks, and one range for servers (here they seem
> to have allowed enough for 21 servers).
>
> Now you point out that the number of volumes needs to be considered as well,
> which makes sense if each brick can only belong to one volume.
>
> > 24 bricks per node per volume
> > 100 volumes
> > ---------
> > = 2400 running processes and 2400 ports per node
>
> So that's 2400 bricks per node.
>
> It seems to me there are a couple of ways I could achieve this:
>
> (1) The drive mounted on /mnt/sda1 could be exported as 100 bricks:
>       /mnt/sda1/chunk1
>       /mnt/sda1/chunk2
>       ...
>       /mnt/sda1/chunk100
>     and repeated for each of the other 23 disks in that node.
>
> If I build the first volume comprising all the chunk1's, the second volume
> comprising all the chunk2's, and so on, then I'd have 100 volumes across all
> the disks. Furthermore, I think this would allow each volume to grow as much
> as it wanted, up to the total space available. Is that right?

Yes, you understand it correctly. But take into consideration that 2400 is a
lot. We had a hard time running at 1600 and had to drop all the way back to
200ish. We did not test anything in between. I was not encouraging the 2400
count :)

> (2) I could organise the storage on each server into a single RAID block,
> and then divide it into 2400 partitions, say 2400 LVM logical volumes.
>
> Then the bricks would have to be of an initial fixed size, and each volume
> would not be able to outgrow its allocation without resizing its bricks'
> filesystems (e.g. by growing the LVM volumes). Resizing a volume would be
> slow and painful.
>
> Neither looks convenient to manage, but (2) seems worse.

(2) is definitely worse.

> > More processes/ports means more potential for ports in use, connectivity
> > issues, file use limits (ulimits), etc.
> >
> > That's not the only thing to keep in mind, but it's a poorly documented
> > one that burned me so :)
>
> So if you don't mind me asking, what was your solution? Did you need large
> numbers of volumes in your application?

We have to have a large number of volumes (~200).
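For what it's worth, to make option (1) concrete, the volume layout would look
something like this with the gluster CLI (the hostnames, mount points, and the
plain distribute layout here are made up for illustration, not our actual
config):

$ gluster volume create vol1 server1:/mnt/sda1/chunk1 server1:/mnt/sdb1/chunk1 \
      server2:/mnt/sda1/chunk1 server2:/mnt/sdb1/chunk1
$ gluster volume start vol1
$ gluster volume create vol2 server1:/mnt/sda1/chunk2 server1:/mnt/sdb1/chunk2 \
      server2:/mnt/sda1/chunk2 server2:/mnt/sdb1/chunk2
$ gluster volume start vol2

Every brick you list gets its own glusterfsd process and its own TCP port
counting up from 24009, so at 24 disks x 100 volumes the firewall range from
the doc example would have to grow from 24007:24011 to roughly 24007:26408
per node (24007/24008 for the daemons plus 2400 brick ports), on top of the
per-server 38465 range from the note above.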
Quick run-down to give context. Our nodes would have around 128TB of local
storage from several 32TB raid sets. We started with ext4, so we had a 16TB
maximum file system size. So we broke the storage down into nice even chunks
of 16TB, thus 8 file systems per node. Our first attempt was ~200 volumes all
using the 8 bricks per node (thus 1600 processes/ports), so that we did not
have to concern ourselves as much with disk utilization, by giving every
volume every disk. We had issues, and Gluster recommended reducing our
process/port count. First we dropped down to using only 1 brick per volume
per node, but this left us in a scenario of managing growth, which we were
trying to avoid (it could get very messy very fast with many volumes all
configured differently). So we decided to move to XFS to reduce from 8
partitions down to 2 LVs of 64TB each (consistency was a driving factor,
though sadly that didn't hold). We were lucky to have only a few systems on
the solution at the time, and we moved them over onto the first set of disks,
allowing us to create an LV over one set of partitions. We formatted it with
XFS and started putting the new volumes on it. Unfortunately the ramp-up at
this time made it difficult for us to move the data from the existing ext4
volumes to the XFS volumes. We then ran into some performance issues and
found we had not tuned the XFS enough, which also deterred us from pushing
forward with the move. This left us with 5 bricks per node: 4 smaller ext4
file systems and 1 large XFS file system. We still have to manage disk
utilization, but we are working towards resolving that as well. Our plan is
still to consolidate as much as possible, to reduce both the gluster
process/port overhead and the disk utilization administration overhead.

> Aside: it will be interesting to see how gluster 3.3's object storage API
> handles this (from the documentation, it looks like you can create many
> containers within the same volume)

The subvolume or container concept is something we've discussed with Gluster,
and we are going to look into it. I am very curious as well.

> The other concern I have regarding making individual drives be bricks is how
> to handle drive failures and replacements.
> ..snip..
>
> Is that correct, or have I misunderstood? Is there some other way to fail or
> disable a single brick or drive, whilst still leaving access to its replica
> partner?

We are not using the replica feature, so I am not sure what the answers to
this would be.

-greg
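P.S. In case it helps anyone picture the consolidation step: collapsing a set
of disks into one big XFS brick with LVM looks roughly like the following. The
device names, volume group name, and mkfs/mount options are only illustrative
defaults people commonly start from, not our actual tuning (which, as noted
above, we still haven't finished sorting out).

$ pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
$ vgcreate vg_bricks /dev/sdb /dev/sdc /dev/sdd /dev/sde
$ lvcreate -l 100%FREE -n brick1 vg_bricks
$ mkfs.xfs -i size=512 /dev/vg_bricks/brick1    # larger inodes leave room for gluster's xattrs
$ mkdir -p /export/brick1
$ mount -o inode64 /dev/vg_bricks/brick1 /export/brick1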