Thanks a lot for your insightful reply; it really clarifies things. I
think the DHT-AFR-DHT configuration makes a lot of sense.

- Wei Dong

Harald Stürzebecher wrote:
> Hi,
>
> 2009/7/28 Wei Dong <wdong.pku at gmail.com>:
>> Hi All,
>>
>> We've been using GlusterFS 2.0.1 on our lab cluster to host a large
>> number of small images for distributed processing with Hadoop, and
>> it has been working fine without human intervention for a couple of
>> months. Thanks for the wonderful project -- it's the only freely
>> available cluster filesystem that fits our needs.
>>
>> What keeps bothering me is the extremely high flexibility of
>> GlusterFS. There are simply so many ways to achieve the same goal
>> that I don't know which one is best. So I'm writing to ask whether
>> there are general configuration guidelines for improving both data
>> safety and performance.
>
> AFAIK, there are some general guidelines in the GlusterFS
> documentation. IMHO, sometimes it takes careful reading or some
> experimentation to find them. Some examples have been discussed on
> the mailing list.
>
>> Specifically, we have 66 machines (in two racks) with 4 x 1.5TB
>> disks per machine. We want to aggregate all the available disk
>> space into a single shared directory with 3 replicas. Following are
>> some of the potential configurations.
>>
>> * Each node exports 4 directories, so the clients see 66 x 4 = 264
>> directories. We first group those directories into threes with AFR,
>> making 88 replicated directories, and then aggregate them with DHT.
>> When configuring AFR, we can either put the three replicas on
>> different machines, or two on the same machine and the third on
>> another machine.
>
> I'd put the three replicas on three different machines - three
> machines are less likely to fail than just two.
>
> One setup on my list of setups to evaluate would be a DHT - AFR - DHT
> configuration:
> - aggregate the four disks on each server into a single volume and
>   export only that volume;
> - on the clients, group those 66 volumes into threes with AFR and
>   aggregate the 22 groups with DHT.
> That would reduce the client config file from 264 imported volumes
> to 66, reducing both the complexity of the configuration and the
> number of open connections.
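To make sure I understand the DHT-AFR-DHT idea, here is roughly how I
would write the 2.0 volfiles. This is only a sketch, not a tested
config: the host names, volume names and paths are made up, only the
first few of the 66 nodes and 22 replicate groups are shown, and I
have left out the locks and performance translators for brevity.

  # --- server volfile on each node (e.g. node01): DHT over 4 disks ---
  volume disk1
    type storage/posix
    option directory /export/disk1   # example path
  end-volume
  # ... disk2, disk3 and disk4 defined the same way ...

  volume local-dht
    type cluster/distribute          # DHT across this node's 4 disks
    subvolumes disk1 disk2 disk3 disk4
  end-volume

  volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.local-dht.allow *   # tighten this in production
    subvolumes local-dht
  end-volume

  # --- client volfile: AFR in threes, then DHT over the 22 groups ---
  volume node01
    type protocol/client
    option transport-type tcp
    option remote-host node01        # example host name
    option remote-subvolume local-dht
  end-volume
  # ... node02 through node66 defined the same way ...

  volume afr01
    type cluster/replicate           # AFR over three different machines
    subvolumes node01 node02 node03
  end-volume
  # ... afr02 through afr22 over the remaining nodes ...

  volume global
    type cluster/distribute          # DHT over the replicate groups
    subvolumes afr01 afr02 afr03     # list all 22 groups here
  end-volume

Does that match what you had in mind?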
>> * Each node first aggregates three disks (forget about the 4th for
>> simplicity) and exports a replicated directory. The client side then
>> aggregates the 66 replicated directories into one.
>
> That might mean that access to some of the data is lost if one node
> fails - not what I'd accept from a replicated setup.
>
>> * When grouping the aggregated directories on the client side, we
>> can use some kind of hierarchy. For example, the 66 directories are
>> first aggregated into groups of N each with DHT, and then the 66/N
>> groups are again aggregated with DHT.
>
> Doesn't that just make the setup more complicated?
>
>> * We don't do the grouping on the client side. Rather, we use some
>> intermediate server to first aggregate small groups of directories
>> with DHT and export them as a single directory.
>
> The network connection of the intermediate server might become a
> bottleneck, limiting performance.
> The intermediate server might become a single point of failure.
>
>> * We can also put AFR after DHT.
>> ......
>>
>> To make things more complicated, the 66 machines are split between
>> two racks with only a 4-gigabit inter-rack connection, so not all
>> the directories exported by the servers are equally close to a
>> particular client.
>
> A workaround might be to create two intermediate volumes that each
> perform better when accessed from one of the racks, and use NUFA to
> create the single volume.
>
> Keeping replicated data local to one rack would improve performance,
> but the failure of one complete rack (e.g. power line failure,
> inter-rack networking) would block access to half of your data.
>
> Getting a third rack and a much faster inter-rack connection would
> improve performance and protect better against failures - just place
> the three copies of a file on different racks. ;-)
>
>> I'm wondering if someone on the mailing list could provide me with
>> some advice.
>
> Plan, build, test ... repeat until satisfied :-)
>
> Optional: share your solution, with benchmarks.
>
> IMHO, there won't be a single "best" solution.
>
> Harald Stürzebecher
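P.S. If I understand the NUFA workaround correctly, the client volfile
on a rack-1 machine would look roughly like the fragment below. Again
a guess, not a tested config: the volume names are made up, the
cluster/nufa type and its local-volume-name option are my reading of
the translator docs (please correct me if the syntax is off), and the
afr-r1-* / afr-r2-* volumes are assumed to be replicate groups, as in
the sketch above, kept entirely within one rack.

  # two intermediate volumes, one per rack (hypothetical names)
  volume rack1
    type cluster/distribute
    subvolumes afr-r1-01 afr-r1-02   # ... all rack-1 replicate groups
  end-volume

  volume rack2
    type cluster/distribute
    subvolumes afr-r2-01 afr-r2-02   # ... all rack-2 replicate groups
  end-volume

  volume global
    type cluster/nufa
    option local-volume-name rack1   # on rack-2 clients, use rack2 here
    subvolumes rack1 rack2
  end-volume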