On Jan 3, 2008 3:09 AM, Kevan Benson <kbenson@xxxxxxxxxxxxxxx> wrote:
> LI Daobing wrote:
> > 4. a new AFR model:
> >
> > Currently, if the AFR has 3 child xlators, and each xlator connects to
> > a distinct machine, then the write speed of this AFR is only 33% of the
> > capacity of the network.
> >
> > Consider a different model: the AFR sends data to machine1, machine1
> > sends the data to machine2 immediately and then writes the data to
> > disk. Machine2 also sends the data to machine3 immediately and then
> > writes the data to disk. Under this model, we can increase the write
> > speed to 3 times that of the previous model (if your switch is good
> > enough and your network supports full duplex).
>
> This can be achieved through multiple AFR definitions, different for
> each server, that chain the data.  I've discussed it once or twice on
> the list before, but never implemented (time doesn't permit me to follow
> up on lots of ideas).  No need for a new translator, which would be
> complicated (see below).

In your model, if a middle node stops working, then all the following
nodes stop working too (is that right?). I think this is very dangerous
for afr.

Also, there is a comment near the end of the definition of
afr_sync_ownership_permission. This comment says that afr on afr won't
work. This function is triggered by afr_lookup_cbk when self_heal is
needed, and self_heal is very important for afr.

Can anyone help clarify whether afr on afr has a problem?

> > In more detail, we need two new kinds of xlators. The first one is the
> > combination of the AFR and client-protocol (called *safr*). The second
> > one is similar to the `server-protocol' (called *sserver*).
> >
> > Machine1, machine2 and machine3 are set in the options of safr, and safr
> > maintains an active-machine list. When safr receives a writev command (or
> > other commands), it picks a machine from the active-machine list (for
> > example, machine1), then sends the data and a list "[machine2, machine3]"
> > to machine1.
> > Machine1 forwards the data and the list "[machine3]" to machine2
> > immediately; machine2 also forwards the data and an empty list to
> > machine3. Machine1, machine2 and machine3 also write the data to disk
> > while sending it.
> >
> > If any machine is down, the afr just removes it from the active-machine
> > list, and adds it back when the machine is up again.
> >
> > This model is a little far from the current framework, but I think it's
> > a good idea to write at 100 MB/s instead of 30+ MB/s in a gigabit
> > network.
>
> This all assumes a lot about the target network, which is a bad thing
> when trying to be as flexible as GlusterFS.  For example, machine1 and
> machine2 may not have an optimized path to each other.  They may be in
> different subnets, off different switches, or in different geographical
> locations, etc.
>
> The only thing your translators provide that isn't already available
> through chained translators is automatic reconfiguration of the chain
> members when a server drops out, which is a good feature, but I would
> rather just add cheap redundant hardware to boost speed, such as extra
> gigabit NICs and switches to allow dedicated paths between select
> systems.  Also, maybe the new switch translator can be added to what's
> already available to achieve what you want, I'm still fuzzy on exactly
> what it can be used for.

It's a good idea to buy more and better hardware. But it's better if we
can achieve this in software. :)

> > This model is similar to the model in the Google File System; you can
> > check figure 2 in the Google File System paper[1]. I put a copy of
> > this figure at [2].
> >
> > [1] http://labs.google.com/papers/gfs-sosp2003.pdf
> > [2] http://picasaweb.google.com/lidaobing/Public/photo#5150803669289886370
>
> Google has an advantage of designing for a specific usage, which makes
> some design choices better for them than for other projects.
>
> > PS, should I copy this feature request to wiki? Or it's ok to only put
> > it here?
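
To make the forwarding scheme quoted above concrete, here is a minimal
Python sketch of it. It is only an illustration, not GlusterFS code: the
names Machine, active_machines, chain_write and write_rate are made up
for this example, the machines are plain in-memory objects, and the
throughput function assumes the client's own link is the only bottleneck
(full duplex, good switch). It does not model a machine failing in the
middle of a write, which is the dangerous case mentioned earlier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Machine:
    """Hypothetical stand-in for one server running the proposed sserver."""
    name: str
    up: bool = True
    disk: List[bytes] = field(default_factory=list)


def active_machines(machines: Dict[str, Machine]) -> List[str]:
    """The active-machine list safr would keep: only machines that are up."""
    return [name for name, m in machines.items() if m.up]


def chain_write(machines: Dict[str, Machine],
                data: bytes) -> List[Tuple[str, List[str]]]:
    """Simulate one write; return (receiver, forward list it was given) per hop."""
    hops = []
    pending = active_machines(machines)
    while pending:
        # The head of the list receives the data plus the rest of the list.
        receiver, pending = pending[0], pending[1:]
        hops.append((receiver, list(pending)))
        # It forwards to the next machine (next loop turn) and writes its copy.
        machines[receiver].disk.append(data)
    return hops


def write_rate(link_mb_s: float, replicas: int, chained: bool) -> float:
    """Client write rate in MB/s, assuming the client link is the bottleneck."""
    # Fan-out: the client sends every byte `replicas` times over one link.
    # Chain: the client sends every byte once; servers relay it onward.
    return link_mb_s if chained else link_mb_s / replicas
```

With three machines up, chain_write hands machine1 the list
[machine2, machine3], machine2 the list [machine3], and machine3 an empty
list. Marking machine2 as down shortens the chain to machine1 and
machine3. For a link of roughly 100 MB/s and 3 replicas, write_rate gives
about 33 MB/s for fan-out and 100 MB/s for the chain, which matches the
figures in the proposal.
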
>
> OK, now that I've done my best to tear down your proposal and say why
> it's not needed, here's where I put my disclaimer:
>
> 1) I'm not a dev, and I haven't really looked into the code, so I don't
> know how easy or hard your proposal is to actually implement.
> 2) I'm just one person, and even though *I* may think it's not needed,
> others may differ on this point.
>

Thanks for your comment.

-- 
Best Regards,
 LI Daobing