Re: Feature requests of glusterfs

"LI Daobing" <lidaobing@xxxxxxxxx> · Thu, 3 Jan 2008 10:16:10 +0800

On Jan 3, 2008 3:09 AM, Kevan Benson <kbenson@xxxxxxxxxxxxxxx> wrote:
> LI Daobing wrote:
> > 4. a new AFR model:
> >
> > Currently, if an AFR has 3 child xlators and each xlator connects to
> > a distinct machine, the write speed of this AFR is only 33% of the
> > capacity of the network.
> >
> > Consider a different model: the AFR sends data to machine1, and
> > machine1 forwards the data to machine2 immediately and then writes
> > it to disk. Machine2 likewise forwards the data to machine3
> > immediately and then writes it to disk. Under this model, the write
> > speed can be up to 3 times that of the current model (if your switch
> > is good enough and your network supports full duplex).
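
A rough back-of-the-envelope check of the 33% and 3x figures, as a
sketch only. It assumes the client's own link is the bottleneck, that
every link is full duplex with the same usable rate, and it ignores
latency, protocol overhead and disk speed. The function names and the
100 MB/s usable rate for gigabit Ethernet are illustrative assumptions,
not measurements.

```python
def fanout_write_rate(link_mb_s: float, replicas: int) -> float:
    """Client sends every write to each replica itself (current AFR model)."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The client's uplink has to carry `replicas` copies of the data.
    return link_mb_s / replicas


def chained_write_rate(link_mb_s: float, replicas: int) -> float:
    """Client sends one copy; each server forwards it to the next one."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    # Every link carries a single copy, so the rate does not depend on
    # the replica count (under the assumptions stated above).
    return link_mb_s


if __name__ == "__main__":
    link = 100.0  # assumed usable MB/s on a gigabit link
    for n in (1, 2, 3):
        print(n, round(fanout_write_rate(link, n), 1),
              chained_write_rate(link, n))
```

With three replicas this gives about a third of the link rate for the
fan-out case and the full link rate for the chained case, which is
where the "3 times" comes from.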
>
> This can be achieved through multiple AFR definitions, different for
> each server, that chain the data.  I've discussed it once or twice on
> the list before, but never implemented (time doesn't permit me to follow
> up on lots of ideas).  No need for a new translator, which would be
> complicated (see below).

In your model, if a node in the middle of the chain goes down, all the
nodes after it stop receiving data as well (is that right?). I think
this is very dangerous for AFR.
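
A toy sketch of that concern, assuming each server has one fixed next
hop and nothing reroutes around a dead node. Whether a chained-AFR
setup really behaves this way is exactly the open question here; the
function and machine names are made up for illustration.

```python
from typing import Dict, Iterable, List


def replicate_static_chain(chain: List[str], down: Iterable[str],
                           data: bytes,
                           disks: Dict[str, List[bytes]]) -> None:
    """Pass `data` along a fixed chain, stopping at the first dead node."""
    dead = set(down)
    for name in chain:
        if name in dead:
            # This node cannot store or forward the data, so every node
            # after it in the chain never sees the write.
            break
        disks[name].append(data)


if __name__ == "__main__":
    chain = ["machine1", "machine2", "machine3"]
    disks = {name: [] for name in chain}
    replicate_static_chain(chain, ["machine2"], b"block", disks)
    print(disks)
```

In this toy run only machine1 stores the block: machine3 is healthy
but still misses the write because machine2 sits in front of it.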

In addition, there is a comment near the end of the definition of
afr_sync_ownership_permission which says that AFR on top of AFR won't
work. This function is called from afr_lookup_cbk when self_heal is
needed, and self_heal is very important for AFR.

Can anyone help clarify whether AFR on top of AFR has a problem?

>
> > In more detail, we need two new kinds of xlators. The first is a
> > combination of AFR and client-protocol (called *safr*). The second
> > is similar to `server-protocol' (called *sserver*).
> >
> > machine1, machine2 and machine3 are set in the options of safr, and
> > safr maintains an active-machine list. When safr receives a writev
> > command (or another command), it picks a machine from the
> > active-machine list (for example, machine1), then sends the data and
> > the list "[machine2, machine3]" to machine1. machine1 immediately
> > forwards the data and the list "[machine3]" to machine2, and machine2
> > likewise forwards the data and an empty list to machine3. machine1,
> > machine2 and machine3 each write the data to disk while forwarding
> > it.
> >
> > If any machine is down, safr just removes it from the active-machine
> > list, and adds it back when the machine is up again.
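
The forwarding scheme and the active-machine list can be sketched as
below. This is an in-memory toy, not GlusterFS code: the class names
SServer and Safr, their methods, and the use of plain lists as "disks"
are all assumptions made for illustration. Forwarding is done before
the local write to mirror the order described above; a real server
would overlap the two. Self-heal for a machine that rejoins is not
modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SServer:
    """Hypothetical stand-in for one machine running the proposed sserver."""
    name: str
    disk: List[bytes] = field(default_factory=list)

    def handle_write(self, data: bytes, remaining: List[str],
                     cluster: Dict[str, "SServer"]) -> None:
        # Forward to the next machine first, passing on the shortened list.
        if remaining:
            cluster[remaining[0]].handle_write(data, remaining[1:], cluster)
        # Then store the data locally.
        self.disk.append(data)


class Safr:
    """Hypothetical client-side xlator holding the active-machine list."""

    def __init__(self, cluster: Dict[str, SServer]) -> None:
        self.cluster = cluster
        self.active: List[str] = list(cluster)

    def mark_down(self, name: str) -> None:
        if name in self.active:
            self.active.remove(name)

    def mark_up(self, name: str) -> None:
        # Writes missed while the machine was down are NOT replayed here.
        if name in self.cluster and name not in self.active:
            self.active.append(name)

    def writev(self, data: bytes) -> None:
        if not self.active:
            raise RuntimeError("no active machines")
        # Send the data once, together with the list of remaining machines.
        head, rest = self.active[0], self.active[1:]
        self.cluster[head].handle_write(data, rest, self.cluster)
```

For example, with three machines, calling writev(b"a"), then
mark_down("machine2"), then writev(b"b") leaves both blocks on
machine1 and machine3 and only the first block on machine2.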
> >
> > This model is a little far from the current framework, but I think
> > it is worth it to be able to write at 100 MB/s instead of 30+ MB/s
> > on a gigabit network.
>
> This all assumes a lot about the target network, which is a bad thing
> when trying to be as flexible as GlusterFS.  For example, machine1 and
> machine2 may not have an optimized path to each other.  They may be in
> different subnets, off different switches, or in different geographical
> locations, etc.
>
> The only thing your translators provide that isn't already available
> through chained translators is automatic reconfiguration of the chain
> members when a server drops out, which is a good feature, but I would
> rather just add cheap redundant hardware to boost speed, such as extra
> gigabit NICs and switches to allow dedicated paths between select
> systems.  Also, maybe the new switch translator can be added to what's
> already available to achieve what you want; I'm still fuzzy on exactly
> what it can be used for.

It's a good idea to buy more and better hardware. But it's better if
we can achieve this in software. :)

>
> > This model is similar to the one used in the Google File System; see
> > figure 2 in the Google File System paper [1]. I put a copy of this
> > figure at [2].
> >
> > [1] http://labs.google.com/papers/gfs-sosp2003.pdf
> > [2] http://picasaweb.google.com/lidaobing/Public/photo#5150803669289886370
>
> Google has an advantage of designing for a specific usage, which makes
> some design choices better for them than for other projects.
>
> > PS, should I copy this feature request to the wiki, or is it OK to
> > only put it here?
>
> OK, now that I've done my best to tear down your proposal and say why
> it's not needed, here's where I put my disclaimer:
>
> 1) I'm not a dev, and I haven't really looked into the code, so I don't
> know how easy or hard your proposal is to actually implement.
> 2) I'm just one person, and even though *I* may think it's not needed,
> others may differ on this point.
>

Thanks for your comment.

-- 
Best Regards,
 LI Daobing