Actually, I want to clarify this point;

> But the problem today is that replicate (and
> self-heal) does not understand "partial failure"
> of its subvolumes. If one of the subvolumes of
> replicate is a distribute, then today's replicate
> only understands complete failure of the
> distribute set or it assumes everything is
> completely fine.

I haven't seen this in practice .. I have seen replicate
attempt to repair anything that was "missing", and both
the replicate and the underlying bricks remained viable
storage layers in that process ...


----- Original Message -----
>From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
>To: "Anand Avati" <anand.avati@xxxxxxxxx>
>Subject: Re: ZkFarmer
>Date: Fri, 11 May 2012 09:39:58 +1000
>
> > > Sure, I have my own vol files that do (did) what I wanted
> > > and I was supporting myself (and users); the question
> > > (and the point) is what is the GlusterFS *intent*?
> >
> > The "intent" (more or less - I hate to use the word as it
> > can imply a commitment to what I am about to say, but there
> > isn't one) is to keep the bricks (server process) dumb and
> > have the intelligence on the client side. This is a "rough
> > goal". There are cases where replication on the server side
> > is inevitable (in the case of NFS access) but we keep the
> > software architecture undisturbed by running a client
> > process on the server machine to achieve it.
>
> [There's a difference between intent and plan/roadmap]
>
> Okay. Unfortunately I am unable to leverage this - I tried
> to serve a Fuse->GlusterFS client mount point (of a
> Distribute volume) as a GlusterFS posix brick (for a
> Replicate volume) and it wouldn't play ball ..
>
> > We do plan to support "replication on the server" in the
> > future while still retaining the existing software
> > architecture as much as possible. This is particularly
> > useful in Hadoop environments where the jobs expect the
> > write performance of a single copy and expect the copy to
> > happen in the background. We have the proactive self-heal
> > daemon running on the server machines now (which again is
> > a client process which happens to be physically placed on
> > the server), which gives us many interesting possibilities
> > - i.e., with simple changes we fool the client-side
> > replicate translator at the time of transaction initiation
> > into believing that only the closest server is up at that
> > point of time, write to it alone, and have the proactive
> > self-heal daemon perform the extra copies in the
> > background. This would be consistent with other readers,
> > as they get directed to the "right" version of the file by
> > inspecting the changelogs while the background replication
> > is in progress.
> >
> > The intention of the above example is to give a general
> > sense of how we want to evolve the architecture (i.e., the
> > "intention" you were referring to) - keep the clients
> > intelligent and the servers dumb. If some intelligence
> > needs to be built on the physical server, tackle it by
> > loading a client process there (there are also "pathinfo
> > xattr" kind of internal techniques to figure out locality
> > of the clients in a generic way without bringing "server
> > sidedness" into them in a harsh way).
>
> Okay .. But what happened to the "brick" architecture
> of stacking anything on anything?  I think you point
> that out here ...
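(For reference, the Fuse re-export attempt I mention above
looked roughly like this - hosts, paths and volume names are
all illustrative, and this is the stack that wouldn't play ball:

  # On the re-exporting host: a Distribute volume is already
  # FUSE-mounted at /mnt/dht, and is served back out as if it
  # were an ordinary posix brick.
  volume dht-reexport
    type storage/posix
    option directory /mnt/dht
  end-volume

  volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.dht-reexport.allow *
    subvolumes dht-reexport
  end-volume

A Replicate volume elsewhere then consumed "dht-reexport" as
one of its subvolumes, as though it were any other brick.)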
> > > I'll
> > > write an rsyncd wrapper myself, to run on top of Gluster,
> > > if the intent is not to allow the configuration I'm after
> > > (an arbitrary number of disks in one multi-host
> > > environment replicated to an arbitrary number of disks in
> > > another multi-host environment, where ideally each
> > > environment need not sum to the same data capacity,
> > > presented in a single contiguous consumable storage layer
> > > to an arbitrary number of unintelligent clients, that is
> > > as fault tolerant as I choose it to be, including the
> > > ability to add and offline/online and remove storage as I
> > > so choose) .. or switch out the whole solution if Gluster
> > > is heading away from my needs.  I just need to know what
> > > the direction is .. I may even be able to help get you
> > > there if you tell me :)
> >
> > There are good and bad in both styles (distribute on top
> > v/s replicate on top). Replicate on top gives you much
> > better flexibility of configuration. Distribute on top is
> > easier for us developers. As a user I would like replicate
> > on top as well. But the problem today is that replicate
> > (and self-heal) does not understand "partial failure" of
> > its subvolumes. If one of the subvolumes of replicate is a
> > distribute, then today's replicate only understands
> > complete failure of the distribute set or it assumes
> > everything is completely fine. An example is self-healing
> > of directory entries. If a file is "missing" in one
> > subvolume because a distribute node is temporarily down,
> > replicate has no clue why it is missing (or that it should
> > keep away from attempting to self-heal). Along the same
> > lines, it does not know that, once a server is taken off
> > from its distribute subvolume for good, it needs to start
> > recreating missing files.
>
> Hmm.  I loved the brick idea.  I don't like perverting it by
> trying to "see through" layers.  In that context I can see
> two or three expected outcomes from someone building
> this type of stack (heh: a quick trick brick stack) - when
> a distribute child disappears;
>
>   At the Distribute layer;
>     1) The distribute name space / stat space
>        remains intact, though the content is
>        obviously not avail.
>     2) The distribute presentation is pure and true
>        to its constituents, showing only the names
>        / stats that are online/avail.
>
>     In its standalone case, 2 is probably
>     preferable as it allows clean add/start/stop/
>     remove capacity.
>
>   At the Replicate layer;
>     3) Replication occurs only where the name /
>        stat space shows a gap.
>     4) Replication occurs at any delta.
>
>     I don't think there's a real choice here; even
>     if 3 were sensible, what would replicate do if
>     there was a local name and even just a remote
>     file size change, when there's no local content
>     to update?  It must be 4.
>
> In which case, I would expect that a replicate
> on top of a distribute with a missing child would
> suddenly see a delta that it would immediately
> set about repairing.
>
> > The effort to fix this seems to be big enough to disturb
> > the inertia of the status quo. If this is fixed, we can
> > definitely adopt a replicate-on-top mode in glusterd.
>
> I'm not sure why there needs to be a "fix" .. wasn't
> the previous behaviour sensible?
>
> Or, if there is something to "change", then
> bolstering the distribute module might be enough -
> a combination of 1 and 2 above.
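(For concreteness, the "replicate on top" arrangement being
debated here, written as a hand-crafted client volfile -
glusterd will not generate this nesting today, and all names
are illustrative:

  volume east-brick1
    type protocol/client
    option transport-type tcp
    option remote-host serverA
    option remote-subvolume brick1
  end-volume

  # ... east-brick2, west-brick1, west-brick2 defined likewise ...

  volume east            # distribute set in one environment
    type cluster/distribute
    subvolumes east-brick1 east-brick2
  end-volume

  volume west            # distribute set in the other environment
    type cluster/distribute
    subvolumes west-brick1 west-brick2
  end-volume

  volume mirror          # replicate across the two environments
    type cluster/replicate
    subvolumes east west
  end-volume

When a child of "east" drops out, "mirror" sees names missing
from the east side with no way to tell a dead disk from a
temporarily absent distribute member - the partial failure
problem described above.)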
> Try this out: what if the Distribute layer maintained
> a full name space on each child, and didn't allow
> "recreation"?  Say 3 children, one of which is
> broken/offline, so that /path/to/child/3/file is missing
> but is known to be missing (internally to Distribute).
> Then the Distribute brick can both hide that part of the
> name space from the parent layers and actively prevent
> manipulation of those files (the parent can neither stat
> /path/to/child/3/file nor unlink, nor create/write to it).
> If this change is meant to be permanent, then the
> administrative act of removing the child from distribute
> will then truncate the locked name space, allowing parents
> (be they users or other bricks, like Replicate) to act as
> they please (such as recreating the missing files).
>
> If you adhere to the principles that I thought I
> understood from 2009 or so, then you should be
> able to let the users create unforeseen Gluster
> architectures without fear or impact.  I.e.
>
>   i)   each brick is fully self contained *
>   ii)  physical bricks are the bread of a brick
>        stack sandwich **
>   iii) any logical brick can appear above/below
>        any other logical brick in a brick stack
>
>   *  Not mandating a 1:1 file mapping from layer
>      to layer
>
>   ** E.g.: the Posix (bottom), Client (bottom),
>      Server (top) and NFS (top) bricks are all
>      regarded as physical bricks.
>
> Thus it was my expectation that a dedupe brick
> (being logical) could either go above or below
> a distribute brick (also logical), for example.
>
> Or that an encryption brick could go on top of
> replicate, which was on top of encryption, which
> was on top of distribute, which was on top of
> encryption on top of posix, for example.
>
>
> Or .. am I over-simplifying the problem space?
>
>
> --
> Ian Latter
> Late night coder ..
> http://midnightcode.org/
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> https://lists.nongnu.org/mailman/listinfo/gluster-devel


--
Ian Latter
Late night coder ..
http://midnightcode.org/
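P.S. Coming back to the "locked name space" idea quoted
above - a rough sketch of how it might surface in a volfile.
These options are entirely hypothetical; nothing like them
exists in any GlusterFS release:

  volume dht0
    type cluster/distribute
    subvolumes child1 child2 child3
    # hypothetical: remember entries that live on offline
    # children, hide them from parents, and refuse stat/
    # unlink/create on those names until the child returns
    option retain-namespace on
  end-volume

The matching administrative act would be an explicit removal
(say, a hypothetical "remove-child child3" step) that truncates
the locked name space, at which point parents - users or
Replicate - are free to recreate the missing files.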