Re: ZkFarmer

"Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx> · Fri, 11 May 2012 09:39:58 +1000

> > Sure, I have my own vol files that do (did) what I wanted
> > and I was supporting myself (and users); the question
> > (and the point) is what is the GlusterFS *intent*?
> 
> 
> The "intent" (more or less - I hate to use the word as it can imply a
> commitment to what I am about to say, but there isn't one) is to keep the
> bricks (server process) dumb and have the intelligence on the client side.
> This is a "rough goal". There are cases where replication on the server
> side is inevitable (in the case of NFS access) but we keep the software
> architecture undisturbed by running a client process on the server machine
> to achieve it.

[There's a difference between intent and plan/roadmap]

Okay.  Unfortunately I am unable to leverage this - I tried
to serve a Fuse->GlusterFS client mount point (of a 
Distribute volume) as a GlusterFS posix brick (for a
Replicate volume) and it wouldn't play ball ..

> We do plan to support "replication on the server" in the future while still
> retaining the existing software architecture as much as possible. This is
> particularly useful in Hadoop environment where the jobs expect write
> performance of a single copy and expect copy to happen in the background.
> We have the proactive self-heal daemon running on the server machines now
> (which again is a client process which happens to be physically placed on
> the server) which gives us many interesting possibilities - i.e, with
> simple changes where we fool the client side replicate translator at the
> time of transaction initiation that only the closest server is up at that
> point of time and write to it alone, and have the proactive self-heal
> daemon perform the extra copies in the background. This would be consistent
> with other readers as they get directed to the "right" version of the file
> by inspecting the changelogs while the background replication is in
> progress.
> 
> The intention of the above example is to give a general sense of how we
> want to evolve the architecture (i.e, the "intention" you were referring
> to) - keep the clients intelligent and servers dumb. If some intelligence
> needs to be built on the physical server, tackle it by loading a client
> process there (there are also "pathinfo xattr" kind of internal techniques
> to figure out locality of the clients in a generic way without bringing
> "server sidedness" into them in a harsh way)
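
To check that I follow the write-to-closest idea quoted above, here is a
toy Python model of it.  This is only a sketch, not GlusterFS code: the
class, the "pending" sets (standing in for the changelogs) and the server
names are all made up for illustration.

```python
class ReplicaSet:
    """Toy model: write one copy now, let a heal pass make the others."""

    def __init__(self, servers):
        self.data = {s: {} for s in servers}         # server -> {name: content}
        self.pending = {s: set() for s in servers}   # server -> names stale there

    def write(self, name, content, closest):
        # Behave as if only the closest server were up: write a single copy
        # and record that every other server still needs it.
        self.data[closest][name] = content
        self.pending[closest].discard(name)
        for server in self.data:
            if server != closest:
                self.pending[server].add(name)

    def read(self, name, preferred):
        # A reader consults the pending marks and skips stale copies.
        order = [preferred] + [s for s in self.data if s != preferred]
        for server in order:
            if name in self.data[server] and name not in self.pending[server]:
                return self.data[server][name]
        raise FileNotFoundError(name)

    def heal(self):
        # Background pass (the role of the self-heal daemon in the quote):
        # copy each stale name from a server that holds a good copy.
        for server, stale in self.pending.items():
            for name in sorted(stale):
                source = next(s for s in self.data
                              if name in self.data[s]
                              and name not in self.pending[s])
                self.data[server][name] = self.data[source][name]
            stale.clear()


rs = ReplicaSet(["near", "far"])
rs.write("f", "v1", closest="near")
value_before_heal = rs.read("f", preferred="far")  # served from "near"
rs.heal()
```

The point of the model is that a reader preferring "far" still gets the
current content before heal() has run, because the stale copy is skipped.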

Okay .. But what happened to the "brick" architecture
of stacking anything on anything?  I think you point
that out here ...

> > I'll
> > write an rsyncd wrapper myself, to run on top of Gluster,
> > if the intent is not allow the configuration I'm after
> > (arbitrary number of disks in one multi-host environment
> > replicated to an arbitrary number of disks in another
> > multi-host environment, where ideally each environment
> > need not sum to the same data capacity, presented in a
> > single contiguous consumable storage layer to an
> > arbitrary number of unintelligent clients, that is as fault
> > tolerant as I choose it to be including the ability to add
> > and offline/online and remove storage as I so choose) ..
> > or switch out the whole solution if Gluster is heading
> > away from my  needs.  I just need to know what the
> > direction is .. I may even be able to help get you there if
> > you tell me :)
> >
> >
> There are good and bad in both styles (distribute on top v/s replicate on
> top). Replicate on top gives you much better flexibility of configuration.
> Distribute on top is easier for us developers. As a user I would like
> replicate on top as well. But the problem today is that replicate (and
> self-heal) does not understand "partial failure" of its subvolumes. If one
> of the subvolume of replicate is a distribute, then today's replicate only
> understands complete failure of the distribute set or it assumes everything
> is completely fine. An example is self-healing of directory entries. If a
> file is "missing" in one subvolume because a distribute node is temporarily
> down, replicate has no clue why it is missing (or that it should keep away
> from attempting to self-heal). Along the same lines, it does not know that
> once a server is taken off from its distribute subvolume for good that it
> needs to start recreating missing files.

Hmm.  I loved the brick idea.  I don't like perverting it by
trying to "see through" layers.  In that context I can see
two or three expected outcomes from someone building 
this type of stack (heh: a quick trick brick stack) - when
a distribute child disappears;

  At the Distribute layer;
  1) The distribute name space / stat space 
       remains intact, though the content is
       obviously not available.
  2) The distribute presentation is pure and true
       of its constituents, showing only the names 
       / stats that are online/available.

  In its standalone case, 2 is probably 
preferable as it allows clean add/start/stop/
remove capacity.

  At the Replicate layer;
   3) replication occurs only where the name /
        stat space shows a gap
   4) the replication occurs at any delta

  I don't think there's a real choice here.  Even 
if 3 were sensible, what would replicate do if 
there was a local name and even just a remote
file size change, when there's no local content 
to update?  It must be 4.

  In which case, I would expect that a replicate 
on top of a distribute with a missing child would
suddenly see a delta that it would immediately 
set about repairing.
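
As a toy illustration of that expectation (a Python sketch only, not
GlusterFS code; the function names and the name -> size dictionaries are
made up for the example), combine outcome 2 at the Distribute layer with
outcome 4 at the Replicate layer:

```python
def distribute_view(children, online):
    """Outcome 2: show only the names/stats held by online children."""
    view = {}
    for child, files in children.items():
        if child in online:
            view.update(files)
    return view


def replicate_deltas(local, remote):
    """Outcome 4: any difference in name or stat is a delta to repair."""
    names = set(local) | set(remote)
    return sorted(n for n in names if local.get(n) != remote.get(n))


# Hypothetical layout: three distribute children, file name -> size.
children = {"c1": {"a": 10}, "c2": {"b": 20}, "c3": {"c": 30}}

healthy = distribute_view(children, online={"c1", "c2", "c3"})
degraded = distribute_view(children, online={"c1", "c2"})

# Replicate compares its two subvolumes and sees the gap left by c3.
deltas = replicate_deltas(degraded, healthy)
```

With child c3 offline on one side, the only delta Replicate sees is the
file that lived on c3, which is the one it would set about repairing.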

> The effort to fix this seems to be big enough to disturb the inertia of
> status quo. If this is fixed, we can definitely adopt a replicate-on-top
> mode in glusterd.

I'm not sure why there needs to be a "fix" .. wasn't 
the previous behaviour sensible?

Or, if there is something to "change", then 
bolstering the distribute module might be enough -
a combination of 1 and 2 above.

Try this out: what if the Distribute layer maintained 
a full name space on each child, and didn't allow 
"recreation"?  Say 3 children, one is broken/offline, 
so that /path/to/child/3/file is missing but is known 
to be missing (internally to Distribute).  Then the 
Distribute brick can both hide that name space 
from the parent layers and actively prevent 
manipulation of those files (the parent can neither 
stat /path/to/child/3/file nor unlink, nor
create/write to it).  If this change is meant to be 
permanent, then the administrative act of 
removing the child from distribute will then 
truncate the locked name space, allowing parents
(be they users or other bricks, like Replicate) to 
act as they please (such as recreating the 
missing files).
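
A rough Python sketch of that proposal follows.  It is a toy model under
my own assumptions, not how the real Distribute translator works: the
class, its method names and the in-memory dictionaries are hypothetical,
and a single shared index stands in for "a full name space on each child".

```python
class Distribute:
    """Toy Distribute layer that knows which child owns every name."""

    def __init__(self, children):
        # children: {child_name: {file_name: content}}
        self.children = {c: dict(files) for c, files in children.items()}
        # Full name space: file name -> owning child.
        self.namespace = {n: c for c, files in self.children.items()
                          for n in files}
        self.offline = set()

    def set_offline(self, child):
        self.offline.add(child)

    def _check(self, name):
        # Names owned by an offline child are locked: known, but untouchable.
        if self.namespace.get(name) in self.offline:
            raise PermissionError(f"{name}: owning child is offline")

    def listdir(self):
        # Parents never see the locked part of the name space.
        return sorted(n for n, c in self.namespace.items()
                      if c not in self.offline)

    def stat(self, name):
        self._check(name)
        if name not in self.namespace:
            raise FileNotFoundError(name)
        return len(self.children[self.namespace[name]][name])

    def create(self, name, content):
        self._check(name)
        child = self.namespace.get(name)
        if child is None:
            # New name: place it on an online child (simple stand-in hash).
            online = sorted(c for c in self.children if c not in self.offline)
            child = online[sum(name.encode()) % len(online)]
            self.namespace[name] = child
        self.children[child][name] = content

    def unlink(self, name):
        self._check(name)
        if name not in self.namespace:
            raise FileNotFoundError(name)
        del self.children[self.namespace.pop(name)][name]

    def remove_child(self, child):
        # Administrative removal: truncate the locked name space so that
        # parents (users or Replicate) may recreate the missing files.
        self.children.pop(child)
        self.offline.discard(child)
        self.namespace = {n: c for n, c in self.namespace.items()
                          if c != child}
```

While child 3 is merely offline, create/stat/unlink on its files raise an
error and the names stay hidden; only after remove_child() can a parent
layer recreate them.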

If you adhere to the principles that I thought I 
understood from 2009 or so then you should be 
able to let the users create unforeseen Gluster
architectures without fear or impact.  I.e. 

  i)   each brick is fully self contained *
  ii)   physical bricks are the bread of a brick
        stack sandwich **
  iii)  any logical brick can appear above/below 
       any other logical brick in a brick stack

  *  Not mandating a 1:1 file mapping from layer 
      to layer

  ** Eg: the Posix (bottom), Client (bottom), 
      Server (top) and NFS (top) are all 
      regarded as physical bricks.

Thus it was my expectation that a dedupe brick
(being logical) could either go above or below 
a distribute brick (also logical), for example.

Or that an encryption brick could go on top
of replicate which was on top of encryption
which was on top of distribute which was on 
top of encryption on top of posix, for example.
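
To show what I mean by stacking, here is a toy Python model of that last
example.  It is a sketch only: these classes are not GlusterFS
translators, the XOR mask is a stand-in for encryption (it is not real
encryption), and the keys and file name are made up.  The only point is
that every brick exposes the same read/write interface, so any logical
brick can sit above or below any other.

```python
class Posix:
    """Physical bottom brick: just stores bytes."""
    def __init__(self):
        self.files = {}
    def write(self, name, data):
        self.files[name] = bytes(data)
    def read(self, name):
        return self.files[name]


class Xor:
    """Logical brick standing in for encryption (NOT real encryption)."""
    def __init__(self, child, key):
        self.child, self.key = child, key
    def _mask(self, data):
        return bytes(b ^ self.key for b in data)
    def write(self, name, data):
        self.child.write(name, self._mask(data))
    def read(self, name):
        return self._mask(self.child.read(name))


class Distribute:
    """Logical brick: each name lives on exactly one child."""
    def __init__(self, children):
        self.children = list(children)
    def _pick(self, name):
        return self.children[sum(name.encode()) % len(self.children)]
    def write(self, name, data):
        self._pick(name).write(name, data)
    def read(self, name):
        return self._pick(name).read(name)


class Replicate:
    """Logical brick: every child gets a copy."""
    def __init__(self, children):
        self.children = list(children)
    def write(self, name, data):
        for child in self.children:
            child.write(name, data)
    def read(self, name):
        for child in self.children:
            try:
                return child.read(name)
            except KeyError:
                continue
        raise KeyError(name)


def build_stack():
    # encryption / replicate / encryption / distribute / encryption / posix
    disks = [Posix() for _ in range(4)]
    def side(pair):
        return Xor(Distribute([Xor(d, key=0x11) for d in pair]), key=0x22)
    top = Xor(Replicate([side(disks[:2]), side(disks[2:])]), key=0x44)
    return top, disks


top, disks = build_stack()
top.write("hello.txt", b"data")
round_trip = top.read("hello.txt")
```

Each replica side ends up with one masked copy on one of its two disks,
and the top of the stack still reads back the original bytes.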

Or .. am I oversimplifying the problem space?

--
Ian Latter
Late night coder ..
http://midnightcode.org/