Actually, I want to clarify this point;

> But the problem today is that replicate (and
> self-heal) does not understand "partial failure"
> of its subvolumes. If one of the subvolumes of
> replicate is a distribute, then today's replicate
> only understands complete failure of the
> distribute set or it assumes everything is
> completely fine.

I haven't seen this in practice .. I have seen replicate
attempt to repair anything that was "missing", and both
the replicate and the underlying bricks remained viable
storage layers in that process ...


----- Original Message -----
>From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
>To: "Anand Avati" <anand.avati@xxxxxxxxx>
>Subject: Re: ZkFarmer
>Date: Fri, 11 May 2012 09:39:58 +1000
>
> > > Sure, I have my own vol files that do (did) what I wanted
> > > and I was supporting myself (and users); the question
> > > (and the point) is what is the GlusterFS *intent*?
> >
> > The "intent" (more or less - I hate to use the word as it
> > can imply a commitment to what I am about to say, but there
> > isn't one) is to keep the bricks (server process) dumb and
> > have the intelligence on the client side. This is a "rough
> > goal". There are cases where replication on the server side
> > is inevitable (in the case of NFS access) but we keep the
> > software architecture undisturbed by running a client
> > process on the server machine to achieve it.
>
> [There's a difference between intent and plan/roadmap]
>
> Okay. Unfortunately I am unable to leverage this - I tried
> to serve a Fuse->GlusterFS client mount point (of a
> Distribute volume) as a GlusterFS posix brick (for a
> Replicate volume) and it wouldn't play ball ..
>
> > We do plan to support "replication on the server" in the
> > future while still retaining the existing software
> > architecture as much as possible. This is particularly
> > useful in Hadoop environments where the jobs expect the
> > write performance of a single copy and expect the copy to
> > happen in the background. We have the proactive self-heal
> > daemon running on the server machines now (which again is
> > a client process which happens to be physically placed on
> > the server), which gives us many interesting possibilities
> > - i.e., with simple changes we fool the client-side
> > replicate translator at the time of transaction initiation
> > into believing that only the closest server is up at that
> > point of time, write to it alone, and have the proactive
> > self-heal daemon perform the extra copies in the
> > background. This would be consistent with other readers,
> > as they get directed to the "right" version of the file by
> > inspecting the changelogs while the background replication
> > is in progress.
> >
> > The intention of the above example is to give a general
> > sense of how we want to evolve the architecture (i.e., the
> > "intention" you were referring to) - keep the clients
> > intelligent and the servers dumb. If some intelligence
> > needs to be built on the physical server, tackle it by
> > loading a client process there (there are also "pathinfo
> > xattr" kind of internal techniques to figure out locality
> > of the clients in a generic way without bringing "server
> > sidedness" into them in a harsh way).
>
> Okay .. But what happened to the "brick" architecture
> of stacking anything on anything?  I think you point
> that out here ...
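(For reference, the Fuse re-export attempt I mention above
looked roughly like this - hosts, paths and volume names are
all illustrative, and this is the stack that wouldn't play ball:

  # On the re-exporting host: a Distribute volume is already
  # FUSE-mounted at /mnt/dht, and is served back out as if it
  # were an ordinary posix brick.
  volume dht-reexport
    type storage/posix
    option directory /mnt/dht
  end-volume

  volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.dht-reexport.allow *
    subvolumes dht-reexport
  end-volume

A Replicate volume elsewhere then consumed "dht-reexport" as
one of its subvolumes, as though it were any other brick.)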
> > > I'll
> > > write an rsyncd wrapper myself, to run on top of Gluster,
> > > if the intent is not to allow the configuration I'm after
> > > (an arbitrary number of disks in one multi-host
> > > environment replicated to an arbitrary number of disks in
> > > another multi-host environment, where ideally each
> > > environment need not sum to the same data capacity,
> > > presented in a single contiguous consumable storage layer
> > > to an arbitrary number of unintelligent clients, that is
> > > as fault tolerant as I choose it to be, including the
> > > ability to add and offline/online and remove storage as I
> > > so choose) .. or switch out the whole solution if Gluster
> > > is heading away from my needs.  I just need to know what
> > > the direction is .. I may even be able to help get you
> > > there if you tell me :)
> >
> > There are good and bad in both styles (distribute on top
> > v/s replicate on top). Replicate on top gives you much
> > better flexibility of configuration. Distribute on top is
> > easier for us developers. As a user I would like replicate
> > on top as well. But the problem today is that replicate
> > (and self-heal) does not understand "partial failure" of
> > its subvolumes. If one of the subvolumes of replicate is a
> > distribute, then today's replicate only understands
> > complete failure of the distribute set or it assumes
> > everything is completely fine. An example is self-healing
> > of directory entries. If a file is "missing" in one
> > subvolume because a distribute node is temporarily down,
> > replicate has no clue why it is missing (or that it should
> > keep away from attempting to self-heal). Along the same
> > lines, it does not know that, once a server is taken off
> > from its distribute subvolume for good, it needs to start
> > recreating missing files.
>
> Hmm.  I loved the brick idea.  I don't like perverting it by
> trying to "see through" layers.  In that context I can see
> two or three expected outcomes from someone building
> this type of stack (heh: a quick trick brick stack) - when
> a distribute child disappears;
>
>   At the Distribute layer;
>     1) The distribute name space / stat space
>        remains intact, though the content is
>        obviously not avail.
>     2) The distribute presentation is pure and true
>        to its constituents, showing only the names
>        / stats that are online/avail.
>
>     In its standalone case, 2 is probably
>     preferable as it allows clean add/start/stop/
>     remove capacity.
>
>   At the Replicate layer;
>     3) Replication occurs only where the name /
>        stat space shows a gap.
>     4) Replication occurs at any delta.
>
>     I don't think there's a real choice here; even
>     if 3 were sensible, what would replicate do if
>     there was a local name and even just a remote
>     file size change, when there's no local content
>     to update?  It must be 4.
>
> In which case, I would expect that a replicate
> on top of a distribute with a missing child would
> suddenly see a delta that it would immediately
> set about repairing.
>
> > The effort to fix this seems to be big enough to disturb
> > the inertia of the status quo. If this is fixed, we can
> > definitely adopt a replicate-on-top mode in glusterd.
>
> I'm not sure why there needs to be a "fix" .. wasn't
> the previous behaviour sensible?
>
> Or, if there is something to "change", then
> bolstering the distribute module might be enough -
> a combination of 1 and 2 above.
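(For concreteness, the "replicate on top" arrangement being
debated here, written as a hand-crafted client volfile -
glusterd will not generate this nesting today, and all names
are illustrative:

  volume east-brick1
    type protocol/client
    option transport-type tcp
    option remote-host serverA
    option remote-subvolume brick1
  end-volume

  # ... east-brick2, west-brick1, west-brick2 defined likewise ...

  volume east            # distribute set in one environment
    type cluster/distribute
    subvolumes east-brick1 east-brick2
  end-volume

  volume west            # distribute set in the other environment
    type cluster/distribute
    subvolumes west-brick1 west-brick2
  end-volume

  volume mirror          # replicate across the two environments
    type cluster/replicate
    subvolumes east west
  end-volume

When a child of "east" drops out, "mirror" sees names missing
from the east side with no way to tell a dead disk from a
temporarily absent distribute member - the partial failure
problem described above.)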
> Try this out: what if the Distribute layer maintained
> a full name space on each child, and didn't allow
> "recreation"?  Say 3 children, one of which is
> broken/offline, so that /path/to/child/3/file is missing
> but is known to be missing (internally to Distribute).
> Then the Distribute brick can both hide that part of the
> name space from the parent layers and actively prevent
> manipulation of those files (the parent can neither stat
> /path/to/child/3/file nor unlink, nor create/write to it).
> If this change is meant to be permanent, then the
> administrative act of removing the child from distribute
> will then truncate the locked name space, allowing parents
> (be they users or other bricks, like Replicate) to act as
> they please (such as recreating the missing files).
>
> If you adhere to the principles that I thought I
> understood from 2009 or so, then you should be
> able to let the users create unforeseen Gluster
> architectures without fear or impact.  I.e.
>
>   i)   each brick is fully self contained *
>   ii)  physical bricks are the bread of a brick
>        stack sandwich **
>   iii) any logical brick can appear above/below
>        any other logical brick in a brick stack
>
>   *  Not mandating a 1:1 file mapping from layer
>      to layer
>
>   ** E.g.: the Posix (bottom), Client (bottom),
>      Server (top) and NFS (top) bricks are all
>      regarded as physical bricks.
>
> Thus it was my expectation that a dedupe brick
> (being logical) could either go above or below
> a distribute brick (also logical), for example.
>
> Or that an encryption brick could go on top of
> replicate, which was on top of encryption, which
> was on top of distribute, which was on top of
> encryption on top of posix, for example.
>
>
> Or .. am I over-simplifying the problem space?
>
>
> --
> Ian Latter
> Late night coder ..
> http://midnightcode.org/
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> https://lists.nongnu.org/mailman/listinfo/gluster-devel


--
Ian Latter
Late night coder ..
http://midnightcode.org/
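P.S. Coming back to the "locked name space" idea quoted
above - a rough sketch of how it might surface in a volfile.
These options are entirely hypothetical; nothing like them
exists in any GlusterFS release:

  volume dht0
    type cluster/distribute
    subvolumes child1 child2 child3
    # hypothetical: remember entries that live on offline
    # children, hide them from parents, and refuse stat/
    # unlink/create on those names until the child returns
    option retain-namespace on
  end-volume

The matching administrative act would be an explicit removal
(say, a hypothetical "remove-child child3" step) that truncates
the locked name space, at which point parents - users or
Replicate - are free to recreate the missing files.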