On Sun, 2013-09-29 at 22:41 -0700, Anand Avati wrote:
> I see what you are asking. First of all, when running a 2-replica
> volume you almost always want to have an even number of servers, and
> to add servers in even numbers. Ideally the two "sides" of the
> replicas should be placed in separate failure zones - separate racks
> with separate power supplies, or separate AZs in the cloud. Having an
> odd number of servers with 2 replicas is a very "odd" configuration.
> In all these years I have yet to come across a customer who has a
> production cluster with 2 replicas and an odd number of servers. And
> setting up replicas in such a chained manner makes it hard to reason
> about availability, especially when you are trying to recover from a
> disaster. Having clear and separate "pairs" is definitely what is
> recommended.

Obviously I completely agree. In fact, I've written most of the code
for this scenario; however, I'm trying to build out my code to support
the general case.

> That being said, nothing prevents one from setting up a chain like
> the above, as long as you are comfortable with the complexity of the
> configuration. And phasing out replace-brick in favor of
> add-brick/remove-brick does not make the above configuration
> impossible either. Let's say you have a chained configuration of N
> servers, with pairs formed between every:
>
> h(i):/b1 h((i+1) % N):/b2 | i := 0 -> N-1

Perfect... So far, so good.

> Now you add the (N+1)th server.

This server will be "hN", because we're zero-based in your example...

> Using replace-brick, you have been doing thus far:
>
> 1. add-brick hN:/b1 h0:/b2a # because h0:/b2 was "part of a previous
>    brick"

Here is that server; we complete the chain from hN to h0. Let's rename
h0:/b2a to h0:/b2-tmp instead. The problem is that this assumes we
have room for a b2-tmp on h0!

> 2. replace-brick h0:/b2 hN:/b2 start ... commit

If you meant h0:/b2a (aka h0:/b2-tmp) here, instead of h0:/b2, doesn't
this break the chain? Now hN stands alone with b1 and b2, and is not
part of the chain. In fact, b1 and b2 on hN are replicas of each
other, so this is a SPOF.

> In case you are doing an add-brick/remove-brick approach, you would
> now instead do:
>
> 1. add-brick h(N-1):/b1a hN:/b2
> 2. add-brick hN:/b1 h0:/b2a
> 3. remove-brick h(N-1):/b1 h0:/b2 start ... commit

I think this algorithm works, although I'd have to test it :P (I've
sketched what I'd test below.) The one downside (for which I actually
have a workaround) is that the new bricks have to be named differently
from the original ones. Is there a way around this?

> You will not be left with only 1 copy of a file at any point in the
> process, and you achieve the same "end result" as with replace-brick.
> As mentioned before, I once again request you to consider whether you
> really want to deal with the configuration complexity of chained
> replication, instead of just adding servers in pairs.

I am just trying to avoid corner cases in my code. Puppet won't work
well with those :P

> Please ask if there are any more questions or concerns.

I have some follow-up, but for the moment, I have another question to
add to this thread. It's the same idea really... Suppose you have a
set of sanely named and ordered hosts and bricks. Is there one (and
only one) logical ordering for them?
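(Aside: to make sure I've understood the chained layout and the
three-step growth procedure above, here's a tiny untested Ruby sketch
of both - my own notation, not code from any of the attachments.
Please correct me if I've mangled the indices.)

  #!/usr/bin/env ruby
  # Print the chained pair list h(i):/b1 h((i+1) % N):/b2 for
  # i := 0 -> N-1, then the three add-brick/remove-brick steps for
  # growing the chain from N to N+1 servers (zero-based, so the new
  # server is hN).
  def chain_pairs(n)
    (0...n).map { |i| "h#{i}:/b1 h#{(i + 1) % n}:/b2" }
  end

  n = 4
  puts "chained layout, #{n} servers:"
  chain_pairs(n).each { |pair| puts "  #{pair}" }

  puts "growing to #{n + 1} servers:"
  puts "  1. add-brick h#{n - 1}:/b1a h#{n}:/b2"
  puts "  2. add-brick h#{n}:/b1 h0:/b2a"
  puts "  3. remove-brick h#{n - 1}:/b1 h0:/b2 start ... commit"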
I've decided that the answer is yes, and I've written the algorithm
for ordering them:

https://github.com/purpleidea/puppet-gluster/blob/master/lib/facter/gluster_bricks.rb#L77

Do you have any comments/objections? I've attached an easy standalone
version of this code to run (brick_logic_ordering_wip.rb).

I also have a more complicated version of this code
(brick_logic_ordering_v2_wip.rb). It does almost the same thing as the
first version; the difference is that it supports a proposed "brick
nomenclature" (see below).

What does this all mean? My theory: if you can define a logical brick
and hostname naming convention, and you always use it, then for every
given list of bricks there should be only one logical "ordering"
(where an ordering is the linear order needed for a volume create
command). Secondly, if you want to add or remove bricks, and you do so
by following the naming convention, then the combined old list + new
bricks can also be sorted into a single linear ordering. Furthermore,
there exists an algorithm that can compute the add/remove brick
commands needed to transform the initial set into the second set. I've
attached this algorithm here: (brick_logic_transform_v1_wip.rb)

The only other thing to mention is the brick nomenclature. It is:

  /path/bxxxxxxx#vzzzz

where:
  b       is the constant char 'b'
  xxxxxxx is a zero-padded int for the brick #
  #vzzzz  is the constant '#v' followed by zzzz
  zzzz    is a zero-padded int for the version #

Each time new bricks are added, you increment the maximum visible
version # and use that. If no version number is specified, then we
assume version 1. The length of the padding must be decided on in
advance and can't be changed. Valid brick names include:

  /data/b000004
  /data/b000022#v0003

and so on... Hostnames are simple: hostnameYYYY, where YYYY is a
padded int, and you distribute your hosts sequentially across racks or
switches or whatever your commonality for SPOF is. Technically, for
the transforms, I'm not even sure the version # is necessary.

The big problem with my algorithms is that they don't work for chained
configurations. I'd love to be able to make that so!!!

Why is all this relevant? Because if I can solve these problems,
Gluster users can have fully decentralized elastic volumes that
grow/shrink on demand, without ever having to manually run add/remove
brick commands. I'll be able to do all of this with puppet-gluster,
for example. Users will just run puppet, without changing any
configurations, and hosts will automatically come up and grow to the
size the hardware supports. Most of the code is already published.
More to come. (A toy sketch of the nomenclature parsing and the
ordering follows below.)
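(To illustrate the nomenclature and the "one ordering" idea, here's a
toy Ruby approximation - NOT the actual gluster_bricks.rb code - that
assumes six-digit brick and four-digit version padding to match the
examples above. It parses bricks and interleaves them across sorted
hosts, so each adjacent replica-2 pair spans two different hosts.)

  #!/usr/bin/env ruby
  # host:/path/bXXXXXX with an optional #vZZZZ suffix.
  BRICK = %r{\A([a-z0-9.-]+):(/.*)/b(\d{6})(?:#v(\d{4}))?\z}

  def parse(brick)
    m = BRICK.match(brick) or raise "bad brick name: #{brick}"
    { host: m[1], path: m[2], num: m[3].to_i, ver: (m[4] || '1').to_i }
  end

  # One deterministic ordering: sort hosts, sort each host's bricks by
  # (version, number), then emit round 1 from every host, round 2, etc.
  def order(bricks)
    by_host = bricks.map { |b| parse(b) }
                    .group_by { |p| p[:host] }.sort.to_h
    by_host.each_value { |l| l.sort_by! { |p| [p[:ver], p[:num]] } }
    out = []
    by_host.values.map(&:size).max.times do |i|
      by_host.each_value { |l| out << l[i] if l[i] }
    end
    # version suffix omitted in the output for brevity
    out.map { |p| format("%s:%s/b%06d", p[:host], p[:path], p[:num]) }
  end

  puts order(%w[
    hostname0002:/data/b000002 hostname0001:/data/b000001
    hostname0001:/data/b000002 hostname0002:/data/b000001
  ])
  # => hostname0001:/data/b000001
  #    hostname0002:/data/b000001
  #    hostname0001:/data/b000002
  #    hostname0002:/data/b000002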
Hope that was all understandable. It's probably hard to talk about
this by email, but I'm trying. :)

Cheers,
James

> Avati

-------------- next part --------------
Attachments (non-text attachments were scrubbed by the list archiver):
  brick_logic_ordering_wip.rb (7595 bytes):
  <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20131011/90037bea/attachment.bin>
  brick_logic_ordering_v2_wip.rb (6050 bytes):
  <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20131011/90037bea/attachment-0001.bin>
  brick_logic_transform_v1_wip.rb (11354 bytes):
  <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20131011/90037bea/attachment-0002.bin>