http://review.gluster.org/#/c/6031/ (patch to remove replace-brick data
migration) is slated for merge before 3.5. Review comments (on gerrit)
welcome.

Thanks,
Avati

On Thu, Oct 3, 2013 at 9:27 AM, Anand Avati <avati at gluster.org> wrote:

> On Thu, Oct 3, 2013 at 8:57 AM, KueiHuan Chen <kueihuan.chen at gmail.com> wrote:
>
>> Hi, Avati
>>
>> In your chained configuration, how do you replace the whole of h1
>> without replace-brick? Is there a better way than replace-brick in
>> this situation?
>>
>> h0:/b1 h1:/b2  h1:/b1 h2:/b2  h2:/b1 h0:/b2  (A new h3 wants to
>> replace the old h1.)
>
> You have a couple of options:
>
> A)
>
>   replace-brick h1:/b1 h3:/b1
>   replace-brick h1:/b2 h3:/b2
>
> and let self-heal bring the disks up to speed, or
>
> B)
>
>   add-brick replica 2 h3:/b1 h2:/b2a
>   add-brick replica 2 h3:/b2 h0:/b1a
>
>   remove-brick h0:/b1 h1:/b2 start .. commit
>   remove-brick h2:/b2 h1:/b1 start .. commit
>
> Let me know if you still have questions.
>
> Avati
>
>> Thanks.
>> Best Regards,
>>
>> KueiHuan-Chen
>> Synology Incorporated.
>> Email: khchen at synology.com
>> Tel: +886-2-25521814 ext.827
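Spelled out as full gluster CLI invocations, the two options above would
look roughly as follows. This is only a sketch: the volume name vol0 is
assumed, and the exact replace-brick syntax differs between releases
(recent releases accept only "commit force", relying on self-heal rather
than data migration).

    # Option A: swap h1's bricks for h3's and let self-heal rebuild them
    gluster volume replace-brick vol0 h1:/b1 h3:/b1 commit force
    gluster volume replace-brick vol0 h1:/b2 h3:/b2 commit force
    gluster volume heal vol0 full

    # Option B: add new replica pairs that include h3, then drain the
    # pairs that still reference h1
    gluster volume add-brick vol0 replica 2 h3:/b1 h2:/b2a
    gluster volume add-brick vol0 replica 2 h3:/b2 h0:/b1a
    gluster volume remove-brick vol0 h0:/b1 h1:/b2 start
    gluster volume remove-brick vol0 h0:/b1 h1:/b2 status   # wait for "completed"
    gluster volume remove-brick vol0 h0:/b1 h1:/b2 commit
    gluster volume remove-brick vol0 h2:/b2 h1:/b1 start
    gluster volume remove-brick vol0 h2:/b2 h1:/b1 status   # wait for "completed"
    gluster volume remove-brick vol0 h2:/b2 h1:/b1 commit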
>>
>> 2013/9/30 Anand Avati <avati at gluster.org>:
>> >
>> > On Fri, Sep 27, 2013 at 1:56 AM, James <purpleidea at gmail.com> wrote:
>> >>
>> >> On Fri, 2013-09-27 at 00:35 -0700, Anand Avati wrote:
>> >> > Hello all,
>> >>
>> >> Hey,
>> >>
>> >> Interesting timing for this post...
>> >> I've actually started working on automatic brick addition/removal.
>> >> (I'm planning to add this to puppet-gluster, of course.) I was hoping
>> >> you could help out with the algorithm. I think it's a bit different
>> >> if there's no replace-brick command, as you are proposing.
>> >>
>> >> Here's the problem:
>> >> Given a logically optimal initial volume:
>> >>
>> >>   volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2
>> >>
>> >> suppose I know that I want to add/remove bricks such that my new
>> >> volume (if I had created it new) looks like:
>> >>
>> >>   volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2
>> >>   h4:/b2 h5:/b2 h6:/b2
>> >>
>> >> What is the optimal algorithm for determining the correct sequence of
>> >> transforms needed to accomplish this task? Obviously there are some
>> >> simpler corner cases, but I'd like to solve the general case.
>> >>
>> >> The transforms are obviously things like running the add-brick {...}
>> >> and remove-brick {...} commands.
>> >>
>> >> Obviously we have to take into account that it's better to add bricks
>> >> and rebalance before we remove bricks and risk the file system if a
>> >> replica is missing. The algorithm should work for any replica N. We
>> >> want to make sure the new layout still replicates the data across
>> >> different servers. In many cases, this will require creating a
>> >> circular "chain" of bricks as illustrated at the bottom of this image:
>> >> http://joejulian.name/media/uploads/images/replica_expansion.png
>> >> for example. I'd like to optimize for safety first, and then time, I
>> >> imagine.
>> >>
>> >> Many thanks in advance.
>> >
>> > I see what you are asking. First of all, when running a 2-replica
>> > volume you almost always want to have an even number of servers, and
>> > to add servers in even numbers. Ideally the two "sides" of the
>> > replicas should be placed in separate failure zones - separate racks
>> > with separate power supplies, or separate AZs in the cloud. Having an
>> > odd number of servers with 2 replicas is a very "odd" configuration.
>> > In all these years I have yet to come across a customer who has a
>> > production cluster with 2 replicas and an odd number of servers. And
>> > setting up replicas in such a chained manner makes it hard to reason
>> > about availability, especially when you are trying to recover from a
>> > disaster. Having clear and separate "pairs" is definitely what is
>> > recommended.
>> >
>> > That being said, nothing prevents one from setting up a chain like the
>> > above as long as you are comfortable with the complexity of the
>> > configuration. And phasing out replace-brick in favor of
>> > add-brick/remove-brick does not make the above configuration
>> > impossible either. Let's say you have a chained configuration of N
>> > servers, with pairs formed as:
>> >
>> >   h(i):/b1 h((i+1) % N):/b2 | i := 0 -> N-1
>> >
>> > Now you add the (N+1)th server, hN.
>> >
>> > Using replace-brick, you have been doing this thus far:
>> >
>> > 1. add-brick hN:/b1 h0:/b2a  # because h0:/b2 was "part of a previous brick"
>> > 2. replace-brick h0:/b2 hN:/b2 start ... commit
>> >
>> > With the add-brick/remove-brick approach, you would instead do:
>> >
>> > 1. add-brick h(N-1):/b1a hN:/b2
>> > 2. add-brick hN:/b1 h0:/b2a
>> > 3. remove-brick h(N-1):/b1 h0:/b2 start ... commit
>> >
>> > You will not be left with only one copy of a file at any point in the
>> > process, and you achieve the same "end result" as you did with
>> > replace-brick. As mentioned before, I once again request you to
>> > consider whether you really want to deal with the configuration
>> > complexity of chained replication, instead of just adding servers in
>> > pairs.
>> >
>> > Please ask if there are any more questions or concerns.
>> >
>> > Avati
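To make the generic recipe above concrete: for a three-server chain (h0,
h1, h2) being extended with a new server h3, the add-brick/remove-brick
variant would look roughly like this (the volume name vol0 and the /b1a,
/b2a brick directories are illustrative):

    # existing replica-2 chain: h0:/b1+h1:/b2, h1:/b1+h2:/b2, h2:/b1+h0:/b2
    # steps 1 and 2: add two new pairs so that h3 is chained between h2 and h0
    gluster volume add-brick vol0 replica 2 h2:/b1a h3:/b2
    gluster volume add-brick vol0 replica 2 h3:/b1 h0:/b2a

    # step 3: drain and drop the old pair that the new bricks supersede
    gluster volume remove-brick vol0 h2:/b1 h0:/b2 start
    gluster volume remove-brick vol0 h2:/b1 h0:/b2 status   # wait for "completed"
    gluster volume remove-brick vol0 h2:/b1 h0:/b2 commit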
>> >
>> >> James
>> >>
>> >> Some comments below, although I'm a bit tired so I hope I said it
>> >> all right.
>> >>
>> >> > DHT's remove-brick + rebalance has been enhanced in the last couple
>> >> > of releases to be quite sophisticated. It can handle graceful
>> >> > decommissioning of bricks, including open file descriptors and hard
>> >> > links.
>> >>
>> >> Sweet
>> >>
>> >> > This in a way is a feature overlap with replace-brick's data
>> >> > migration functionality. Replace-brick's data migration is currently
>> >> > also used for planned decommissioning of a brick.
>> >> >
>> >> > Reasons to remove replace-brick (or why remove-brick is better):
>> >> >
>> >> > - There are two methods of moving data. It is confusing for the
>> >> > users and hard for developers to maintain.
>> >> >
>> >> > - If the server being replaced is a member of a replica set, neither
>> >> > remove-brick nor replace-brick data migration is necessary, because
>> >> > self-healing itself will recreate the data (replace-brick actually
>> >> > uses self-heal internally).
>> >> >
>> >> > - In a non-replicated config, if a server is getting replaced by a
>> >> > new one, add-brick <new> + remove-brick <old> "start" achieves the
>> >> > same goal as replace-brick <old> <new> "start".
>> >> >
>> >> > - In a non-replicated config, <replace-brick> is NOT glitch free
>> >> > (applications witness ENOTCONN if they are accessing data) whereas
>> >> > add-brick <new> + remove-brick <old> is completely transparent.
>> >> >
>> >> > - Replace-brick strictly requires a server with enough free space to
>> >> > hold the data of the old brick, whereas remove-brick will evenly
>> >> > spread out the data of the brick being removed amongst the remaining
>> >> > servers.
>> >>
>> >> Can you talk more about the replica = N case (where N is 2 or 3)?
>> >> With remove-brick/add-brick you will need to add/remove N (the replica
>> >> count) bricks at a time, right? With replace-brick, you could just
>> >> swap out one, right? Isn't that a missing feature if you remove
>> >> replace-brick?
>> >>
>> >> > - Replace-brick code is complex and messy (the real reason :p).
>> >> >
>> >> > - No clear reason why replace-brick's data migration is better in
>> >> > any way than remove-brick's data migration.
>> >> >
>> >> > I plan to send out patches to remove all traces of replace-brick
>> >> > data migration code by 3.5 branch time.
>> >> >
>> >> > NOTE that the replace-brick command itself will still exist, and you
>> >> > can replace one server with another in case a server dies. It is
>> >> > only the data migration functionality that is being phased out.
>> >> >
>> >> > Please do ask any questions / raise concerns at this stage :)
>> >>
>> >> I heard with 3.4 you can somehow change the replica count when adding
>> >> new bricks... What's the full story here please?
>> >>
>> >> Thanks!
>> >> James
>> >>
>> >> > Avati
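On the last question: recent releases (James mentions 3.4) let add-brick
take an explicit replica count, which is how the replica count of an
existing volume is changed. A rough, illustrative sketch follows; the
volume name vol1, hosts and brick paths are made up, and the exact forms
supported should be checked against the documentation for your release.

    # grow a 2x2 distributed-replicated volume (vol1) to replica 3 by
    # adding one new brick per existing replica set
    gluster volume add-brick vol1 replica 3 h5:/b1 h6:/b1

    # without changing the replica count, bricks are decommissioned a
    # whole replica set (replica-count bricks) at a time, as James suspects
    gluster volume remove-brick vol1 h1:/b1 h2:/b1 start
    gluster volume remove-brick vol1 h1:/b1 h2:/b1 status   # wait for "completed"
    gluster volume remove-brick vol1 h1:/b1 h2:/b1 commit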