Hey Atin,
This is happening because glusterd on the third node is brought down before doing the replace brick. In replace brick we do a temporary mount to mark a pending xattr on the source bricks, saying that the brick being replaced is the sink.
But in this case, since the glusterd of one of the source bricks is down, the mount fails to get the port on which that brick is listening,
and that leads to a failure in setting the "trusted.replace-brick" attribute.
For a replica 3 volume to declare any fop a success, it needs at least a quorum number of successes. Hence the replace brick fails.
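To make the mechanism concrete, here is a minimal sketch of how those markers can be inspected; the brick path /bricks/brick1/patchy1 and volume name "patchy" are placeholders, not from this setup:
# Dump all extended attributes on a brick root in hex.
getfattr -d -m . -e hex /bricks/brick1/patchy1
# On a healthy replica you would expect entries like
# trusted.afr.patchy-client-0, trusted.afr.patchy-client-1, and so on;
# replace brick relies on setting this kind of pending xattr (through the
# temporary mount) to mark the replaced brick as the sink.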
On the QE setup the replace brick would have succeeded only because of some race between glusterd going down and the replace brick happening.
Otherwise there is no chance for the replace brick to succeed.
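If you want to see the symptom from the CLI while glusterd on the third node is down, something like this (volume name assumed) should show it:
# Run from a node whose glusterd is still up. The brick whose glusterd is
# down is expected to be missing from the output or to show N/A for its
# port, which is exactly the lookup the temporary mount fails on.
gluster volume status patchy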
Regards,
Karthik
On Tue, Mar 27, 2018 at 7:25 PM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
Request some help/attention from AFR folks. While writing a test for the patch fix of BZ https://bugzilla.redhat.com/show_bug.cgi?id=1560957, I just can't make my test case pass: a replace brick commit force always fails on a multi node cluster, and that's on the latest mainline code. The fix is a one-liner:
atin@dhcp35-96:~/codebase/upstream/glusterfs_master/glusterfs$ gd HEAD~1
diff --git a/xlators/mgmt/glusterd/src/glusterd-utils.c b/xlators/mgmt/glusterd/src/glusterd-utils.c
index af30756c9..24d813fbd 100644
--- a/xlators/mgmt/glusterd/src/glusterd-utils.c
+++ b/xlators/mgmt/glusterd/src/glusterd-utils.c
@@ -5995,6 +5995,7 @@ glusterd_brick_start (glusterd_volinfo_t *volinfo,
          * TBD: re-use RPC connection across bricks
          */
         if (is_brick_mx_enabled ()) {
+                brickinfo->port_registered = _gf_true;
                 ret = glusterd_get_sock_from_brick_pid (pid, socketpath,
                                                         sizeof(socketpath));
                 if (ret) {
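Note that the branch this one-liner touches only runs with brick multiplexing enabled; a quick, informal way to confirm mux is actually in effect on a node:
# With cluster.brick-multiplex on, all bricks on a node should be served by
# a single glusterfsd process, so this should print exactly one line per node.
pgrep -a glusterfsd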
The test does the following:
#!/bin/bash
. $(dirname $0)/../../include.rc
. $(dirname $0)/../../cluster.rc
. $(dirname $0)/../../volume.rc
cleanup;
TEST launch_cluster 3;
TEST $CLI_1 peer probe $H2;
EXPECT_WITHIN $PROBE_TIMEOUT 1 peer_count
TEST $CLI_1 peer probe $H3;
EXPECT_WITHIN $PROBE_TIMEOUT 2 peer_count
TEST $CLI_1 volume set all cluster.brick-multiplex on
TEST $CLI_1 volume create $V0 replica 3 $H1:$B1/${V0}1 $H2:$B2/${V0}1 $H3:$B3/${V0}1
TEST $CLI_1 volume start $V0
EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status_1 $V0 $H1 $B1/${V0}1
EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status_1 $V0 $H2 $B2/${V0}1
EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status_1 $V0 $H3 $B3/${V0}1
#bug-1560957 - replace brick followed by an add-brick in a brick mux setup
#brings down one brick instance
kill_glusterd 3
EXPECT_WITHIN $PROBE_TIMEOUT 1 peer_count
TEST $CLI_1 volume replace-brick $V0 $H1:$B1/${V0}1 $H1:$B1/${V0}1_new commit force
This is where the test always fails, with "volume replace-brick: failed: Commit failed on localhost. Please check log file for details."
TEST $glusterd_3
EXPECT_WITHIN $PROBE_TIMEOUT 2 peer_count
TEST $CLI_1 volume add-brick $V0 replica 3 $H1:$B1/${V0}3 $H2:$B1/${V0}3 $H3:$B1/${V0}3 commit force
EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status_1 $V0 $H3 $B1/${V0}1
cleanup;
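For what it's worth, I run the .t standalone from the source tree like this (the file name and location below are just where I happened to put it):
# Run a single regression test verbosely from a glusterfs checkout; the
# test path here is hypothetical.
prove -v tests/bugs/glusterd/bug-1560957-replace-brick-mux.t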
glusterd log from the 1st node:
[2018-03-27 13:11:58.630845] E [MSGID: 106053] [glusterd-utils.c:13889:glusterd_handle_replicate_brick_ops] 0-management: Failed to set extended attribute trusted.replace-brick : Transport endpoint is not connected [Transport endpoint is not connected]
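In case anyone wants to pull the same line out quickly, grepping the glusterd log works; /var/log/glusterfs/glusterd.log is the default location, so adjust the path if launch_cluster redirects logs per node:
# Show the most recent replace-brick related lines from glusterd's log.
grep 'replace-brick' /var/log/glusterfs/glusterd.log | tail -n 5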