Re: Volume stuck unable to add a brick

Hi Boris,

Thank you for providing the logs.
The problem here is caused by the "auth.allow: 127.0.0.1" setting on the volume.
When you try to add a new brick to the volume, the replication module internally creates a temporary mount and tries to set some metadata on the existing bricks to mark the pending heal for the new brick. Because of the auth.allow setting, that mount gets permission errors, as seen in the logs below, leading to the add-brick failure.

From data-gluster-dockervols.log-webserver9:
[2019-04-15 14:00:34.226838] I [addr.c:55:compare_addr_and_update] 0-/data/gluster/dockervols: allowed = "127.0.0.1", received addr = "192.168.200.147"
[2019-04-15 14:00:34.226895] E [MSGID: 115004] [authenticate.c:224:gf_authenticate] 0-auth: no authentication module is interested in accepting remote-client (null)
[2019-04-15 14:00:34.227129] E [MSGID: 115001] [server-handshake.c:848:server_setvolume] 0-dockervols-server: Cannot authenticate client from webserver8.cast.org-55674-2019/04/15-14:00:20:495333-dockervols-client-2-0-0 3.12.2 [Permission denied]

From dockervols-add-brick-mount.log:
[2019-04-15 14:00:20.672033] W [MSGID: 114043] [client-handshake.c:1109:client_setvolume_cbk] 0-dockervols-client-2: failed to set the volume [Permission denied]
[2019-04-15 14:00:20.672102] W [MSGID: 114007] [client-handshake.c:1138:client_setvolume_cbk] 0-dockervols-client-2: failed to get 'process-uuid' from reply dict [Invalid argument]
[2019-04-15 14:00:20.672129] E [MSGID: 114044] [client-handshake.c:1144:client_setvolume_cbk] 0-dockervols-client-2: SETVOLUME on remote-host failed: Authentication failed [Permission denied]
[2019-04-15 14:00:20.672151] I [MSGID: 114049] [client-handshake.c:1258:client_setvolume_cbk] 0-dockervols-client-2: sending AUTH_FAILED event
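
(Side note, in case it is useful: auth.allow takes a comma-separated list of addresses and accepts wildcards, so in principle you could also permit the peer addresses seen in the first log instead of resetting the option. A sketch, assuming your peers all sit on the 192.168.200.0/24 network shown there:

    # allow localhost plus the network the peers connect from
    sudo gluster volume set dockervols auth.allow "127.0.0.1,192.168.200.*"

The reset/add/re-set sequence below is the simpler workaround, though.)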

This is a known issue which we are planning to fix. For the time being there is a workaround (see the example session below):
- Before you try adding the brick, reset the auth.allow option to its default, i.e. "*", by running "gluster v reset <volname> auth.allow".
- Add the brick.
- After it succeeds, set the auth.allow option back to its previous value.
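
For this thread's volume, the whole sequence would look something like the sketch below (volume name, hostname, path and replica count taken from the outputs later in this thread; adjust them to your setup):

    # 1. Temporarily reset auth.allow to its default ("*")
    sudo gluster volume reset dockervols auth.allow

    # 2. Add the new brick (the same command that was failing)
    sudo gluster volume add-brick dockervols replica 4 webserver8:/data/gluster/dockervols force

    # 3. Restore the previous auth.allow value
    sudo gluster volume set dockervols auth.allow 127.0.0.1

You should be able to verify the option before and after with "gluster volume get dockervols auth.allow".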

Regards,
Karthik

On Tue, Apr 16, 2019 at 5:20 PM Boris Goldowsky <bgoldowsky@xxxxxxxx> wrote:

OK, log files attached.

 

Boris

 

 

From: Karthik Subrahmanya <ksubrahm@xxxxxxxxxx>
Date: Tuesday, April 16, 2019 at 2:52 AM
To: Atin Mukherjee <atin.mukherjee83@xxxxxxxxx>, Boris Goldowsky <bgoldowsky@xxxxxxxx>
Cc: Gluster-users <gluster-users@xxxxxxxxxxx>
Subject: Re: Volume stuck unable to add a brick

 

 

 

On Mon, Apr 15, 2019 at 9:43 PM Atin Mukherjee <atin.mukherjee83@xxxxxxxxx> wrote:

+Karthik Subrahmanya 

 

Didn't we fix this problem recently? "Failed to set extended attribute" indicates that the temp mount is failing and we don't have a quorum number of bricks up.

 

We had two fixes which handle two kinds of add-brick scenarios.

[1] Fails add-brick when increasing the replica count if any of the bricks is down, to avoid data loss. This can be overridden by using the force option.

[2] Allows add-brick to set the extended attributes through the temp mount if the volume is already mounted (has clients).

 

They are on version 3.12.2, so patch [1] is present there. But since they are using the force option, there should not be any problem even if a brick is down. The error message they are getting is also different, so I guess it is not because of any brick being down.

Patch [2] is not present in 3.12.2, but this is not a conversion from a plain distribute to a replicate volume, so that scenario does not apply here.

It seems like they are hitting some other issue.

 

@Boris,

Can you attach the add-brick's temp mount log? The file name should look something like "dockervols-add-brick-mount.log". Can you also provide all the brick logs of that volume from around that time?
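
(If it helps in locating them: on a typical install the client and temp-mount logs sit directly under /var/log/glusterfs/ and the brick logs under /var/log/glusterfs/bricks/, named after the brick path with slashes replaced by dashes. A sketch, assuming default log locations:

    ls /var/log/glusterfs/dockervols-add-brick-mount.log
    ls /var/log/glusterfs/bricks/data-gluster-dockervols.log

Paths may differ if your distribution packages gluster differently.)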

 

 

Regards,

Karthik

 

Boris - what gluster version are you using?

 

 

 

On Mon, Apr 15, 2019 at 7:35 PM Boris Goldowsky <bgoldowsky@xxxxxxxx> wrote:

Atin, thank you for the reply.  Here are all of those pieces of information:

 

[bgoldowsky@webserver9 ~]$ gluster --version

glusterfs 3.12.2

(same on all nodes)

 

[bgoldowsky@webserver9 ~]$ sudo gluster peer status

Number of Peers: 3

 

Hostname: webserver11.cast.org

Uuid: c2b147fd-cab4-4859-9922-db5730f8549d

State: Peer in Cluster (Connected)

 

Hostname: webserver1.cast.org

Uuid: 4b918f65-2c9d-478e-8648-81d1d6526d4c

State: Peer in Cluster (Connected)

Other names:

192.168.200.131

webserver1

 

Hostname: webserver8.cast.org

Uuid: be2f568b-61c5-4016-9264-083e4e6453a2

State: Peer in Cluster (Connected)

Other names:

webserver8

 

[bgoldowsky@webserver1 ~]$ sudo gluster v info 

Volume Name: dockervols

Type: Replicate

Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a

Status: Started

Snapshot Count: 0

Number of Bricks: 1 x 3 = 3

Transport-type: tcp

Bricks:

Brick1: webserver1:/data/gluster/dockervols

Brick2: webserver11:/data/gluster/dockervols

Brick3: webserver9:/data/gluster/dockervols

Options Reconfigured:

nfs.disable: on

transport.address-family: inet

auth.allow: 127.0.0.1

 

Volume Name: testvol

Type: Replicate

Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332

Status: Started

Snapshot Count: 0

Number of Bricks: 1 x 4 = 4

Transport-type: tcp

Bricks:

Brick1: webserver1:/data/gluster/testvol

Brick2: webserver9:/data/gluster/testvol

Brick3: webserver11:/data/gluster/testvol

Brick4: webserver8:/data/gluster/testvol

Options Reconfigured:

transport.address-family: inet

nfs.disable: on

 

[bgoldowsky@webserver8 ~]$ sudo gluster v info

Volume Name: dockervols

Type: Replicate

Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a

Status: Started

Snapshot Count: 0

Number of Bricks: 1 x 3 = 3

Transport-type: tcp

Bricks:

Brick1: webserver1:/data/gluster/dockervols

Brick2: webserver11:/data/gluster/dockervols

Brick3: webserver9:/data/gluster/dockervols

Options Reconfigured:

nfs.disable: on

transport.address-family: inet

auth.allow: 127.0.0.1

 

Volume Name: testvol

Type: Replicate

Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332

Status: Started

Snapshot Count: 0

Number of Bricks: 1 x 4 = 4

Transport-type: tcp

Bricks:

Brick1: webserver1:/data/gluster/testvol

Brick2: webserver9:/data/gluster/testvol

Brick3: webserver11:/data/gluster/testvol

Brick4: webserver8:/data/gluster/testvol

Options Reconfigured:

nfs.disable: on

transport.address-family: inet

 

[bgoldowsky@webserver9 ~]$ sudo gluster v info

Volume Name: dockervols

Type: Replicate

Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a

Status: Started

Snapshot Count: 0

Number of Bricks: 1 x 3 = 3

Transport-type: tcp

Bricks:

Brick1: webserver1:/data/gluster/dockervols

Brick2: webserver11:/data/gluster/dockervols

Brick3: webserver9:/data/gluster/dockervols

Options Reconfigured:

nfs.disable: on

transport.address-family: inet

auth.allow: 127.0.0.1

 

Volume Name: testvol

Type: Replicate

Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332

Status: Started

Snapshot Count: 0

Number of Bricks: 1 x 4 = 4

Transport-type: tcp

Bricks:

Brick1: webserver1:/data/gluster/testvol

Brick2: webserver9:/data/gluster/testvol

Brick3: webserver11:/data/gluster/testvol

Brick4: webserver8:/data/gluster/testvol

Options Reconfigured:

nfs.disable: on

transport.address-family: inet

 

[bgoldowsky@webserver11 ~]$ sudo gluster v info

Volume Name: dockervols

Type: Replicate

Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a

Status: Started

Snapshot Count: 0

Number of Bricks: 1 x 3 = 3

Transport-type: tcp

Bricks:

Brick1: webserver1:/data/gluster/dockervols

Brick2: webserver11:/data/gluster/dockervols

Brick3: webserver9:/data/gluster/dockervols

Options Reconfigured:

auth.allow: 127.0.0.1

transport.address-family: inet

nfs.disable: on

 

Volume Name: testvol

Type: Replicate

Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332

Status: Started

Snapshot Count: 0

Number of Bricks: 1 x 4 = 4

Transport-type: tcp

Bricks:

Brick1: webserver1:/data/gluster/testvol

Brick2: webserver9:/data/gluster/testvol

Brick3: webserver11:/data/gluster/testvol

Brick4: webserver8:/data/gluster/testvol

Options Reconfigured:

transport.address-family: inet

nfs.disable: on

 

[bgoldowsky@webserver9 ~]$ sudo gluster volume add-brick dockervols replica 4 webserver8:/data/gluster/dockervols force

volume add-brick: failed: Commit failed on webserver8.cast.org. Please check log file for details.

 

Webserver8 glusterd.log:

 

[2019-04-15 13:55:42.338197] I [MSGID: 106488] [glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management: Received get vol req

The message "I [MSGID: 106488] [glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management: Received get vol req" repeated 2 times between [2019-04-15 13:55:42.338197] and [2019-04-15 13:55:42.341618]

[2019-04-15 14:00:20.445011] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x3a215) [0x7fe697764215] -->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0xe3e9d) [0x7fe69780de9d] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fe6a2d16ea5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/add-brick/pre/S28Quota-enable-root-xattr-heal.sh --volname=dockervols --version=1 --volume-op=add-brick --gd-workdir=/var/lib/glusterd

[2019-04-15 14:00:20.445148] I [MSGID: 106578] [glusterd-brick-ops.c:1354:glusterd_op_perform_add_bricks] 0-management: replica-count is set 4

[2019-04-15 14:00:20.445184] I [MSGID: 106578] [glusterd-brick-ops.c:1364:glusterd_op_perform_add_bricks] 0-management: type is set 0, need to change it

[2019-04-15 14:00:20.672347] E [MSGID: 106054] [glusterd-utils.c:13863:glusterd_handle_replicate_brick_ops] 0-management: Failed to set extended attribute trusted.add-brick : Transport endpoint is not connected [Transport endpoint is not connected]

[2019-04-15 14:00:20.693491] E [MSGID: 101042] [compat.c:569:gf_umount_lazy] 0-management: Lazy unmount of /tmp/mntmvdFGq [Transport endpoint is not connected]

[2019-04-15 14:00:20.693597] E [MSGID: 106074] [glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to add bricks

[2019-04-15 14:00:20.693637] E [MSGID: 106123] [glusterd-mgmt.c:312:gd_mgmt_v3_commit_fn] 0-management: Add-brick commit failed.

[2019-04-15 14:00:20.693667] E [MSGID: 106123] [glusterd-mgmt-handler.c:616:glusterd_handle_commit_fn] 0-management: commit failed on operation Add brick

 

Webserver11 log file:

 

[2019-04-15 13:56:29.563270] I [MSGID: 106488] [glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management: Received get vol req

The message "I [MSGID: 106488] [glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management: Received get vol req" repeated 2 times between [2019-04-15 13:56:29.563270] and [2019-04-15 13:56:29.566209]

[2019-04-15 14:00:33.996866] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x3a215) [0x7f36de924215] -->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0xe3e9d) [0x7f36de9cde9d] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f36e9ed6ea5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/add-brick/pre/S28Quota-enable-root-xattr-heal.sh --volname=dockervols --version=1 --volume-op=add-brick --gd-workdir=/var/lib/glusterd

[2019-04-15 14:00:33.996979] I [MSGID: 106578] [glusterd-brick-ops.c:1354:glusterd_op_perform_add_bricks] 0-management: replica-count is set 4

[2019-04-15 14:00:33.997004] I [MSGID: 106578] [glusterd-brick-ops.c:1364:glusterd_op_perform_add_bricks] 0-management: type is set 0, need to change it

[2019-04-15 14:00:34.013789] I [MSGID: 106132] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: nfs already stopped

[2019-04-15 14:00:34.013849] I [MSGID: 106568] [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: nfs service is stopped

[2019-04-15 14:00:34.017535] I [MSGID: 106568] [glusterd-proc-mgmt.c:88:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 6087

[2019-04-15 14:00:35.018783] I [MSGID: 106568] [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: glustershd service is stopped

[2019-04-15 14:00:35.018952] I [MSGID: 106567] [glusterd-svc-mgmt.c:211:glusterd_svc_start] 0-management: Starting glustershd service

[2019-04-15 14:00:35.028306] I [MSGID: 106132] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: bitd already stopped

[2019-04-15 14:00:35.028408] I [MSGID: 106568] [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: bitd service is stopped

[2019-04-15 14:00:35.028601] I [MSGID: 106132] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: scrub already stopped

[2019-04-15 14:00:35.028645] I [MSGID: 106568] [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: scrub service is stopped

 

Thank you for taking a look!

 

Boris

 

 

From: Atin Mukherjee <atin.mukherjee83@xxxxxxxxx>
Date: Friday, April 12, 2019 at 1:10 PM
To: Boris Goldowsky <bgoldowsky@xxxxxxxx>
Cc: Gluster-users <gluster-users@xxxxxxxxxxx>
Subject: Re: Volume stuck unable to add a brick

 

 

 

On Fri, 12 Apr 2019 at 22:32, Boris Goldowsky <bgoldowsky@xxxxxxxx> wrote:

I’ve got a replicated volume with three bricks (“1x3=3”); the idea is to have a common set of files that are locally available on all the machines (Scientific Linux 7, which is essentially CentOS 7) in a cluster.

 

I tried to add a fourth machine, so I used a command like this:

 

sudo gluster volume add-brick dockervols replica 4 webserver8:/data/gluster/dockervols force

 

but the result is:

volume add-brick: failed: Commit failed on webserver1. Please check log file for details.

Commit failed on webserver8. Please check log file for details.

Commit failed on webserver11. Please check log file for details.

 

Tried: removing the new brick (this also fails) and trying again.

Tried: checking the logs. The log files are not enlightening to me – I don’t know what’s normal and what’s not.

 

From webserver8 & webserver11, could you attach the glusterd log files?

 

Also please share following:

- gluster version? (gluster --version)

- Output of ‘gluster peer status’

- Output of ‘gluster v info’ from all 4 nodes.

 

Tried: deleting the brick directory from the previous attempt, so that it’s not in the way.

Tried: restarting gluster services

Tried: rebooting

Tried: setting up a new volume, replicated to all four machines. This works, so I’m assuming it’s not a networking issue. But it still fails with this existing volume that has the critical data in it.

 

Running out of ideas. Any suggestions?  Thank you!

 

Boris

 


--

--Atin

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
