Re: Gluster replace-brick issues (Distributed-Replicate)

Are those missing files maybe DHT link files? Mode 1000, size 0.
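
For example, something like this run against the new brick should list them (brick path assumed), and the linkto xattr should confirm it:

find /exp/br02/brick2 -name .glusterfs -prune -o -type f -perm 1000 -size 0 -print
getfattr -n trusted.glusterfs.dht.linkto -e text /exp/br02/brick2/FILE    # FILE is just a placeholder for one of the hits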

On February 14, 2015 12:58:12 AM PST, Thomas Holkenbrink <thomas.holkenbrink@xxxxxxxxxxxxxx> wrote:

We have tried to migrate a brick from one server to another using the following commands, but the data is NOT being replicated and the brick no longer shows up.

Gluster still appears to be working, but the bricks are no longer balanced, and I still need to add the other brick on Server3, which I don’t want to do until after Server1:/exp/br02/brick2 has been replicated.

 

This is the command that created the original volume:

[root@Server1 ~]# gluster volume create Storage1 replica 2 transport tcp Server1:/exp/br01/brick1 Server2:/exp/br01/brick1 Server1:/exp/br02/brick2 Server2:/exp/br02/brick2
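
For reference, with replica 2 the bricks pair up in the order they are listed, which matches the 2 x 2 = 4 layout shown below:

    replica set 1:  Server1:/exp/br01/brick1  <->  Server2:/exp/br01/brick1
    replica set 2:  Server1:/exp/br02/brick2  <->  Server2:/exp/br02/brick2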

 

 

This is the current configuration BEFORE the migration. Server3 has been peer-probed successfully, but nothing else has been done with it yet.

[root@Server1 ~]# gluster --version

glusterfs 3.6.2 built on Jan 22 2015 12:58:11

 

[root@Server1 ~]# gluster volume status

Status of volume: Storage1

Gluster process                 Port    Online  Pid

------------------------------------------------------------------------------

Brick Server1:/exp/br01/brick1  49152   Y       2167

Brick Server2:/exp/br01/brick1  49152   Y       2192

Brick Server1:/exp/br02/brick2  49153   Y       2172   <--- this is the one that goes missing

Brick Server2:/exp/br02/brick2  49153   Y       2193

NFS Server on localhost         2049    Y       2181

Self-heal Daemon on localhost   N/A     Y       2186

NFS Server on Server2           2049    Y       2205

Self-heal Daemon on Server2     N/A     Y       2210

NFS Server on Server3           2049    Y       6015

Self-heal Daemon on Server3     N/A     Y       6016

 

Task Status of Volume Storage1

------------------------------------------------------------------------------

There are no active volume tasks

[root@Server1 ~]# gluster volume info

 

Volume Name: Storage1

Type: Distributed-Replicate

Volume ID: 9616ce42-48bd-4fe3-883f-decd6c4fcd00

Status: Started

Number of Bricks: 2 x 2 = 4

Transport-type: tcp

Bricks:

Brick1: Server1:/exp/br01/brick1

Brick2: Server2:/exp/br01/brick1

Brick3: Server1:/exp/br02/brick2

Brick4: Server2:/exp/br02/brick2

Options Reconfigured:

diagnostics.brick-log-level: WARNING

diagnostics.client-log-level: WARNING

cluster.entry-self-heal: off

cluster.data-self-heal: off

cluster.metadata-self-heal: off

performance.cache-size: 1024MB

performance.cache-max-file-size: 2MB

performance.cache-refresh-timeout: 1

performance.stat-prefetch: off

performance.read-ahead: on

performance.quick-read: off

performance.write-behind-window-size: 4MB

performance.flush-behind: on

performance.write-behind: on

performance.io-thread-count: 32

performance.io-cache: on

network.ping-timeout: 2

nfs.addr-namelookup: off

performance.strict-write-ordering: on

[root@Server1 ~]#

 

 

 

So we start the migration of the brick to the new server using the replace-brick command:

[root@Server1 ~]# volname=Storage1

 

[root@Server1 ~]# from=Server1:/exp/br02/brick2

 

[root@Server1 ~]# to=Server3:/exp/br02/brick2

 

[root@Server1 ~]# gluster volume replace-brick $volname $from $to start

All replace-brick commands except commit force are deprecated. Do you want to continue? (y/n) y

volume replace-brick: success: replace-brick started successfully

ID: 0062d555-e7eb-4ebe-a264-7e0baf6e7546

 

 

[root@Server1 ~]# gluster volume replace-brick $volname $from $to status

All replace-brick commands except commit force are deprecated. Do you want to continue? (y/n) y

volume replace-brick: success: Number of files migrated = 281   Migration complete

 

At this point everything seems to be in order with no outstanding issues.

 

[root@Server1 ~]# gluster volume status

Status of volume: Storage1

Gluster process                 Port    Online  Pid

------------------------------------------------------------------------------

Brick Server1:/exp/br01/brick1  49152   Y       2167

Brick Server2:/exp/br01/brick1  49152   Y       2192

Brick Server1:/exp/br02/brick2  49153   Y       27557

Brick Server2:/exp/br02/brick2  49153   Y       2193

NFS Server on localhost         2049    Y       27562

Self-heal Daemon on localhost   N/A     Y       2186

NFS Server on Server2           2049    Y       2205

Self-heal Daemon on Server2     N/A     Y       2210

NFS Server on Server3           2049    Y       6015

Self-heal Daemon on Server3     N/A     Y       6016

 

Task Status of Volume Storage1

------------------------------------------------------------------------------

Task                 : Replace brick

ID                   : 0062d555-e7eb-4ebe-a264-7e0baf6e7546

Source Brick         : Server1:/exp/br02/brick2

Destination Brick    : Server3:/exp/br02/brick2

Status               : completed

 

The volume reports that the replace-brick task completed, so the next step is to commit the change:

 

[root@Server1 ~]# gluster volume replace-brick $volname $from $to commit

All replace-brick commands except commit force are deprecated. Do you want to continue? (y/n) y

volume replace-brick: success: replace-brick commit successful

 

At this point, when I look at the status, the OLD brick (Server1:/exp/br02/brick2) is missing AND the new brick doesn’t show up either… panic!

 

[root@Server1 ~]# gluster volume status

Status of volume: Storage1

Gluster process                 Port    Online  Pid

------------------------------------------------------------------------------

Brick Server1:/exp/br01/brick1  49152   Y       2167

Brick Server2:/exp/br01/brick1  49152   Y       2192

Brick Server2:/exp/br02/brick2  49153   Y       2193

NFS Server on localhost         2049    Y       28906

Self-heal Daemon on localhost   N/A     Y       28911

NFS Server on Server2           2049    Y       2205

Self-heal Daemon on Server2     N/A     Y       2210

NFS Server on Server3           2049    Y       6015

Self-heal Daemon on Server3     N/A     Y       6016

 

Task Status of Volume Storage1

------------------------------------------------------------------------------

There are no active volume tasks

 

 

After the commit, Server1 no longer lists the task, yet Server2 and Server3 still show this:

 

[root@Server2 ~]# gluster volume status

Status of volume: Storage1

Gluster process                 Port    Online  Pid

------------------------------------------------------------------------------

Brick Server1:/exp/br01/brick1  49152   Y       2167

Brick Server2:/exp/br01/brick1  49152   Y       2192

Brick Server2:/exp/br02/brick2  49153   Y       2193

NFS Server on localhost         2049    Y       2205

Self-heal Daemon on localhost   N/A     Y       2210

NFS Server on 10.45.16.17       2049    Y       28906

Self-heal Daemon on 10.45.16.17 N/A     Y       28911

NFS Server on server3           2049    Y       6015

Self-heal Daemon on server3     N/A     Y       6016

 

Task Status of Volume Storage1

------------------------------------------------------------------------------

Task                 : Replace brick

ID                   : 0062d555-e7eb-4ebe-a264-7e0baf6e7546

Source Brick         : Server1:/exp/br02/brick2

Destination Brick    : server3:/exp/br02/brick2

Status               : completed

 

 

If I browse the brick on Server3, it is NOT empty, but it is missing A LOT. It’s as if the replace-brick migration stopped and never resumed.

The replace-brick status reported “Number of files migrated = 281   Migration complete”, but when I look at the brick on Server3 I get:

       [root@Server3 brick2]# find . -type f -print | wc -l

16

 

I’m missing 265 files. (They still exist on the OLD brick, but how can I move them over?)
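
(A rough way to list exactly which files never made it across, assuming these paths and ignoring the .glusterfs housekeeping directory, would be something like:)

[root@Server1 ~]# find /exp/br02/brick2 -name .glusterfs -prune -o -type f -print | sort > /tmp/br02.server1
[root@Server3 ~]# find /exp/br02/brick2 -name .glusterfs -prune -o -type f -print | sort > /tmp/br02.server3
# copy one list to the other server, then:
diff /tmp/br02.server1 /tmp/br02.server3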

 

If I try to add the old brick back, paired with another brick on the new server, like so:

[root@Server1 ~]# gluster volume add-brick Storage1 Server1:/exp/br02/brick2 Server3:/exp/br01/brick1

volume add-brick: failed: /exp/br02/brick2 is already part of a volume

 

I’m fearful of running:

[root@Server1 ~]# setfattr -n trusted.glusterfs.volume-id -v 0x$(grep volume-id /var/lib/glusterd/vols/$volname/info | cut -d= -f2 | sed 's/-//g') /exp/br02/brick2

even though it should allow me to add the brick afterwards.
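
(Before changing anything, the existing extended attributes on the old brick root can at least be inspected, e.g.:)

[root@Server1 ~]# getfattr -d -m . -e hex /exp/br02/brick2

That would typically show trusted.glusterfs.volume-id, trusted.gfid and the leftover AFR/DHT xattrs from when the brick was part of the volume.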

 

Gluster Heal info returns:

[root@Server2 ~]# gluster volume heal Storage1 info

Brick Server1:/exp/br01/brick1/

Number of entries: 0

 

Brick Server2:/exp/br01/brick1/

Number of entries: 0

 

Brick Server1:/exp/br02/brick2

Status: Transport endpoint is not connected

 

Brick Server2:/exp/br02/brick2/

Number of entries: 0

 

I have restarted glusterd numerous times.

 

 

At this point I’m not sure where to go from here. I know that Server1:/exp/br02/brick2 still has all the data, and Server3:/exp/br02/brick2 is incomplete.

 

How do I actually get the brick to replicate?

How can I add Server1:/exp/br02/brick2 back into the volume if I can’t replicate it or re-add it?

How can I fix this to get it back into a replicated state between the three servers?

 

Thomas

 

 

 

 

----DATA----

 

Gluster volume info at this point

[root@Server1 ~]# gluster volume info

 

Volume Name: Storage1

Type: Distributed-Replicate

Volume ID: 9616ce42-48bd-4fe3-883f-decd6c4fcd00

Status: Started

Number of Bricks: 2 x 2 = 4

Transport-type: tcp

Bricks:

Brick1: Server1:/exp/br01/brick1

Brick2: Server2:/exp/br01/brick1

Brick3: server3:/exp/br02/brick2

Brick4: Server2:/exp/br02/brick2

Options Reconfigured:

diagnostics.brick-log-level: WARNING

diagnostics.client-log-level: WARNING

cluster.entry-self-heal: off

cluster.data-self-heal: off

cluster.metadata-self-heal: off

performance.cache-size: 1024MB

performance.cache-max-file-size: 2MB

performance.cache-refresh-timeout: 1

performance.stat-prefetch: off

performance.read-ahead: on

performance.quick-read: off

performance.write-behind-window-size: 4MB

performance.flush-behind: on

performance.write-behind: on

performance.io-thread-count: 32

performance.io-cache: on

network.ping-timeout: 2

nfs.addr-namelookup: off

performance.strict-write-ordering: on

[root@Server1 ~]#

 

[root@server3 brick2]# gluster volume heal Storage1 info

Brick Server1:/exp/br01/brick1/

Number of entries: 0

 

Brick Server2:/exp/br01/brick1/

Number of entries: 0

 

Brick Server3:/exp/br02/brick2/

Number of entries: 0

 

Brick Server2:/exp/br02/brick2/

Number of entries: 0

 

 

Gluster log (there are a few errors, but I’m not sure how to decipher them):

 

[2015-02-14 06:29:19.862809] I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick Server1:/exp/br02/brick2 has disconnected from glusterd.

[2015-02-14 06:29:19.862836] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/7565ec897c6454bd3e2f4800250a7221.socket failed (Invalid argument)

[2015-02-14 06:29:19.862853] I [MSGID: 106006] [glusterd-handler.c:4257:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd.

[2015-02-14 06:29:19.953762] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /exp/br02/brick2 on port 49153

[2015-02-14 06:31:12.977450] I [glusterd-replace-brick.c:99:__glusterd_handle_replace_brick] 0-management: Received replace brick req

[2015-02-14 06:31:12.977495] I [glusterd-replace-brick.c:154:__glusterd_handle_replace_brick] 0-management: Received replace brick status request

[2015-02-14 06:31:13.048852] I [glusterd-replace-brick.c:1412:rb_update_srcbrick_port] 0-: adding src-brick port no

[2015-02-14 06:31:19.588380] I [glusterd-replace-brick.c:99:__glusterd_handle_replace_brick] 0-management: Received replace brick req

[2015-02-14 06:31:19.588422] I [glusterd-replace-brick.c:154:__glusterd_handle_replace_brick] 0-management: Received replace brick status request

[2015-02-14 06:31:19.661101] I [glusterd-replace-brick.c:1412:rb_update_srcbrick_port] 0-: adding src-brick port no

[2015-02-14 06:31:45.115355] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed

[2015-02-14 06:31:45.118597] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1

[2015-02-14 06:32:10.956357] I [glusterd-replace-brick.c:99:__glusterd_handle_replace_brick] 0-management: Received replace brick req

[2015-02-14 06:32:10.956385] I [glusterd-replace-brick.c:154:__glusterd_handle_replace_brick] 0-management: Received replace brick commit request

[2015-02-14 06:32:11.028472] I [glusterd-replace-brick.c:1412:rb_update_srcbrick_port] 0-: adding src-brick port no

[2015-02-14 06:32:12.122552] I [glusterd-utils.c:6276:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV3 successfully

[2015-02-14 06:32:12.131836] I [glusterd-utils.c:6281:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV1 successfully

[2015-02-14 06:32:12.141107] I [glusterd-utils.c:6286:glusterd_nfs_pmap_deregister] 0-: De-registered NFSV3 successfully

[2015-02-14 06:32:12.150375] I [glusterd-utils.c:6291:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v4 successfully

[2015-02-14 06:32:12.159630] I [glusterd-utils.c:6296:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v1 successfully

[2015-02-14 06:32:12.168889] I [glusterd-utils.c:6301:glusterd_nfs_pmap_deregister] 0-: De-registered ACL v3 successfully

[2015-02-14 06:32:13.254689] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600

[2015-02-14 06:32:13.254799] W [socket.c:2992:socket_connect] 0-management: Ignore failed connection attempt on , (No such file or directory)

[2015-02-14 06:32:13.257790] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600

[2015-02-14 06:32:13.257908] W [socket.c:2992:socket_connect] 0-management: Ignore failed connection attempt on , (No such file or directory)

[2015-02-14 06:32:13.258031] W [socket.c:611:__socket_rwv] 0-socket.management: writev on 127.0.0.1:1019 failed (Broken pipe)

[2015-02-14 06:32:13.258111] W [socket.c:611:__socket_rwv] 0-socket.management: writev on 127.0.0.1:1021 failed (Broken pipe)

[2015-02-14 06:32:13.258130] W [socket.c:611:__socket_rwv] 0-socket.management: writev on 10.45.16.17:1018 failed (Broken pipe)

[2015-02-14 06:32:13.711948] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=588 max=0 total=0

[2015-02-14 06:32:13.711967] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=124 max=0 total=0

[2015-02-14 06:32:13.712008] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=588 max=0 total=0

[2015-02-14 06:32:13.712021] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=124 max=0 total=0

[2015-02-14 06:32:13.731311] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=588 max=0 total=0

[2015-02-14 06:32:13.731326] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=124 max=0 total=0

[2015-02-14 06:32:13.731356] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /exp/br02/brick2 on port 49153

[2015-02-14 06:32:13.823129] I [socket.c:2344:socket_event_handler] 0-transport: disconnecting now

[2015-02-14 06:32:13.840668] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/7565ec897c6454bd3e2f4800250a7221.socket failed (Invalid argument)

[2015-02-14 06:32:13.840693] I [MSGID: 106006] [glusterd-handler.c:4257:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd.

[2015-02-14 06:32:13.840712] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/ac4c043d3c6a2e5159c86e8c75c51829.socket failed (Invalid argument)

[2015-02-14 06:32:13.840728] I [MSGID: 106006] [glusterd-handler.c:4257:__glusterd_nodesvc_rpc_notify] 0-management: glustershd has disconnected from glusterd.

[2015-02-14 06:32:14.729667] E [glusterd-rpc-ops.c:1169:__glusterd_commit_op_cbk] 0-management: Received commit RJT from uuid: 294aa603-ec24-44b9-864b-0fe743faa8d9

[2015-02-14 06:32:14.743623] E [glusterd-rpc-ops.c:1169:__glusterd_commit_op_cbk] 0-management: Received commit RJT from uuid: 92aabaf4-4b6c-48da-82b6-c465aff2ec6d

[2015-02-14 06:32:18.762975] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed

[2015-02-14 06:32:18.764552] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1

[2015-02-14 06:32:18.769051] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.

[2015-02-14 06:32:18.769070] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick

[2015-02-14 06:32:18.771095] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.

[2015-02-14 06:32:18.771108] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick

[2015-02-14 06:32:48.570796] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed

[2015-02-14 06:32:48.572352] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1

[2015-02-14 06:32:48.576899] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.

[2015-02-14 06:32:48.576918] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick

[2015-02-14 06:32:48.578982] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.

[2015-02-14 06:32:48.579001] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick

[2015-02-14 06:36:57.840738] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed

[2015-02-14 06:36:57.842370] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1

[2015-02-14 06:36:57.846919] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.

[2015-02-14 06:36:57.846941] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick

[2015-02-14 06:36:57.849026] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.

[2015-02-14 06:36:57.849046] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick

[2015-02-14 06:37:20.208081] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed

[2015-02-14 06:37:20.211279] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1

[2015-02-14 06:37:20.215792] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.

[2015-02-14 06:37:20.215809] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick

[2015-02-14 06:37:20.216295] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.

[2015-02-14 06:37:20.216308] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick




--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users

