You'd basically have to copy the content of /var/lib/glusterd from fs001 to fs003 without overwriting fs003's node-specific details. Please make sure you don't touch the glusterd.info file or the content of /var/lib/glusterd/peers on fs003; the rest can be copied. After that I expect glusterd will come up.
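Something along these lines should do it (just a sketch; adjust the hostname you use to reach fs003, and keep a backup of fs003's current /var/lib/glusterd first):

    # on fs003: keep a copy of the current (broken) config, just in case
    cp -a /var/lib/glusterd /var/lib/glusterd.bak

    # on fs001: push everything except the node-specific bits; without --delete,
    # fs003's own glusterd.info and peers/ are left untouched
    rsync -av --exclude=glusterd.info --exclude=peers /var/lib/glusterd/ fs003:/var/lib/glusterd/

    # on fs003: start glusterd again (or "service glusterd start" on older init systems)
    systemctl start glusterd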
On Fri, 26 May 2017 at 20:30, Jarsulic, Michael [CRI] <mjarsulic@xxxxxxxxxxxxxxxx> wrote:
Here is some further information on this issue:
The version of gluster we are using is 3.7.6.
Also, the error I found in the cmd history is:
[2017-05-26 04:28:28.332700] : volume remove-brick hpcscratch cri16fs001-ib:/data/brick1/scratch commit : FAILED : Commit failed on cri16fs003-ib. Please check log file for details.
I did not notice this at the time and attempted to remove the next brick in order to migrate its data off the system. This left the servers in the following state:
fs001 - /var/lib/glusterd/vols/hpcscratch/info
type=0
count=3
status=1
sub_count=0
stripe_count=1
replica_count=1
disperse_count=0
redundancy_count=0
version=42
transport-type=0
volume-id=80b8eeed-1e72-45b9-8402-e01ae0130105
…
op-version=30700
client-op-version=3
quota-version=0
parent_volname=N/A
restored_from_snap=00000000-0000-0000-0000-000000000000
snap-max-hard-limit=256
server.event-threads=8
performance.client-io-threads=on
client.event-threads=8
performance.cache-size=32MB
performance.readdir-ahead=on
brick-0=cri16fs001-ib:-data-brick2-scratch
brick-1=cri16fs003-ib:-data-brick5-scratch
brick-2=cri16fs003-ib:-data-brick6-scratch
fs003 - /var/lib/glusterd/vols/hpcscratch/info
type=0
count=4
status=1
sub_count=0
stripe_count=1
replica_count=1
disperse_count=0
redundancy_count=0
version=35
transport-type=0
volume-id=80b8eeed-1e72-45b9-8402-e01ae0130105
…
op-version=30700
client-op-version=3
quota-version=0
parent_volname=N/A
restored_from_snap=00000000-0000-0000-0000-000000000000
snap-max-hard-limit=256
performance.readdir-ahead=on
performance.cache-size=32MB
client.event-threads=8
performance.client-io-threads=on
server.event-threads=8
brick-0=cri16fs001-ib:-data-brick1-scratch
brick-1=cri16fs001-ib:-data-brick2-scratch
brick-2=cri16fs003-ib:-data-brick5-scratch
brick-3=cri16fs003-ib:-data-brick6-scratch
fs001 - /var/lib/glusterd/vols/hpcscratch/node_state.info
rebalance_status=5
status=4
rebalance_op=0
rebalance-id=00000000-0000-0000-0000-000000000000
brick1=cri16fs001-ib:/data/brick2/scratch
count=1
fs003 - /var/lib/glusterd/vols/hpcscratch/node_state.info
rebalance_status=1
status=0
rebalance_op=9
rebalance-id=0184577f-eb64-4af9-924d-91ead0605a1e
brick1=cri16fs001-ib:/data/brick1/scratch
count=1
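For reference, a quick way to put the two copies of the volume info file side by side (hostnames as above; the /tmp filenames are arbitrary):

    ssh cri16fs001-ib cat /var/lib/glusterd/vols/hpcscratch/info > /tmp/info.fs001
    ssh cri16fs003-ib cat /var/lib/glusterd/vols/hpcscratch/info > /tmp/info.fs003
    diff /tmp/info.fs001 /tmp/info.fs003

The diff shows fs003 still carrying cri16fs001-ib:-data-brick1-scratch as brick-0 (count=4, version=35), while fs001 is already at count=3, version=42.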
--
Mike Jarsulic
On 5/26/17, 8:22 AM, "Jarsulic, Michael [CRI]" <mjarsulic@xxxxxxxxxxxxxxxx> wrote:
Recently, I had some problems with the OS hard drives in my gluster servers and took one of my systems down for maintenance. The first step was to remove one of the bricks (brick1) hosted on that server (fs001). The data migration completed successfully last night. After that, I went to commit the changes and the commit failed. Since then, glusterd will not start on one of my servers (fs003). The glusterd log on fs003 shows the following errors every time it tries to start:
[2017-05-26 04:37:21.358932] I [MSGID: 100030] [glusterfsd.c:2318:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6 (args: /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid)
[2017-05-26 04:37:21.382630] I [MSGID: 106478] [glusterd.c:1350:init] 0-management: Maximum allowed open file descriptors set to 65536
[2017-05-26 04:37:21.382712] I [MSGID: 106479] [glusterd.c:1399:init] 0-management: Using /var/lib/glusterd as working directory
[2017-05-26 04:37:21.422858] I [MSGID: 106228] [glusterd.c:433:glusterd_check_gsync_present] 0-glusterd: geo-replication module not installed in the system [No such file or directory]
[2017-05-26 04:37:21.450123] I [MSGID: 106513] [glusterd-store.c:2047:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 30706
[2017-05-26 04:37:21.463812] E [MSGID: 101032] [store.c:434:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/vols/hpcscratch/bricks/cri16fs001-ib:-data-brick1-scratch. [No such file or directory]
[2017-05-26 04:37:21.463866] E [MSGID: 106201] [glusterd-store.c:3042:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: hpcscratch
[2017-05-26 04:37:21.463919] E [MSGID: 101019] [xlator.c:428:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2017-05-26 04:37:21.463943] E [graph.c:322:glusterfs_graph_init] 0-management: initializing translator failed
[2017-05-26 04:37:21.463970] E [graph.c:661:glusterfs_graph_activate] 0-graph: init failed
[2017-05-26 04:37:21.466703] W [glusterfsd.c:1236:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xda) [0x405cba] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x116) [0x405b96] -->/usr/sbin/glusterd(cleanup_and_exit+0x65) [0x4059d5] ) 0-: received signum (0), shutting down
The volume is distribute-only. It looks to me like glusterd on fs003 is still expecting brick1 on fs001 to be part of the volume. Is there any way to recover from this? Is there any more information I can provide?
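For reference, what I ran was essentially the standard remove-brick workflow (volume and brick names as in the cmd history above); the commit step is the one that failed on cri16fs003-ib:

    gluster volume remove-brick hpcscratch cri16fs001-ib:/data/brick1/scratch start
    gluster volume remove-brick hpcscratch cri16fs001-ib:/data/brick1/scratch status
    gluster volume remove-brick hpcscratch cri16fs001-ib:/data/brick1/scratch commit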
--
Mike Jarsulic
--
- Atin (atinm)
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users