On 02/17/2016 12:23 PM, Atin Mukherjee wrote:
>
>
> On 02/17/2016 12:08 PM, songxin wrote:
>>
>> Hi,
>> But I also don't know why glusterfsd can't be started by glusterd after B
>> node rebooted. The versions of glusterfs on A node and B node are both
>> 3.7.6. Can you explain this for me please?
> It's because GlusterD has failed to start on Node B. I've already
> asked you in another mail to provide the delta of gv0's info file to
> get to the root cause.

Please ignore this mail as I didn't read your previous reply!

>>
>> Thanks,
>> Xin
>>
>>
>> At 2016-02-17 14:30:21, "Anuradha Talur" <atalur@xxxxxxxxxx> wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "songxin" <songxin_1980@xxxxxxx>
>>>> To: "Atin Mukherjee" <amukherj@xxxxxxxxxx>
>>>> Cc: "Anuradha Talur" <atalur@xxxxxxxxxx>, gluster-users@xxxxxxxxxxx
>>>> Sent: Wednesday, February 17, 2016 11:44:14 AM
>>>> Subject: Re: Re: question about sync replicate volume after rebooting one node
>>>>
>>>> Hi,
>>>> The versions of glusterfs on A node and B node are both 3.7.6.
>>>> The time on B node is the same after rebooting because B node has no RTC.
>>>> Does that cause the problem?
>>>>
>>>> If I run "gluster volume start gv0 force" the glusterfsd can be started,
>>>> but "gluster volume start gv0" doesn't work.
>>>>
>>> Yes, there is a difference between volume start and volume start force.
>>> When a volume is already in the "Started" state, gluster volume start gv0
>>> won't do anything (meaning it doesn't bring up the dead bricks). When you
>>> say start force, the status of the glusterfsds is checked and any
>>> glusterfsd that is not running is spawned, which is the case here in the
>>> setup you have.
>>>>
>>>> The file /var/lib/glusterd/vols/gv0/info on B node is as below.
>>>> ...
>>>> type=2
>>>> count=2
>>>> status=1
>>>> sub_count=2
>>>> stripe_count=1
>>>> replica_count=2
>>>> disperse_count=0
>>>> redundancy_count=0
>>>> version=2
>>>> transport-type=0
>>>> volume-id=c4197371-6d01-4477-8cb2-384cda569c27
>>>> username=62e009ea-47c4-46b4-8e74-47cd9c199d94
>>>> password=ef600dcd-42c5-48fc-8004-d13a3102616b
>>>> op-version=3
>>>> client-op-version=3
>>>> quota-version=0
>>>> parent_volname=N/A
>>>> restored_from_snap=00000000-0000-0000-0000-000000000000
>>>> snap-max-hard-limit=256
>>>> performance.readdir-ahead=on
>>>> brick-0=128.224.162.255:-data-brick-gv0
>>>> brick-1=128.224.162.163:-home-wrsadmin-work-tmp-data-brick-gv0
>>>>
>>>> The file /var/lib/glusterd/vols/gv0/info on A node is as below.
>>>>
>>>> wrsadmin@pek-song1-d1:~/work/tmp$ sudo cat /var/lib/glusterd/vols/gv0/info
>>>> type=2
>>>> count=2
>>>> status=1
>>>> sub_count=2
>>>> stripe_count=1
>>>> replica_count=2
>>>> disperse_count=0
>>>> redundancy_count=0
>>>> version=2
>>>> transport-type=0
>>>> volume-id=c4197371-6d01-4477-8cb2-384cda569c27
>>>> username=62e009ea-47c4-46b4-8e74-47cd9c199d94
>>>> password=ef600dcd-42c5-48fc-8004-d13a3102616b
>>>> op-version=3
>>>> client-op-version=3
>>>> quota-version=0
>>>> parent_volname=N/A
>>>> restored_from_snap=00000000-0000-0000-0000-000000000000
>>>> snap-max-hard-limit=256
>>>> performance.readdir-ahead=on
>>>> brick-0=128.224.162.255:-data-brick-gv0
>>>> brick-1=128.224.162.163:-home-wrsadmin-work-tmp-data-brick-gv0
>>>>
>>>> Thanks,
>>>> Xin
>>>>
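If it helps, here is a minimal sketch of how the two copies of the volume metadata can be compared directly, assuming the default /var/lib/glusterd layout shown above; <peer-ip> is a placeholder and password-less ssh between the nodes is an assumption, not part of the setup described here:

    # run on one node; compare glusterd's stored volume checksum on both sides
    cat /var/lib/glusterd/vols/gv0/cksum
    ssh <peer-ip> cat /var/lib/glusterd/vols/gv0/cksum

    # diff the info files byte for byte, in case a whitespace or ordering
    # difference is not visible when reading them side by side
    ssh <peer-ip> cat /var/lib/glusterd/vols/gv0/info | diff /var/lib/glusterd/vols/gv0/info -

A byte-for-byte diff is worth doing even when the files look identical, since the cksum comparison done during the peer handshake is strict.
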
>>>> At 2016-02-17 12:01:37, "Atin Mukherjee" <amukherj@xxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>> On 02/17/2016 08:23 AM, songxin wrote:
>>>>>> Hi,
>>>>>> Thank you for your immediate and detailed reply. And I have a few more
>>>>>> questions about glusterfs.
>>>>>> A node IP is 128.224.162.163.
>>>>>> B node IP is 128.224.162.250.
>>>>>> 1. After rebooting B node and starting the glusterd service, the glusterd log is as below.
>>>>>> ...
>>>>>> [2015-12-07 07:54:55.743966] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
>>>>>> [2015-12-07 07:54:55.744026] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
>>>>>> [2015-12-07 07:54:55.744280] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30706
>>>>>> [2015-12-07 07:54:55.773606] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: b6efd8fc-5eab-49d4-a537-2750de644a44
>>>>>> [2015-12-07 07:54:55.777994] E [MSGID: 101076] [common-utils.c:2954:gf_get_hostname_from_ip] 0-common-utils: Could not lookup hostname of 128.224.162.163 : Temporary failure in name resolution
>>>>>> [2015-12-07 07:54:55.778290] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums gv0 differ. local cksum = 2492237955, remote cksum = 4087388312 on peer 128.224.162.163
>>>>> The above log entry is the reason for the rejection of the peer; most
>>>>> probably it is due to a compatibility issue. I believe the gluster
>>>>> versions are different on the two nodes (share the gluster versions from
>>>>> both nodes) and you might have hit a bug.
>>>>>
>>>>> Can you share the delta of the /var/lib/glusterd/vols/gv0/info file from
>>>>> both the nodes?
>>>>>
>>>>>
>>>>> ~Atin
>>>>>> [2015-12-07 07:54:55.778384] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 128.224.162.163 (0), ret: 0
>>>>>> [2015-12-07 07:54:55.928774] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: b6efd8fc-5eab-49d4-a537-2750de644a44, host: 128.224.162.163, port: 0
>>>>>> ...
>>>>>> When I run gluster peer status on B node it shows as below.
>>>>>> Number of Peers: 1
>>>>>>
>>>>>> Hostname: 128.224.162.163
>>>>>> Uuid: b6efd8fc-5eab-49d4-a537-2750de644a44
>>>>>> State: Peer Rejected (Connected)
>>>>>>
>>>>>> When I run "gluster volume status" on A node it shows as below.
>>>>>>
>>>>>> Status of volume: gv0
>>>>>> Gluster process                                                TCP Port  RDMA Port  Online  Pid
>>>>>> ------------------------------------------------------------------------------
>>>>>> Brick 128.224.162.163:/home/wrsadmin/work/tmp/data/brick/gv0  49152     0          Y       13019
>>>>>> NFS Server on localhost                                        N/A       N/A        N       N/A
>>>>>> Self-heal Daemon on localhost                                  N/A       N/A        Y       13045
>>>>>>
>>>>>> Task Status of Volume gv0
>>>>>> ------------------------------------------------------------------------------
>>>>>> There are no active volume tasks
>>>>>>
>>>>>> It looks like the glusterfsd service is OK on A node.
>>>>>>
>>>>>> Is it because the peer state is Rejected that glusterd didn't start the
>>>>>> glusterfsd? What causes this problem?
>>>>>>
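Since the peer rejection is driven by the cksum comparison during the handshake, a quick first check is to confirm that both nodes really run the same gluster build and to look at the peer and volume view from each side. A minimal sketch, to be run on both A node and B node (gv0 is the volume from this thread):

    gluster --version
    gluster peer status
    gluster volume info gv0

If the versions match, the remaining difference is in the volume metadata itself, which is what the info/cksum comparison sketched earlier is meant to narrow down.
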
>>>>>> 2. Is glustershd (the self-heal daemon) the process as below?
>>>>>> root 497 0.8 0.0 432520 18104 ? Ssl 08:07 0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/gluster ..
>>>>>>
>>>>>> If it is, I want to know if glustershd is also the glusterfsd binary,
>>>>>> just like glusterd and glusterfs.
>>>>>>
>>>>>> Thanks,
>>>>>> Xin
>>>>>>
>>>>>>
>>>>>> At 2016-02-16 18:53:03, "Anuradha Talur" <atalur@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "songxin" <songxin_1980@xxxxxxx>
>>>>>>>> To: gluster-users@xxxxxxxxxxx
>>>>>>>> Sent: Tuesday, February 16, 2016 3:59:50 PM
>>>>>>>> Subject: question about sync replicate volume after rebooting one node
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I have a question about how to sync the volume between two bricks after
>>>>>>>> one node is rebooted.
>>>>>>>>
>>>>>>>> There are two nodes, A node and B node. A node's IP is 128.124.10.1 and
>>>>>>>> B node's IP is 128.124.10.2.
>>>>>>>>
>>>>>>>> Operation steps on A node as below:
>>>>>>>> 1. gluster peer probe 128.124.10.2
>>>>>>>> 2. mkdir -p /data/brick/gv0
>>>>>>>> 3. gluster volume create gv0 replica 2 128.124.10.1:/data/brick/gv0 128.124.10.2:/data/brick/gv1 force
>>>>>>>> 4. gluster volume start gv0
>>>>>>>> 5. mount -t glusterfs 128.124.10.1:/gv0 gluster
>>>>>>>>
>>>>>>>> Operation steps on B node as below:
>>>>>>>> 1. mkdir -p /data/brick/gv0
>>>>>>>> 2. mount -t glusterfs 128.124.10.1:/gv0 gluster
>>>>>>>>
>>>>>>>> After all the steps above, there are some gluster service processes,
>>>>>>>> including glusterd, glusterfs and glusterfsd, running on both A and B
>>>>>>>> nodes. I can see these services with ps aux | grep gluster and with
>>>>>>>> gluster volume status.
>>>>>>>>
>>>>>>>> Now reboot the B node. After B reboots, there are no gluster services
>>>>>>>> running on B node.
>>>>>>>> After I run systemctl start glusterd, there is just the glusterd service
>>>>>>>> but not glusterfs and glusterfsd on B node.
>>>>>>>> Because glusterfs and glusterfsd are not running, I can't run gluster
>>>>>>>> volume heal gv0 full.
>>>>>>>>
>>>>>>>> I want to know why glusterd doesn't start glusterfs and glusterfsd.
>>>>>>>
>>>>>>> On starting glusterd, glusterfsd should have started by itself.
>>>>>>> Could you share the glusterd and brick logs (on node B) so that we know
>>>>>>> why glusterfsd didn't start?
>>>>>>>
>>>>>>> Do you still see the glusterfsd service running on node A? You can try
>>>>>>> running "gluster v start <VOLNAME> force" on one of the nodes and check
>>>>>>> whether all the brick processes started.
>>>>>>>
>>>>>>> gluster volume status <VOLNAME> should be able to provide you with the
>>>>>>> gluster process status.
>>>>>>>
>>>>>>> On restarting the node, the glusterfs process for the mount won't start
>>>>>>> by itself. You will have to run step 2 on node B again for it.
>>>>>>>
>>>>>>>> How do I restart these services on B node?
>>>>>>>> How do I sync the replicate volume after one node reboots?
>>>>>>>
>>>>>>> Once the glusterfsd process starts on node B too, glustershd -- the
>>>>>>> self-heal daemon -- for the replicate volume should start
>>>>>>> healing/syncing files that need to be synced. This daemon does periodic
>>>>>>> syncing of files.
>>>>>>>
>>>>>>> If you want to trigger a heal explicitly, you can run gluster volume
>>>>>>> heal <VOLNAME> on one of the servers.
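For example, once the brick processes are up on both nodes, the heal can be triggered and monitored from either server with something like the following (gv0 is the volume name used in this thread):

    gluster volume heal gv0          # heal the entries already marked as needing heal
    gluster volume heal gv0 full     # crawl the whole volume, as attempted earlier in the thread
    gluster volume heal gv0 info     # list the entries still pending heal
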
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Xin
>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> Anuradha.
>>>
>>> --
>>> Thanks,
>>> Anuradha.

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users