Hi Mauro,
I looked into the getxattr output you provided and found that the versions on the root of the bricks are all fine.
I would recommend waiting for the rebalance to complete.
Please keep posting the output of the following:
1 - rebalance status
2 - gluster volume status
3 - gluster v heal <volname> info
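The three outputs requested above could be collected into one timestamped file per run with a small helper. This is only a sketch: the function names `status_commands` and `collect` and the report file name are hypothetical, and it assumes the `gluster` CLI is on the PATH of the node where it runs.

```python
import subprocess
from datetime import datetime
from pathlib import Path


def status_commands(volname):
    """Return the three status commands requested above as argv lists."""
    return [
        ["gluster", "volume", "rebalance", volname, "status"],
        ["gluster", "volume", "status", volname],
        ["gluster", "volume", "heal", volname, "info"],
    ]


def collect(volname, outdir="."):
    """Run each command and append its output to one timestamped report file."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    report = Path(outdir) / f"{volname}-status-{stamp}.txt"  # hypothetical name
    with report.open("w") as out:
        for argv in status_commands(volname):
            out.write("$ " + " ".join(argv) + "\n")
            result = subprocess.run(argv, capture_output=True, text=True)
            out.write(result.stdout + result.stderr + "\n")
    return report
```

Calling `collect("tier2")` on one of the servers would write a single report that can be attached to the next reply.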
Are you able to access the files/dirs from the mount point?
Let's try to track down the issues one by one.
---
Ashish
From: "Mauro Tridici" <mauro.tridici@xxxxxxx>
To: "Ashish Pandey" <aspandey@xxxxxxxxxx>
Cc: "gluster-users" <gluster-users@xxxxxxxxxxx>
Sent: Monday, October 8, 2018 3:33:21 PM
Subject: Re: Rebalance failed on Distributed Disperse volume based on 3.12.14 version
Hi Ashish,
the rebalance is still running. So far it has moved about 49.3 TB of an estimated 78 TB.
The initial amount of data stored on the s01, s02 and s03 servers was about 156 TB, so I think that half of it (78 TB) should be moved to the three new servers (s04, s05 and s06).
[root@s01 status]# gluster volume rebalance tier2 status
Node        Rebalanced-files   size      scanned   failures   skipped   status        run time in h:m:s
---------   ----------------   -------   -------   --------   -------   -----------   -----------------
localhost   553955             20.3TB    2356786   0          61942     in progress   57:42:14
s02-stg     293175             14.8TB    976960    0          30489     in progress   57:42:15
s03-stg     293758             14.2TB    990518    0          30464     in progress   57:42:15
s04-stg     0                  0Bytes    0         0          0         failed        0:00:37
s05-stg     0                  0Bytes    0         0          0         completed     48:33:03
s06-stg     0                  0Bytes    0         0          0         completed     48:33:02
Estimated time left for rebalance to complete : 981:23:02
volume rebalance: tier2: success
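As a cross-check of the figures in this message, the per-node sizes in the status table can be summed and compared with the 78 TB estimate. The following is a minimal sketch, not a Gluster tool: the helper names are hypothetical, and it assumes the CLI prints sizes with 1024-based units.

```python
import re

# Data rows of the status table posted in this message.
STATUS = """\
localhost 553955 20.3TB 2356786 0 61942 in progress 57:42:14
s02-stg 293175 14.8TB 976960 0 30489 in progress 57:42:15
s03-stg 293758 14.2TB 990518 0 30464 in progress 57:42:15
s04-stg 0 0Bytes 0 0 0 failed 0:00:37
s05-stg 0 0Bytes 0 0 0 completed 48:33:03
s06-stg 0 0Bytes 0 0 0 completed 48:33:02
"""

# Assumption: sizes use binary (1024-based) units.
UNIT_POWER = {"Bytes": 0, "KB": 1, "MB": 2, "GB": 3, "TB": 4, "PB": 5}


def size_to_tb(text):
    """Convert a size such as '20.3TB' or '0Bytes' to terabytes."""
    match = re.fullmatch(r"([0-9.]+)(Bytes|KB|MB|GB|TB|PB)", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value * 1024 ** (UNIT_POWER[unit] - UNIT_POWER["TB"])


def parse_rows(table):
    """Split each row on whitespace; the status may span two words."""
    rows = []
    for line in table.splitlines():
        tokens = line.split()
        if len(tokens) < 8:
            continue  # not a data row
        rows.append({
            "node": tokens[0],
            "files": int(tokens[1]),
            "size_tb": size_to_tb(tokens[2]),
            "status": " ".join(tokens[6:-1]),
            "run_time": tokens[-1],
        })
    return rows


rows = parse_rows(STATUS)
moved_tb = sum(row["size_tb"] for row in rows)
estimated_tb = 156 / 2  # the estimate given in this message
percent = 100 * moved_tb / estimated_tb
print(f"moved {moved_tb:.1f} TB of {estimated_tb:.0f} TB ({percent:.1f}%)")
```

The three source nodes add up to the 49.3 TB quoted above, which is a little under two thirds of the estimate.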
Attached you will find the requested outputs.
Thank you,
Mauro
On 8 Oct 2018, at 11:44, Ashish Pandey <aspandey@xxxxxxxxxx> wrote:
Hi Mauro,
What is the status of the rebalance now?
Could you please give the output of the following for all the bricks -
getfattr -m. -d -e hex <root path of the brick>
You have to go to every node and run the above command for each brick on that node.
Example: on s01
getfattr -m. -d -e hex /gluster/mnt1/brick
Keep the output from each node in one file so that it will be easy to analyze.
---
Ashish
From: "Mauro Tridici" <mauro.tridici@xxxxxxx>
To: "Nithya Balachandran" <nbalacha@xxxxxxxxxx>
Cc: "gluster-users" <gluster-users@xxxxxxxxxxx>
Sent: Monday, October 8, 2018 2:27:35 PM
Subject: Re: Rebalance failed on Distributed Disperse volume based on 3.12.14 version
Hi Nithya,
thank you, my answers are inline.
On 8 Oct 2018, at 10:43, Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:
Hi Mauro,
Yes, a rebalance consists of 2 operations for every directory:
- Fix the layout for the new volume config (newly added or removed bricks)
- Migrate files to their new hashed subvols based on the new layout
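The two operations can be illustrated with a toy model. This is a sketch under stated assumptions, not Gluster's implementation: it splits a 32-bit hash space into equal contiguous ranges, uses `zlib.crc32` as a stand-in for the real DHT hash, and orders the new subvolumes so that each old one keeps the first half of its old range. The `anomalies` helper is a simplified reading of the `Holes=... overlaps=...` counters that `dht_layout_normalize` reports in the s04 log quoted in this thread. All function names are hypothetical.

```python
import zlib

HASH_SPACE = 2 ** 32  # toy 32-bit hash space (assumption)


def fix_layout(subvols):
    """Step 1: give each subvolume one equal, contiguous hash range."""
    count = len(subvols)
    return [
        (i * HASH_SPACE // count, (i + 1) * HASH_SPACE // count - 1, name)
        for i, name in enumerate(subvols)
    ]


def anomalies(layout):
    """Count (holes, overlaps) in a layout, walking ranges by start."""
    holes = overlaps = 0
    expected = 0
    for start, end, _name in sorted(layout):
        if start > expected:
            holes += 1
        elif start < expected:
            overlaps += 1
        expected = max(expected, end + 1)
    if expected < HASH_SPACE:
        holes += 1  # range missing at the end of the hash space
    return holes, overlaps


def subvol_for_hash(layout, hash_value):
    """Return the subvolume whose range contains hash_value."""
    for start, end, name in layout:
        if start <= hash_value <= end:
            return name
    raise LookupError("hash falls in a hole of the layout")


def needs_migration(old_layout, new_layout, filename):
    """Step 2: a file moves if its hashed subvolume changed."""
    hash_value = zlib.crc32(filename.encode())  # stand-in, not Gluster's hash
    return (subvol_for_hash(old_layout, hash_value)
            != subvol_for_hash(new_layout, hash_value))


old_layout = fix_layout([f"disperse-{i}" for i in range(6)])
# Interleave old and new subvolumes: disperse-0, disperse-6, disperse-1, ...
new_order = [f"disperse-{i + 6 * half}" for i in range(6) for half in (0, 1)]
new_layout = fix_layout(new_order)
```

In this toy model a file whose hash lands in the second half of an old range moves to one of the new subvolumes, while the rest stay where they are.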
Are you running a rebalance because you added new bricks to the volume? As per an earlier email you have already run a fix-layout.
Yes, we added new bricks to the volume and we already executed fix-layout before.
On s04, please check the rebalance log file to see why the rebalance failed.
On s04, rebalance failed after the following errors (before these lines no errors were found):
[2018-10-06 00:13:37.359634] I [MSGID: 109063] [dht-layout.c:716:dht_layout_normalize] 0-tier2-dht: Found anomalies in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=2 overlaps=0
[2018-10-06 00:13:37.362424] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-7: Failed to get size and version [Input/output error]
[2018-10-06 00:13:37.362504] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-7: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:37.362525] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-7: Failed to update version and size [Input/output error]
[2018-10-06 00:13:37.363105] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-8: Failed to get size and version [Input/output error]
[2018-10-06 00:13:37.363163] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-8: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:37.363180] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-8: Failed to update version and size [Input/output error]
[2018-10-06 00:13:37.364920] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-11: Failed to get size and version [Input/output error]
[2018-10-06 00:13:37.364969] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-11: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:37.364985] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-11: Failed to update version and size [Input/output error]
[2018-10-06 00:13:37.366864] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-6: Failed to get size and version [Input/output error]
[2018-10-06 00:13:37.366912] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-6: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:37.366926] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-6: Failed to update version and size [Input/output error]
[2018-10-06 00:13:37.374818] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-9: Failed to get size and version [Input/output error]
[2018-10-06 00:13:37.374866] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:37.374879] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-9: Failed to update version and size [Input/output error]
[2018-10-06 00:13:37.406076] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-10: Failed to get size and version [Input/output error]
[2018-10-06 00:13:37.406145] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-10: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:37.406183] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-10: Failed to update version and size [Input/output error]
[2018-10-06 00:13:51.039835] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-11: Failed to get size and version [Input/output error]
[2018-10-06 00:13:51.039911] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-11, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:13:51.039944] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-11: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:51.039958] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-11: Failed to update version and size [Input/output error]
[2018-10-06 00:13:51.040441] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-7: Failed to get size and version [Input/output error]
[2018-10-06 00:13:51.040480] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-7, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:13:51.040518] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-7: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:51.040534] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-7: Failed to update version and size [Input/output error]
[2018-10-06 00:13:51.061789] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-9: Failed to get size and version [Input/output error]
[2018-10-06 00:13:51.061830] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-9, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:13:51.061859] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:51.061873] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-9: Failed to update version and size [Input/output error]
[2018-10-06 00:13:51.062283] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-8: Failed to get size and version [Input/output error]
[2018-10-06 00:13:51.062323] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-8, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:13:51.062353] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-8: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:51.062367] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-8: Failed to update version and size [Input/output error]
[2018-10-06 00:13:51.064613] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-6: Failed to get size and version [Input/output error]
[2018-10-06 00:13:51.064655] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-6, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:13:51.064685] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-6: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:51.064700] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-6: Failed to update version and size [Input/output error]
[2018-10-06 00:13:51.064727] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-10: Failed to get size and version [Input/output error]
[2018-10-06 00:13:51.064766] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-10, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:13:51.064794] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-10: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:51.064815] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-10: Failed to update version and size [Input/output error]
[2018-10-06 00:13:53.695948] I [dht-rebalance.c:4512:gf_defrag_start_crawl] 0-tier2-dht: gf_defrag_start_crawl using commit hash 3720343841
[2018-10-06 00:13:53.696837] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-11: Failed to get size and version [Input/output error]
[2018-10-06 00:13:53.696906] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-11: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:53.696924] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-11: Failed to update version and size [Input/output error]
[2018-10-06 00:13:53.697549] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-7: Failed to get size and version [Input/output error]
[2018-10-06 00:13:53.697599] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-7: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:53.697620] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-7: Failed to update version and size [Input/output error]
[2018-10-06 00:13:53.704120] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-8: Failed to get size and version [Input/output error]
[2018-10-06 00:13:53.704262] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-8: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:53.704342] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-8: Failed to update version and size [Input/output error]
[2018-10-06 00:13:53.707260] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-10: Failed to get size and version [Input/output error]
[2018-10-06 00:13:53.707312] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-10: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:53.707329] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-10: Failed to update version and size [Input/output error]
[2018-10-06 00:13:53.718301] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-6: Failed to get size and version [Input/output error]
[2018-10-06 00:13:53.718350] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-6: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:53.718367] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-6: Failed to update version and size [Input/output error]
[2018-10-06 00:13:55.626130] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-9: Failed to get size and version [Input/output error]
[2018-10-06 00:13:55.626207] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:55.626228] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-9: Failed to update version and size [Input/output error]
[2018-10-06 00:13:55.626231] I [MSGID: 109081] [dht-common.c:4379:dht_setxattr] 0-tier2-dht: fixing the layout of /
[2018-10-06 00:13:55.862374] I [dht-rebalance.c:5063:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 0 seconds, seconds left = 0
[2018-10-06 00:13:55.862440] I [MSGID: 109028] [dht-rebalance.c:5143:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 20.00 secs
[2018-10-06 00:13:55.862460] I [MSGID: 109028] [dht-rebalance.c:5147:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
[2018-10-06 00:14:12.476927] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-11: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.477020] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-11, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:14:12.477077] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-11: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:14:12.477094] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-11: Failed to update version and size [Input/output error]
[2018-10-06 00:14:12.477644] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-7: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.477695] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-7, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:14:12.477726] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-7: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:14:12.477740] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-7: Failed to update version and size [Input/output error]
[2018-10-06 00:14:12.477853] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-8: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.477894] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-8, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:14:12.477923] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-8: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:14:12.477937] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-8: Failed to update version and size [Input/output error]
[2018-10-06 00:14:12.486862] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-6: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.486902] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-6, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:14:12.486929] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-6: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:14:12.486944] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-6: Failed to update version and size [Input/output error]
[2018-10-06 00:14:12.493872] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-10: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.493912] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-10, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:14:12.493939] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-10: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:14:12.493954] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-10: Failed to update version and size [Input/output error]
[2018-10-06 00:14:12.494560] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-9: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.494598] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-9, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:14:12.494624] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:14:12.494640] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-9: Failed to update version and size [Input/output error]
[2018-10-06 00:14:12.795320] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-7: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.795366] E [MSGID: 109039] [dht-common.c:3113:dht_find_local_subvol_cbk] 0-tier2-dht: getxattr err for dir [Input/output error]
[2018-10-06 00:14:12.795796] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-11: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.795834] E [MSGID: 109039] [dht-common.c:3113:dht_find_local_subvol_cbk] 0-tier2-dht: getxattr err for dir [Input/output error]
[2018-10-06 00:14:12.804770] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-6: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.804803] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-10: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.804811] E [MSGID: 109039] [dht-common.c:3113:dht_find_local_subvol_cbk] 0-tier2-dht: getxattr err for dir [Input/output error]
[2018-10-06 00:14:12.804850] E [MSGID: 109039] [dht-common.c:3113:dht_find_local_subvol_cbk] 0-tier2-dht: getxattr err for dir [Input/output error]
[2018-10-06 00:14:12.808500] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-8: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.808563] E [MSGID: 109039] [dht-common.c:3113:dht_find_local_subvol_cbk] 0-tier2-dht: getxattr err for dir [Input/output error]
[2018-10-06 00:14:12.812431] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-9: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.812468] E [MSGID: 109039] [dht-common.c:3113:dht_find_local_subvol_cbk] 0-tier2-dht: getxattr err for dir [Input/output error]
[2018-10-06 00:14:12.812497] E [MSGID: 0] [dht-rebalance.c:4336:dht_get_local_subvols_and_nodeuuids] 0-tier2-dht: local subvolume determination failed with error: 5 [Input/output error]
[2018-10-06 00:14:12.812700] I [MSGID: 109028] [dht-rebalance.c:5143:gf_defrag_status_get] 0-tier2-dht: Rebalance is failed. Time taken is 37.00 secs
[2018-10-06 00:14:12.812720] I [MSGID: 109028] [dht-rebalance.c:5147:gf_defrag_status_get] 0-tier2-dht: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
[2018-10-06 00:14:12.812870] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25) [0x7efe75d18e25] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x5623973d64b5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5623973d632b] ) 0-: received signum (15), shutting down
Regards,
Mauro
Regards,
Nithya
On 8 October 2018 at 13:22, Mauro Tridici <mauro.tridici@xxxxxxx> wrote:
Hi All,
for your information, this is the current rebalance status:
[root@s01 ~]# gluster volume rebalance tier2 status
Node        Rebalanced-files   size      scanned   failures   skipped   status        run time in h:m:s
---------   ----------------   -------   -------   --------   -------   -----------   -----------------
localhost   551922             20.3TB    2349397   0          61849     in progress   55:25:38
s02-stg     287631             13.2TB    959954    0          30262     in progress   55:25:39
s03-stg     288523             12.7TB    973111    0          30220     in progress   55:25:39
s04-stg     0                  0Bytes    0         0          0         failed        0:00:37
s05-stg     0                  0Bytes    0         0          0         completed     48:33:03
s06-stg     0                  0Bytes    0         0          0         completed     48:33:02
Estimated time left for rebalance to complete : 1023:49:56
volume rebalance: tier2: success
Rebalance is migrating files on s05, s06 servers and on s04 too (although it is marked
as failed).
s05 and s06 tasks are completed.
Questions:
1) It seems that the rebalance is moving files but is also fixing the layout. Is that normal?
2) When the rebalance completes, what do we need to do before returning the Gluster storage to the users? Do we have to launch the rebalance again in order to involve the s04 server too, or run a fix-layout to fix any errors on s04?
Thank you very much,
Mauro
On 7 Oct 2018, at 10:29, Mauro Tridici <mauro.tridici@xxxxxxx> wrote:
<tier2-rebalance.log.gz>
Hi All,
some important updates about the issue mentioned below.
After the rebalance failed on all the servers, I decided to:
- stop gluster volume
- reboot the servers
- start gluster volume
- change some gluster volume options
- start the rebalance again
The options that I changed, after reading some threads on the gluster-users mailing list, are listed below:
BEFORE CHANGE:
gluster volume set tier2 network.ping-timeout 02
gluster volume set all cluster.brick-multiplex on
gluster volume set tier2 cluster.server-quorum-ratio 51%
gluster volume set tier2 cluster.server-quorum-type server
gluster volume set tier2 cluster.quorum-type auto
AFTER CHANGE:
gluster volume set tier2 network.ping-timeout 42
gluster volume set all cluster.brick-multiplex off
gluster volume set tier2 cluster.server-quorum-ratio none
gluster volume set tier2 cluster.server-quorum-type none
gluster volume set tier2 cluster.quorum-type none
The result was that the rebalance started moving data from the s01, s02 and s03 servers to the s05 and s06 servers (the newly added ones), but it failed on the s04 server after 37 seconds.
The rebalance is still running and moving data, as you can see from the output:
[root@s01 ~]# gluster volume rebalance tier2 status
Node        Rebalanced-files   size      scanned   failures   skipped   status        run time in h:m:s
---------   ----------------   -------   -------   --------   -------   -----------   -----------------
localhost   286680             12.6TB    1217960   0          43343     in progress   32:10:24
s02-stg     126291             12.4TB    413077    0          21932     in progress   32:10:25
s03-stg     126516             11.9TB    433014    0          21870     in progress   32:10:25
s04-stg     0                  0Bytes    0         0          0         failed        0:00:37
s05-stg     0                  0Bytes    0         0          0         in progress   32:10:25
s06-stg     0                  0Bytes    0         0          0         in progress   32:10:25
Estimated time left for rebalance to complete : 624:47:48
volume rebalance: tier2: success
When the rebalance completes, we are planning to relaunch it to try to involve the s04 server as well.
Do you have any idea what happened in my previous message and why the rebalance is now running although it does not involve the s04 server?
Attached is the complete tier2-rebalance.log file from the s04 server.
Thank you very much for your help,
Mauro
On 6 Oct 2018, at 02:01, Mauro Tridici <mauro.tridici@xxxxxxx> wrote:
<rebalance_log.txt>
Hi All,
since we need to restore the Gluster storage as soon as possible, we decided to ignore the few files that could be lost and to go ahead.
So we cleaned the content of all bricks on servers s04, s05 and s06.
As planned some days ago, we executed the following commands:
gluster peer detach s04
gluster peer detach s05
gluster peer detach s06
gluster peer probe s04
gluster peer probe s05
gluster peer probe s06
gluster volume add-brick tier2 \
  s04-stg:/gluster/mnt1/brick s05-stg:/gluster/mnt1/brick s06-stg:/gluster/mnt1/brick \
  s04-stg:/gluster/mnt2/brick s05-stg:/gluster/mnt2/brick s06-stg:/gluster/mnt2/brick \
  s04-stg:/gluster/mnt3/brick s05-stg:/gluster/mnt3/brick s06-stg:/gluster/mnt3/brick \
  s04-stg:/gluster/mnt4/brick s05-stg:/gluster/mnt4/brick s06-stg:/gluster/mnt4/brick \
  s04-stg:/gluster/mnt5/brick s05-stg:/gluster/mnt5/brick s06-stg:/gluster/mnt5/brick \
  s04-stg:/gluster/mnt6/brick s05-stg:/gluster/mnt6/brick s06-stg:/gluster/mnt6/brick \
  s04-stg:/gluster/mnt7/brick s05-stg:/gluster/mnt7/brick s06-stg:/gluster/mnt7/brick \
  s04-stg:/gluster/mnt8/brick s05-stg:/gluster/mnt8/brick s06-stg:/gluster/mnt8/brick \
  s04-stg:/gluster/mnt9/brick s05-stg:/gluster/mnt9/brick s06-stg:/gluster/mnt9/brick \
  s04-stg:/gluster/mnt10/brick s05-stg:/gluster/mnt10/brick s06-stg:/gluster/mnt10/brick \
  s04-stg:/gluster/mnt11/brick s05-stg:/gluster/mnt11/brick s06-stg:/gluster/mnt11/brick \
  s04-stg:/gluster/mnt12/brick s05-stg:/gluster/mnt12/brick s06-stg:/gluster/mnt12/brick \
  force
gluster volume rebalance tier2 fix-layout start
Everything seemed to be fine and the fix-layout completed.
[root@s01 ~]# gluster volume rebalance tier2 status
Node        status                 run time in h:m:s
---------   --------------------   -----------------
localhost   fix-layout completed   12:11:6
s02-stg     fix-layout completed   12:11:18
s03-stg     fix-layout completed   12:11:12
s04-stg     fix-layout completed   12:11:20
s05-stg     fix-layout completed   12:11:14
s06-stg     fix-layout completed   12:10:47
volume rebalance: tier2: success
[root@s01 ~]# gluster volume info
Volume Name: tier2
Type: Distributed-Disperse
Volume ID: a28d88c5-3295-4e35-98d4-210b3af9358c
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x (4 + 2) = 72
Transport-type: tcp
Bricks:
Brick1: s01-stg:/gluster/mnt1/brick
Brick2: s02-stg:/gluster/mnt1/brick
Brick3: s03-stg:/gluster/mnt1/brick
Brick4: s01-stg:/gluster/mnt2/brick
Brick5: s02-stg:/gluster/mnt2/brick
Brick6: s03-stg:/gluster/mnt2/brick
Brick7: s01-stg:/gluster/mnt3/brick
Brick8: s02-stg:/gluster/mnt3/brick
Brick9: s03-stg:/gluster/mnt3/brick
Brick10: s01-stg:/gluster/mnt4/brick
Brick11: s02-stg:/gluster/mnt4/brick
Brick12: s03-stg:/gluster/mnt4/brick
Brick13: s01-stg:/gluster/mnt5/brick
Brick14: s02-stg:/gluster/mnt5/brick
Brick15: s03-stg:/gluster/mnt5/brick
Brick16: s01-stg:/gluster/mnt6/brick
Brick17: s02-stg:/gluster/mnt6/brick
Brick18: s03-stg:/gluster/mnt6/brick
Brick19: s01-stg:/gluster/mnt7/brick
Brick20: s02-stg:/gluster/mnt7/brick
Brick21: s03-stg:/gluster/mnt7/brick
Brick22: s01-stg:/gluster/mnt8/brick
Brick23: s02-stg:/gluster/mnt8/brick
Brick24: s03-stg:/gluster/mnt8/brick
Brick25: s01-stg:/gluster/mnt9/brick
Brick26: s02-stg:/gluster/mnt9/brick
Brick27: s03-stg:/gluster/mnt9/brick
Brick28: s01-stg:/gluster/mnt10/brick
Brick29: s02-stg:/gluster/mnt10/brick
Brick30: s03-stg:/gluster/mnt10/brick
Brick31: s01-stg:/gluster/mnt11/brick
Brick32: s02-stg:/gluster/mnt11/brick
Brick33: s03-stg:/gluster/mnt11/brick
Brick34: s01-stg:/gluster/mnt12/brick
Brick35: s02-stg:/gluster/mnt12/brick
Brick36: s03-stg:/gluster/mnt12/brick
Brick37: s04-stg:/gluster/mnt1/brick
Brick38: s05-stg:/gluster/mnt1/brick
Brick39: s06-stg:/gluster/mnt1/brick
Brick40: s04-stg:/gluster/mnt2/brick
Brick41: s05-stg:/gluster/mnt2/brick
Brick42: s06-stg:/gluster/mnt2/brick
Brick43: s04-stg:/gluster/mnt3/brick
Brick44: s05-stg:/gluster/mnt3/brick
Brick45: s06-stg:/gluster/mnt3/brick
Brick46: s04-stg:/gluster/mnt4/brick
Brick47: s05-stg:/gluster/mnt4/brick
Brick48: s06-stg:/gluster/mnt4/brick
Brick49: s04-stg:/gluster/mnt5/brick
Brick50: s05-stg:/gluster/mnt5/brick
Brick51: s06-stg:/gluster/mnt5/brick
Brick52: s04-stg:/gluster/mnt6/brick
Brick53: s05-stg:/gluster/mnt6/brick
Brick54: s06-stg:/gluster/mnt6/brick
Brick55: s04-stg:/gluster/mnt7/brick
Brick56: s05-stg:/gluster/mnt7/brick
Brick57: s06-stg:/gluster/mnt7/brick
Brick58: s04-stg:/gluster/mnt8/brick
Brick59: s05-stg:/gluster/mnt8/brick
Brick60: s06-stg:/gluster/mnt8/brick
Brick61: s04-stg:/gluster/mnt9/brick
Brick62: s05-stg:/gluster/mnt9/brick
Brick63: s06-stg:/gluster/mnt9/brick
Brick64: s04-stg:/gluster/mnt10/brick
Brick65: s05-stg:/gluster/mnt10/brick
Brick66: s06-stg:/gluster/mnt10/brick
Brick67: s04-stg:/gluster/mnt11/brick
Brick68: s05-stg:/gluster/mnt11/brick
Brick69: s06-stg:/gluster/mnt11/brick
Brick70: s04-stg:/gluster/mnt12/brick
Brick71: s05-stg:/gluster/mnt12/brick
Brick72: s06-stg:/gluster/mnt12/brick
Options Reconfigured:
network.ping-timeout: 42
features.scrub: Active
features.bitrot: on
features.inode-quota: on
features.quota: on
performance.client-io-threads: on
cluster.min-free-disk: 10
cluster.quorum-type: none
transport.address-family: inet
nfs.disable: on
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.readdir-ahead: on
performance.parallel-readdir: off
cluster.readdir-optimize: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
performance.io-cache: off
disperse.cpu-extensions: auto
performance.io-thread-count: 16
features.quota-deem-statfs: on
features.default-soft-limit: 90
cluster.server-quorum-type: none
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
cluster.brick-multiplex: off
cluster.server-quorum-ratio: 51%
The last step should have been the data rebalance between the servers, but the rebalance failed almost immediately with a lot of errors like the following ones:
[2018-10-05 23:48:38.644978] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-tier2-client-70: Server lk version = 1
[2018-10-05 23:48:44.735323] I [dht-rebalance.c:4512:gf_defrag_start_crawl] 0-tier2-dht: gf_defrag_start_crawl using commit hash 3720331860
[2018-10-05 23:48:44.736205] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-7: Failed to get size and version [Input/output error]
[2018-10-05 23:48:44.736266] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-7: Insufficient available children for this request (have 0, need 4)
[2018-10-05 23:48:44.736282] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-7: Failed to update version and size [Input/output error]
[2018-10-05 23:48:44.736377] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-8: Failed to get size and version [Input/output error]
[2018-10-05 23:48:44.736436] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-8: Insufficient available children for this request (have 0, need 4)
[2018-10-05 23:48:44.736459] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-8: Failed to update version and size [Input/output error]
[2018-10-05 23:48:44.736460] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-10: Failed to get size and version [Input/output error]
[2018-10-05 23:48:44.736537] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-9: Failed to get size and version [Input/output error]
[2018-10-05 23:48:44.736571] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-10: Insufficient available children for this request (have 0, need 4)
[2018-10-05 23:48:44.736574] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4)
[2018-10-05 23:48:44.736604] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-9: Failed to update version and size [Input/output error]
[2018-10-05 23:48:44.736604] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-10: Failed to update version and size [Input/output error]
[2018-10-05 23:48:44.736827] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-11: Failed to get size and version [Input/output error]
[2018-10-05 23:48:44.736887] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-11: Insufficient available children for this request (have 0, need 4)
[2018-10-05 23:48:44.736904] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-11: Failed to update version and size [Input/output error]
[2018-10-05 23:48:44.740337] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-6: Failed to get size and version [Input/output error]
[2018-10-05 23:48:44.740381] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-6: Insufficient available children for this request (have 0, need 4)
[2018-10-05 23:48:44.740394] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-6: Failed to update version and size [Input/output error]
[2018-10-05 23:48:50.066103] I [MSGID: 109081] [dht-common.c:4379:dht_setxattr] 0-tier2-dht: fixing the layout of /
Attached you can find the first logs captured during the rebalance execution.
In your opinion, is
there a way to restore the gluster storage or all the data have been lost?Thank you in advance,MauroIl giorno 04 ott 2018, alle ore 15:31, Mauro Tridici <mauro.tridici@xxxxxxx> ha scritto:Hi Nithya,thank you very much.This is the current “gluster volume info” output after removing bricks (and after peer detach command).[root@s01 ~]# gluster volume infoVolume Name: tier2Type: Distributed-DisperseVolume ID: a28d88c5-3295-4e35-98d4-210b3af9358cStatus: StartedSnapshot Count: 0Number of Bricks: 6 x (4 + 2) = 36Transport-type: tcpBricks:Brick1: s01-stg:/gluster/mnt1/brickBrick2: s02-stg:/gluster/mnt1/brickBrick3: s03-stg:/gluster/mnt1/brickBrick4: s01-stg:/gluster/mnt2/brickBrick5: s02-stg:/gluster/mnt2/brickBrick6: s03-stg:/gluster/mnt2/brickBrick7: s01-stg:/gluster/mnt3/brickBrick8: s02-stg:/gluster/mnt3/brickBrick9: s03-stg:/gluster/mnt3/brickBrick10: s01-stg:/gluster/mnt4/brickBrick11: s02-stg:/gluster/mnt4/brickBrick12: s03-stg:/gluster/mnt4/brickBrick13: s01-stg:/gluster/mnt5/brickBrick14: s02-stg:/gluster/mnt5/brickBrick15: s03-stg:/gluster/mnt5/brickBrick16: s01-stg:/gluster/mnt6/brickBrick17: s02-stg:/gluster/mnt6/brickBrick18: s03-stg:/gluster/mnt6/brickBrick19: s01-stg:/gluster/mnt7/brickBrick20: s02-stg:/gluster/mnt7/brickBrick21: s03-stg:/gluster/mnt7/brickBrick22: s01-stg:/gluster/mnt8/brickBrick23: s02-stg:/gluster/mnt8/brickBrick24: s03-stg:/gluster/mnt8/brickBrick25: s01-stg:/gluster/mnt9/brickBrick26: s02-stg:/gluster/mnt9/brickBrick27: s03-stg:/gluster/mnt9/brickBrick28: s01-stg:/gluster/mnt10/brickBrick29: s02-stg:/gluster/mnt10/brickBrick30: s03-stg:/gluster/mnt10/brickBrick31: s01-stg:/gluster/mnt11/brickBrick32: s02-stg:/gluster/mnt11/brickBrick33: s03-stg:/gluster/mnt11/brickBrick34: s01-stg:/gluster/mnt12/brickBrick35: s02-stg:/gluster/mnt12/brickBrick36: s03-stg:/gluster/mnt12/brickOptions Reconfigured:network.ping-timeout: 0features.scrub: Activefeatures.bitrot: onfeatures.inode-quota: onfeatures.quota: onperformance.client-io-threads: 
oncluster.min-free-disk: 10cluster.quorum-type: autotransport.address-family: inetnfs.disable: onserver.event-threads: 4client.event-threads: 4cluster.lookup-optimize: onperformance.readdir-ahead: onperformance.parallel-readdir: offcluster.readdir-optimize: onfeatures.cache-invalidation: onfeatures.cache-invalidation-timeout: 600performance.stat-prefetch: onperformance.cache-invalidation: onperformance.md-cache-timeout: 600network.inode-lru-limit: 50000performance.io-cache: offdisperse.cpu-extensions: autoperformance.io-thread-count: 16features.quota-deem-statfs: onfeatures.default-soft-limit: 90cluster.server-quorum-type: serverdiagnostics.latency-measurement: ondiagnostics.count-fop-hits: oncluster.brick-multiplex: oncluster.server-quorum-ratio: 51%Regards,MauroIl giorno 04 ott 2018, alle ore 15:22, Nithya Balachandran <nbalacha@xxxxxxxxxx> ha scritto:Hi Mauro,The files on s04 and s05 can be deleted safely as long as those bricks have been removed from the volume and their brick processes are not running..glusterfs/indices/xattrop/xattrop-* are links to files that need to be healed. .glusterfs/quarantine/stub-00000000-0000-0000-0000-000000000008 links to files that bitrot (if enabled)says are corrupted. (none in this case)
I will get back to you on s06. Can you please provide the output of gluster volume info again?
Regards,
Nithya

On 4 October 2018 at 13:47, Mauro Tridici <mauro.tridici@xxxxxxx> wrote:

Dear Ashish, Dear Nithya,

I’m writing this message only to summarize and simplify the information about the “not migrated” files left on the removed bricks on servers s04, s05 and s06.
Attached you will find 3 files (one per server) containing the lists of “not migrated” files and the related brick numbers.

In particular:
- the s04 and s05 bricks contain only not migrated files in the hidden directories “/gluster/mnt#/brick/.glusterfs” (I can delete them, can’t I?)
- the s06 bricks contain:
  - not migrated files in the hidden directories “/gluster/mnt#/brick/.glusterfs”;
  - not migrated files with size equal to 0;
  - not migrated files with size greater than 0.

I thought it was necessary to collect and summarize this information to simplify your analysis.

Thank you very much,
Mauro
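
For anyone wanting to rebuild the same kind of summary, the three categories listed above for s06 could be produced with a small script along these lines. This is only a sketch, not the method actually used in this thread: the function name classify_leftovers and the category labels are made up for illustration, and it assumes the brick has already been removed from the volume and its brick process is stopped, so that walking the directory is harmless.

```python
import os
from collections import defaultdict


def classify_leftovers(brick_root):
    """Group files left under a removed brick into three categories.

    "internal": anything below the hidden .glusterfs directory
    "empty":    regular user-visible files with size equal to 0
    "data":     regular user-visible files with size greater than 0
    (Category names are illustrative, not Gluster terminology.)
    """
    groups = defaultdict(list)
    internal_prefix = ".glusterfs" + os.sep
    for dirpath, _dirnames, filenames in os.walk(brick_root):
        rel_dir = os.path.relpath(dirpath, brick_root)
        in_glusterfs = rel_dir == ".glusterfs" or rel_dir.startswith(internal_prefix)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if in_glusterfs:
                groups["internal"].append(path)
            elif os.lstat(path).st_size == 0:  # lstat: do not follow symlinks
                groups["empty"].append(path)
            else:
                groups["data"].append(path)
    return groups


# Example call (the path follows the brick layout quoted in this thread):
# groups = classify_leftovers("/gluster/mnt1/brick")
# for category, paths in groups.items():
#     print(category, len(paths))
```

Only the files in the last category ("data") hold actual content that may still need to be copied back through the mount point; the script just lists them and does not modify anything.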
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users