Rebalance state stuck or corrupted

We have had a rebalance operation running for a few days. After a couple of days, the rebalance status showed "failed". We stopped the rebalance by running gluster volume rebalance gv0 stop, and the rebalance log indicates that gluster did try to stop it. However, when we now try to stop the volume or restart the rebalance, gluster says a rebalance operation is in progress and the volume can't be stopped. I tried restarting all the glusterfs-server services (we're using Gluster 3.8.15 on Ubuntu), but that did not help.

user@gfs-vm000:~$ sudo gluster volume stop gv0
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: gv0: failed: Staging failed on gfs-vm001. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm017. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm011. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm006. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm003. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm004. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on 10.0.13.9. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm014. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm013. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm002. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm016. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm007. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm010. Error: rebalance session is in progress for the volume 'gv0'
user@gfs-vm000:~$ sudo gluster volume rebalance gv0 stop
volume rebalance: gv0: failed: Rebalance not started.
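
To see at a glance which peers still believe a rebalance session is active, the "Staging failed" lines above can be collected with a small script. This is only a sketch: the function name peers_blocking_stop is made up for illustration, the pattern is assumed from the exact message wording shown in this output, and the sample is a shortened copy of it.

```python
import re

# Matches lines such as:
#   Staging failed on gfs-vm001. Error: rebalance session is in progress for the volume 'gv0'
# The wording is taken from the CLI output above; other versions may phrase it differently.
STAGING_RE = re.compile(
    r"Staging failed on (?P<peer>\S+)\. Error: rebalance session is in progress "
    r"for the volume '(?P<volume>[^']+)'"
)


def peers_blocking_stop(cli_output):
    """Return the sorted peers that reported a rebalance session in progress."""
    peers = set()
    for match in STAGING_RE.finditer(cli_output):
        peers.add(match.group("peer"))
    return sorted(peers)


# Shortened copy of the output above, used only as sample input.
sample = """\
volume stop: gv0: failed: Staging failed on gfs-vm001. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on gfs-vm017. Error: rebalance session is in progress for the volume 'gv0'
Staging failed on 10.0.13.9. Error: rebalance session is in progress for the volume 'gv0'
"""

if __name__ == "__main__":
    for peer in peers_blocking_stop(sample):
        print(peer)
```

Feeding it the full output of the volume stop command would list every peer whose state disagrees with the node that answered "Rebalance not started".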

Tail of gv0-rebalance.log:

[2018-05-23 17:32:55.262168] I [MSGID: 109029] [dht-rebalance.c:4260:gf_defrag_stop] 0-: Received stop command on rebalance
[2018-05-23 17:32:55.262221] I [MSGID: 109028] [dht-rebalance.c:4079:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 749380.00 secs
[2018-05-23 17:32:55.262234] I [MSGID: 109028] [dht-rebalance.c:4083:gf_defrag_status_get] 0-glusterfs: Files migrated: 821417, size: 25797609415002, lookups: 1162021, failures: 0, skipped: 1814
[2018-05-23 17:32:55.777149] I [MSGID: 109022] [dht-rebalance.c:1703:dht_migrate_file] 0-gv0-dht: completed migration of /pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl/model.ckpt-50724.data-00002-of-00003 from subvolume gv0-replicate-0 to gv0-replicate-3
[2018-05-23 17:32:55.782048] W [dht-rebalance.c:2826:gf_defrag_process_dir] 0-gv0-dht: Found error from gf_defrag_get_entry
[2018-05-23 17:32:55.782358] E [MSGID: 109111] [dht-rebalance.c:3123:gf_defrag_fix_layout] 0-gv0-dht: gf_defrag_process_dir failed for directory: /pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl
[2018-05-23 17:32:56.115106] E [MSGID: 109016] [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed for /pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl
[2018-05-23 17:32:56.115586] E [MSGID: 109016] [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed for /pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill
[2018-05-23 17:32:56.115849] E [MSGID: 109016] [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed for /pnrsy/v-zhli2/generated/ende_with_teacher/model
[2018-05-23 17:32:56.116141] E [MSGID: 109016] [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed for /pnrsy/v-zhli2/generated/ende_with_teacher
[2018-05-23 17:32:56.116237] E [MSGID: 109016] [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed for /pnrsy/v-zhli2/generated
[2018-05-23 17:32:56.116393] E [MSGID: 109016] [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed for /pnrsy/v-zhli2
[2018-05-23 17:32:56.116625] E [MSGID: 109016] [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed for /pnrsy
[2018-05-23 17:32:56.129836] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 7
[2018-05-23 17:32:56.130072] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 8
[2018-05-23 17:32:56.130567] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 9
[2018-05-23 17:32:56.131273] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 10
[2018-05-23 17:32:56.131492] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 11
[2018-05-23 17:32:56.131578] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 12
[2018-05-23 17:33:09.164419] I [MSGID: 109022] [dht-rebalance.c:1703:dht_migrate_file] 0-gv0-dht: completed migration of /pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl/model.ckpt-142510.data-00002-of-00003 from subvolume gv0-replicate-0 to gv0-replicate-5
[2018-05-23 17:33:09.386106] I [MSGID: 109022] [dht-rebalance.c:1703:dht_migrate_file] 0-gv0-dht: completed migration of /pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl/model.ckpt-344803.data-00002-of-00003 from subvolume gv0-replicate-0 to gv0-replicate-2
[2018-05-23 17:33:12.463711] I [MSGID: 109022] [dht-rebalance.c:1703:dht_migrate_file] 0-gv0-dht: completed migration of /pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl/model.ckpt-217794.data-00002-of-00003 from subvolume gv0-replicate-0 to gv0-replicate-1
[2018-05-23 17:33:21.525221] I [MSGID: 109022] [dht-rebalance.c:1703:dht_migrate_file] 0-gv0-dht: completed migration of /pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl/model.ckpt-198211.data-00002-of-00003 from subvolume gv0-replicate-0 to gv0-replicate-3
[2018-05-23 17:33:28.644220] I [MSGID: 109022] [dht-rebalance.c:1703:dht_migrate_file] 0-gv0-dht: completed migration of /pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl/model.ckpt-44350.data-00002-of-00003 from subvolume gv0-replicate-0 to gv0-replicate-3
[2018-05-23 17:33:28.647136] I [MSGID: 109028] [dht-rebalance.c:4079:gf_defrag_status_get] 0-gv0-dht: Rebalance is failed. Time taken is 749413.00 secs
[2018-05-23 17:33:28.647162] I [MSGID: 109028] [dht-rebalance.c:4083:gf_defrag_status_get] 0-gv0-dht: Files migrated: 821423, size: 25803971060106, lookups: 1162021, failures: 9, skipped: 1814
[2018-05-23 17:33:28.660680] W [glusterfsd.c:1327:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7ff2df1e46ba] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x55c8f9c89545] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x55c8f9c893b4] ) 0-: received signum (15), shutting down
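
The two status reports in the log above (one right after the stop command, one just before the process exits) can be compared with a short script. This is a rough sketch that assumes the gf_defrag_status_get message format shown here; the function name summarize and the field names are made up for illustration.

```python
import re

# Patterns assumed from the gf_defrag_status_get lines in the log above.
STATUS_RE = re.compile(
    r"Rebalance is (?P<state>\w+)\. Time taken is (?P<secs>[\d.]+) secs"
)
COUNTS_RE = re.compile(
    r"Files migrated: (?P<migrated>\d+), size: (?P<size>\d+), "
    r"lookups: (?P<lookups>\d+), failures: (?P<failures>\d+), "
    r"skipped: (?P<skipped>\d+)"
)


def summarize(lines):
    """Pair each 'Rebalance is ...' line with the counters line that follows it."""
    reports = []
    current = None
    for line in lines:
        status = STATUS_RE.search(line)
        if status:
            current = {
                "state": status.group("state"),
                "days": float(status.group("secs")) / 86400,
            }
            reports.append(current)
            continue
        counts = COUNTS_RE.search(line)
        if counts and current is not None:
            current.update({k: int(v) for k, v in counts.groupdict().items()})
    return reports


# The four status lines copied from the log above.
log_lines = [
    "[2018-05-23 17:32:55.262221] I [MSGID: 109028] [dht-rebalance.c:4079:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 749380.00 secs",
    "[2018-05-23 17:32:55.262234] I [MSGID: 109028] [dht-rebalance.c:4083:gf_defrag_status_get] 0-glusterfs: Files migrated: 821417, size: 25797609415002, lookups: 1162021, failures: 0, skipped: 1814",
    "[2018-05-23 17:33:28.647136] I [MSGID: 109028] [dht-rebalance.c:4079:gf_defrag_status_get] 0-gv0-dht: Rebalance is failed. Time taken is 749413.00 secs",
    "[2018-05-23 17:33:28.647162] I [MSGID: 109028] [dht-rebalance.c:4083:gf_defrag_status_get] 0-gv0-dht: Files migrated: 821423, size: 25803971060106, lookups: 1162021, failures: 9, skipped: 1814",
]

if __name__ == "__main__":
    for report in summarize(log_lines):
        print(report)
```

On these lines the state goes from "stopped" to "failed" within about 33 seconds, the migrated count rises by 6 (matching the six "completed migration" entries logged after the stop command), and failures go from 0 to 9.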
