Re: Rebalance taking > 2 months

Rusty Bower <rusty@xxxxxxxxxxxxxx> · Sat, 28 Jul 2018 03:41:29 +0200

Just wanted to ping this to see if you guys had any thoughts, or other scripts I can run for this stuff. It's still predicting another 90 days to rebalance this, and performance is basically garbage while it rebalances.
Rusty

On Mon, Jul 23, 2018 at 10:19 AM, Rusty Bower <rusty@xxxxxxxxxxxxxx> wrote:
datanode03 is the newest brick.
The bricks had gotten pretty full, which I think might be part of the issue:
- datanode01 /dev/sda1                 51T   48T  3.3T  94% /mnt/data
- datanode02 /dev/sda1                 51T   48T  3.4T  94% /mnt/data
- datanode03 /dev/md0                 128T  4.6T  123T   4% /mnt/data
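
To put the imbalance in numbers, here is a rough Python sketch using the df figures above. It assumes, purely for illustration, that the goal is an equal fill percentage on every brick; DHT places files by hash range rather than by targeting equal fill, so this only sizes the gap. The names `Brick` and `even_fill_deltas` are made up for this example and are not part of any gluster tooling.

```python
from dataclasses import dataclass


@dataclass
class Brick:
    name: str
    size_tb: float  # "Size" column from df -h, in T
    used_tb: float  # "Used" column from df -h, in T


def even_fill_deltas(bricks):
    """Return {name: change in used space, in T} needed for an equal fill fraction."""
    fill = sum(b.used_tb for b in bricks) / sum(b.size_tb for b in bricks)
    return {b.name: b.size_tb * fill - b.used_tb for b in bricks}


# Values copied from the df -h output quoted above.
BRICKS = [
    Brick("datanode01", 51, 48),
    Brick("datanode02", 51, 48),
    Brick("datanode03", 128, 4.6),
]

if __name__ == "__main__":
    for name, delta in even_fill_deltas(BRICKS).items():
        # Negative means data would leave the brick, positive means it would arrive.
        print(f"{name}: {delta:+.1f}T")
```

Under that assumption, about 51T would have to land on datanode03, with the two older bricks each giving up roughly half of that.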

Each of the bricks is on a completely separate disk from the OS.

I'll shoot you the log files offline :)

Thanks!
Rusty

On Mon, Jul 23, 2018 at 3:12 AM, Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:
Hi Rusty,
Sorry I took so long to get back to you.

Which is the newly added brick? I see datanode02 has not picked up any files for migration, which is odd.
How full are the individual bricks (df -h output)?
Is each of your bricks in a separate partition?
Can you send me the rebalance logs from all 3 nodes (offline if you prefer)?

We can try using scripts to speed up the rebalance if you prefer.

Regards,
Nithya

On 16 July 2018 at 22:06, Rusty Bower <rusty@xxxxxxxxxxxxxx> wrote:
Thanks for the reply Nithya.
1. glusterfs 4.1.1

2. Volume Name: data
Type: Distribute
Volume ID: 294d95ce-0ff3-4df9-bd8c-a52fc50442ba
Status: Started
Snapshot Count: 0
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: datanode01:/mnt/data/bricks/data
Brick2: datanode02:/mnt/data/bricks/data
Brick3: datanode03:/mnt/data/bricks/data
Options Reconfigured:
performance.readdir-ahead: on

3.
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            36822        11.3GB         50715             0             0          in progress       26:46:17
                              datanode02                0        0Bytes          2852             0             0          in progress       26:46:16
                              datanode03             3128       513.7MB         11442             0          3128          in progress       26:46:17
Estimated time left for rebalance to complete : > 2 months. Please try again later.
volume rebalance: data: success
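
As a quick way to compare nodes, the status rows above can be turned into a scan rate per hour. This is only a sketch: the rows are pasted in as a string, the column positions are assumed to match the table shown here, and `hms_to_seconds` and `scan_rates` are hypothetical helper names.

```python
# Rows copied from the rebalance status table above.
STATUS_ROWS = """\
localhost            36822        11.3GB         50715             0             0          in progress       26:46:17
datanode02                0        0Bytes          2852             0             0          in progress       26:46:16
datanode03             3128       513.7MB         11442             0          3128          in progress       26:46:17
"""


def hms_to_seconds(hms):
    """Convert a run time in h:m:s form to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def scan_rates(rows):
    """Return {node: files scanned per hour} for each status row."""
    rates = {}
    for line in rows.strip().splitlines():
        fields = line.split()
        # Field 0 is the node, field 3 the scanned count, the last field the run time.
        # The status text ("in progress") contains a space, so index from the end.
        node, scanned, run_time = fields[0], int(fields[3]), fields[-1]
        rates[node] = scanned * 3600 / hms_to_seconds(run_time)
    return rates


if __name__ == "__main__":
    for node, rate in scan_rates(STATUS_ROWS).items():
        print(f"{node}: {rate:.0f} files scanned per hour")
```

On these numbers localhost scans a little under 1,900 files per hour, while datanode02 manages only about 107.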

4. Directory structure is basically an rsync backup of some old systems as well as all of my personal media. I can elaborate more, but it's a pretty standard filesystem.

5. In some folders there might be up to like 12-15 levels of directories (especially the backups)

6. I'm honestly not sure, I can try to scrounge this number up

7. My guess would be > 100k

8. Most files are pretty large (media files), but there's a lot of small files (metadata and configuration files) as well

I've also appended a (moderately sanitized) snippet of the rebalance log (let me know if you need more):

[2018-07-16 17:37:59.979003] I [MSGID: 0] [dht-rebalance.c:1799:dht_migrate_file] 0-data-dht: destination for file - /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2040036.img.xml is changed to - data-client-2
[2018-07-16 17:38:00.004262] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2112002.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:00.725582] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108305980 tmp_cnt = 55419279917056,rate_processed=446597.869797, elapsed = 96526.000000
[2018-07-16 17:38:00.725641] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124092127 seconds, seconds left = 123995601
[2018-07-16 17:38:00.725709] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96526.00 secs
[2018-07-16 17:38:00.725738] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36876, size: 12270259289, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:02.769121] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108305980 tmp_cnt = 55419279917056,rate_processed=446588.616567, elapsed = 96528.000000
[2018-07-16 17:38:02.769207] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124094698 seconds, seconds left = 123998170
[2018-07-16 17:38:02.769263] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96528.00 secs
[2018-07-16 17:38:02.769286] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36876, size: 12270259289, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:03.410469] I [dht-rebalance.c:1645:dht_migrate_file] 0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201002.img.xml: attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:03.416127] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2040036.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:04.738885] I [dht-rebalance.c:1645:dht_migrate_file] 0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9110012.img.xml: attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:04.745722] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201002.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:04.812368] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108308134 tmp_cnt = 55419279917056,rate_processed=446579.386035, elapsed = 96530.000000
[2018-07-16 17:38:04.812417] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124097263 seconds, seconds left = 124000733
[2018-07-16 17:38:04.812465] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96530.00 secs
[2018-07-16 17:38:04.812489] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36877, size: 12270261443, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:04.992413] I [dht-rebalance.c:1645:dht_migrate_file] 0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2050000.img.xml: attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:04.994122] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9110012.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:06.855618] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108318798 tmp_cnt = 55419279917056,rate_processed=446570.244043, elapsed = 96532.000000
[2018-07-16 17:38:06.855719] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124099804 seconds, seconds left = 124003272
[2018-07-16 17:38:06.855770] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96532.00 secs
[2018-07-16 17:38:06.855793] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36879, size: 12270266602, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:08.511064] I [dht-rebalance.c:1645:dht_migrate_file] 0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201055.img.xml: attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:08.533029] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2050000.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:08.899708] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108318798 tmp_cnt = 55419279917056,rate_processed=446560.991961, elapsed = 96534.000000
[2018-07-16 17:38:08.899791] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124102375 seconds, seconds left = 124005841
[2018-07-16 17:38:08.899842] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96534.00 secs
[2018-07-16 17:38:08.899865] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36879, size: 12270266602, lookups: 50715, failures: 0, skipped: 0
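
The numbers in the "TIME: (size)" lines above are consistent with a plain size-based extrapolation: rate = total_processed / elapsed, estimated total time = tmp_cnt / rate, and seconds left = estimated total minus elapsed. The Python sketch below checks that against the first such log line. It is inferred from the logged values only, not from reading dht-rebalance.c, and `parse_estimate_line` and `extrapolate` are hypothetical names.

```python
import re

# Message part of the first "TIME: (size)" line in the log snippet above.
LOG_LINE = (
    "TIME: (size) total_processed=43108305980 tmp_cnt = 55419279917056,"
    "rate_processed=446597.869797, elapsed = 96526.000000"
)

PATTERN = re.compile(
    r"total_processed=(\d+)\s+tmp_cnt\s*=\s*(\d+),"
    r"\s*rate_processed=([\d.]+),\s*elapsed\s*=\s*([\d.]+)"
)


def parse_estimate_line(line):
    """Return (total_processed, tmp_cnt, rate_processed, elapsed) from a log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a size-based estimate line")
    return (
        int(match.group(1)),
        int(match.group(2)),
        float(match.group(3)),
        float(match.group(4)),
    )


def extrapolate(total_processed, tmp_cnt, elapsed):
    """Return (rate in bytes/s, estimated total seconds, seconds left)."""
    rate = total_processed / elapsed
    total_seconds = tmp_cnt / rate
    return rate, total_seconds, total_seconds - elapsed


if __name__ == "__main__":
    processed, count, logged_rate, elapsed = parse_estimate_line(LOG_LINE)
    rate, total_seconds, left = extrapolate(processed, count, elapsed)
    print(f"rate: {rate:.2f} bytes/s (logged: {logged_rate})")
    print(f"estimated total: {total_seconds:.0f} s, left: {left:.0f} s")
    print(f"estimated total in years: {total_seconds / (365 * 86400):.2f}")
```

At about 447 KB/s, the extrapolation comes to roughly 124 million seconds, close to four years, which matches the "Estimated total time to complete" line that follows it in the log.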

On Mon, Jul 16, 2018 at 7:37 AM, Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:
If possible, please send the rebalance logs as well.

On 16 July 2018 at 10:14, Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:
Hi Rusty,
We need the following information:
1. The exact gluster version you are running
2. gluster volume info <volname>
3. gluster volume rebalance <volname> status
4. Information on the directory structure and file locations on your volume
5. How many levels of directories
6. How many files and directories in each level
7. How many directories and files in total (a rough estimate)
8. Average file size

Please note that having a rebalance running in the background should not affect your volume access in any way. However, I would like to know why only 6000 files have been scanned in 6 hours.

Regards,
Nithya

On 16 July 2018 at 06:13, Rusty Bower <rusty@xxxxxxxxxxxxxx> wrote:

Hey folks,
I just added a new brick to my existing gluster volume, but gluster volume rebalance data status is telling me the following: Estimated time left for rebalance to complete : > 2 months. Please try again later.

I already did a fix-layout, but this thing is absolutely crawling trying to rebalance everything (last estimate was ~40 years)

Any thoughts on if this is a bug, or ways to speed this up? It's taking ~6 hours to scan 6000 files, which seems unreasonably slow.

Thanks
Rusty

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users