Hello, we are using glusterfs 3.10.3.
We currently have a 'gluster volume heal data01 full' running; the crawl is still in progress:
Starting time of crawl: Tue Nov 14 15:58:35 2017
Crawl is in progress
Type of crawl: FULL
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 0
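(For reference, the crawl status above is the output of the heal statistics command, which we re-run to watch the crawl:
# gluster volume heal data01 statistics)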
getfattr output for both files:
# getfattr -d -m . -e hex /mnt/AIDATA/data//ishmaelb/experiments/omie/omieali/cifar10/donsker_grad_reg_ali_dcgan_stat_dcgan_ac_True/omieali_cifar10_zdim_100_enc_dcgan_dec_dcgan_stat_dcgan_posterior_propagated_enc_beta1.0_dec_beta_1.0_info_metric_donsker_varadhan_info_lam_0.334726025306_222219-23_10_17/data/data_gen_iter_86000.pkl
getfattr: Removing leading '/' from absolute path names
# file: mnt/AIDATA/data//ishmaelb/experiments/omie/omieali/cifar10/donsker_grad_reg_ali_dcgan_stat_dcgan_ac_True/omieali_cifar10_zdim_100_enc_dcgan_dec_dcgan_stat_dcgan_posterior_propagated_enc_beta1.0_dec_beta_1.0_info_metric_donsker_varadhan_info_lam_0.334726025306_222219-23_10_17/data/data_gen_iter_86000.pkl
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.data01-client-0=0x000000000000000100000000
trusted.gfid=0x7e8513f4d4e24e66b0ba2dbe4c803c54
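(Our reading of the trusted.afr.* changelog value, in case it helps: the 24 hex digits are three 32-bit counters for pending data, metadata and entry operations, so 0x00000000 00000001 00000000 above would mean one pending metadata operation blamed on data01-client-0, i.e. this copy accusing Brick1. Please correct us if we are decoding it wrong.)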
# getfattr -d -m . -e hex /mnt/AIDATA/data/home/allac/experiments/171023_105655_mini_imagenet_projection_size_mixing_depth_num_filters_filter_size_block_depth_Explore\ architecture\ capacity/Explore\ architecture\ capacity\(projection_size\=32\;mixing_depth\=0\;num_filters\=64\;filter_size\=3\;block_depth\=3\)/model.ckpt-70001.data-00000-of-00001.tempstate1629411508065733704
getfattr: Removing leading '/' from absolute path names
# file: mnt/AIDATA/data/home/allac/experiments/171023_105655_mini_imagenet_projection_size_mixing_depth_num_filters_filter_size_block_depth_Explore architecture capacity/Explore architecture capacity(projection_size=32;mixing_depth=0;num_filters=64;filter_size=3;block_depth=3)/model.ckpt-70001.data-00000-of-00001.tempstate1629411508065733704
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.data01-client-0=0x000000000000000000000000
trusted.bit-rot.version=0x02000000000000005979d278000af1e7
trusted.gfid=0x9612ecd2106d42f295ebfef495c1d8ab
# gluster volume heal data01
Launching heal operation to perform index self heal on volume data01 has been successful
Use heal info commands to check status
# cat /var/log/glusterfs/glustershd.log
[2017-11-12 08:39:01.907287] I [glusterfsd-mgmt.c:1789:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2017-11-15 08:18:02.084766] I [MSGID: 100011] [glusterfsd.c:1414:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2017-11-15 08:18:02.085718] I [glusterfsd-mgmt.c:1789:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2017-11-15 19:13:42.005307] W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data01-client-0: remote operation failed. Path: <gfid:7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54> (7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54) [No such file or directory]
The message "W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data01-client-0: remote operation failed. Path: <gfid:7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54> (7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54) [No such file or directory]" repeated 5 times between [2017-11-15 19:13:42.005307] and [2017-11-15 19:13:42.166579]
[2017-11-15 19:23:43.041956] W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data01-client-0: remote operation failed. Path: <gfid:7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54> (7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54) [No such file or directory]
The message "W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data01-client-0: remote operation failed. Path: <gfid:7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54> (7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54) [No such file or directory]" repeated 5 times between [2017-11-15 19:23:43.041956] and [2017-11-15 19:23:43.235831]
[2017-11-15 19:30:22.726808] W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data01-client-0: remote operation failed. Path: <gfid:7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54> (7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54) [No such file or directory]
The message "W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data01-client-0: remote operation failed. Path: <gfid:7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54> (7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54) [No such file or directory]" repeated 4 times between [2017-11-15 19:30:22.726808] and [2017-11-15 19:30:22.827631]
[2017-11-16 15:04:34.102010] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-data01-replicate-0: performing metadata selfheal on 9612ecd2-106d-42f2-95eb-fef495c1d8ab
[2017-11-16 15:04:34.186781] I [MSGID: 108026] [afr-self-heal-common.c:1255:afr_log_selfheal] 0-data01-replicate-0: Completed metadata selfheal on 9612ecd2-106d-42f2-95eb-fef495c1d8ab. sources=[1] sinks=0
[2017-11-16 15:04:38.776070] I [MSGID: 108026] [afr-self-heal-common.c:1255:afr_log_selfheal] 0-data01-replicate-0: Completed data selfheal on 7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54. sources=[1] sinks=0
[2017-11-16 15:04:38.811744] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-data01-replicate-0: performing metadata selfheal on 7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54
[2017-11-16 15:04:38.867474] I [MSGID: 108026] [afr-self-heal-common.c:1255:afr_log_selfheal] 0-data01-replicate-0: Completed metadata selfheal on 7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54. sources=[1] sinks=0
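(Assuming those selfheal messages mean both gfids were picked up, we will re-run the following and confirm the entry count on 192.168.186.12 drops back to 0:
# gluster volume heal data01 info)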
On Thu, Nov 16, 2017 at 7:14 AM, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:
No, an explicit lookup should have healed the file on the missing brick, unless the lookup did not hit AFR and was served from the caching translators.
On 11/16/2017 04:12 PM, Nithya Balachandran wrote:
On 15 November 2017 at 19:57, Frederic Harmignies <frederic.harmignies@elementai.com> wrote:
Hello, we have two files that are missing from one of the bricks. No idea how to fix this.
Details:
# gluster volume info

Volume Name: data01
Type: Replicate
Volume ID: 39b4479c-31f0-4696-9435-5454e4f8d310
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 192.168.186.11:/mnt/AIDATA/data
Brick2: 192.168.186.12:/mnt/AIDATA/data
Options Reconfigured:
performance.cache-refresh-timeout: 30
client.event-threads: 16
server.event-threads: 32
performance.readdir-ahead: off
performance.io-thread-count: 32
performance.cache-size: 32GB
transport.address-family: inet
nfs.disable: on
features.trash: off
features.trash-max-filesize: 500MB
# gluster volume heal data01 info
Brick 192.168.186.11:/mnt/AIDATA/data
Status: Connected
Number of entries: 0

Brick 192.168.186.12:/mnt/AIDATA/data
<gfid:7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54>
<gfid:9612ecd2-106d-42f2-95eb-fef495c1d8ab>
Status: Connected
Number of entries: 2
# gluster volume heal data01 info split-brain
Brick 192.168.186.11:/mnt/AIDATA/data
Status: Connected
Number of entries in split-brain: 0

Brick 192.168.186.12:/mnt/AIDATA/data
Status: Connected
Number of entries in split-brain: 0
Both files are missing from the folder on Brick1, and the corresponding gfid files are also missing from the .glusterfs folder on that same Brick1. Brick2 has both files and their gfid files in .glusterfs.
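(For reference, this is how we looked for the gfid hard links on each brick, using the .glusterfs/<first-two-hex>/<next-two-hex>/<gfid> layout as we understand it:
# ls -l /mnt/AIDATA/data/.glusterfs/7e/85/7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54
# ls -l /mnt/AIDATA/data/.glusterfs/96/12/9612ecd2-106d-42f2-95eb-fef495c1d8ab)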
We already tried:
# gluster volume heal data01 full
Running a stat and ls -l on both files from a mounted client to try and trigger a heal.
Would a re-balance fix this? Any guidance would be greatly appreciated!
A rebalance would not help here as this is a replicate volume. Ravi, any idea what could be going wrong here?
Frederic, what version of gluster are you running? Can you launch 'gluster volume heal <volname>' and check the glustershd logs for possible warnings? Use the DEBUG client-log-level if you have to. Also, instead of stat, try a getfattr on the file from the mount.
-Ravi
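(Following Ravi's suggestion above, something along these lines should do it; the client mount path below is just a placeholder for wherever the volume is mounted on the client:
# gluster volume set data01 diagnostics.client-log-level DEBUG
# getfattr -d -m . -e hex /path/to/client/mount/ishmaelb/experiments/.../data_gen_iter_86000.pkl
# gluster volume set data01 diagnostics.client-log-level INFO
The getfattr from the fuse mount should force a lookup that reaches AFR rather than being answered from a cache, per Ravi's note above.)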
Regards,
Nithya
Thank you in advance!
--
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users