Hello, it looks like the full heal fixed the problem; I was just impatient :)
[2017-11-16 15:04:34.102010] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-data01-replicate-0: performing metadata selfheal on 9612ecd2-106d-42f2-95eb-fef495c1d8ab
[2017-11-16 15:04:34.186781] I [MSGID: 108026] [afr-self-heal-common.c:1255:afr_log_selfheal] 0-data01-replicate-0: Completed metadata selfheal on 9612ecd2-106d-42f2-95eb-fef495c1d8ab. sources=[1] sinks=0
[2017-11-16 15:04:38.776070] I [MSGID: 108026] [afr-self-heal-common.c:1255:afr_log_selfheal] 0-data01-replicate-0: Completed data selfheal on 7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54. sources=[1] sinks=0
[2017-11-16 15:04:38.811744] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-data01-replicate-0: performing metadata selfheal on 7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54
[2017-11-16 15:04:38.867474] I [MSGID: 108026] [afr-self-heal-common.c:1255:afr_log_selfheal] 0-data01-replicate-0: Completed metadata selfheal on 7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54. sources=[1] sinks=0
# gluster volume heal data01 info
Brick 192.168.186.11:/mnt/AIDATA/data
Status: Connected
Number of entries: 0
Brick 192.168.186.12:/mnt/AIDATA/data
Status: Connected
Number of entries: 0
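(Side note in case it helps anyone following along later: while a heal is running, the per-brick pending counts can also be watched without listing every entry. This is only a sketch using the volume name from this thread:

# gluster volume heal data01 statistics heal-count

It should report zero entries for both bricks once everything is healed, matching the heal info output above.)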
Thank you for your fast response!
On Thu, Nov 16, 2017 at 10:13 AM, Frederic Harmignies <frederic.harmignies@xxxxxxxxxxxxx> wrote:
Hello, we are using glusterfs 3.10.3. We currently have a 'gluster volume heal data01 full' running; the crawl is still in progress:

Starting time of crawl: Tue Nov 14 15:58:35 2017
Crawl is in progress
Type of crawl: FULL
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 0

getfattr from both files:

# getfattr -d -m . -e hex /mnt/AIDATA/data//ishmaelb/experiments/omie/omieali/cifar10/donsker_grad_reg_ali_dcgan_stat_dcgan_ac_True/omieali_cifar10_zdim_100_enc_dcgan_dec_dcgan_stat_dcgan_posterior_propagated_enc_beta1.0_dec_beta_1.0_info_metric_donsker_varadhan_info_lam_0.334726025306_222219-23_10_17/data/data_gen_iter_86000.pkl
getfattr: Removing leading '/' from absolute path names
# file: mnt/AIDATA/data//ishmaelb/experiments/omie/omieali/cifar10/donsker_grad_reg_ali_dcgan_stat_dcgan_ac_True/omieali_cifar10_zdim_100_enc_dcgan_dec_dcgan_stat_dcgan_posterior_propagated_enc_beta1.0_dec_beta_1.0_info_metric_donsker_varadhan_info_lam_0.334726025306_222219-23_10_17/data/data_gen_iter_86000.pkl
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.data01-client-0=0x000000000000000100000000
trusted.gfid=0x7e8513f4d4e24e66b0ba2dbe4c803c54

# getfattr -d -m . -e hex /mnt/AIDATA/data/home/allac/experiments/171023_105655_mini_imagenet_projection_size_mixing_depth_num_filters_filter_size_block_depth_Explore\ architecture\ capacity/Explore\ architecture\ capacity\(projection_size\=32\;mixing_depth\=0\;num_filters\=64\;filter_size\=3\;block_depth\=3\)/model.ckpt-70001.data-00000-of-00001.tempstate1629411508065733704
getfattr: Removing leading '/' from absolute path names
# file: mnt/AIDATA/data/home/allac/experiments/171023_105655_mini_imagenet_projection_size_mixing_depth_num_filters_filter_size_block_depth_Explore architecture capacity/Explore architecture capacity(projection_size=32;mixing_depth=0;num_filters=64;filter_size=3;block_depth=3)/model.ckpt-70001.data-00000-of-00001.tempstate1629411508065733704
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.data01-client-0=0x000000000000000000000000
trusted.bit-rot.version=0x02000000000000005979d278000af1e7
trusted.gfid=0x9612ecd2106d42f295ebfef495c1d8ab

# gluster volume heal data01
Launching heal operation to perform index self heal on volume data01 has been successful
Use heal info commands to check status

# cat /var/log/glusterfs/glustershd.log
[2017-11-12 08:39:01.907287] I [glusterfsd-mgmt.c:1789:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2017-11-15 08:18:02.084766] I [MSGID: 100011] [glusterfsd.c:1414:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2017-11-15 08:18:02.085718] I [glusterfsd-mgmt.c:1789:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2017-11-15 19:13:42.005307] W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data01-client-0: remote operation failed. Path: <gfid:7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54> (7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54) [No such file or directory]
The message "W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data01-client-0: remote operation failed. Path: <gfid:7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54> (7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54) [No such file or directory]" repeated 5 times between [2017-11-15 19:13:42.005307] and [2017-11-15 19:13:42.166579]
[2017-11-15 19:23:43.041956] W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data01-client-0: remote operation failed. Path: <gfid:7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54> (7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54) [No such file or directory]
The message "W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data01-client-0: remote operation failed. Path: <gfid:7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54> (7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54) [No such file or directory]" repeated 5 times between [2017-11-15 19:23:43.041956] and [2017-11-15 19:23:43.235831]
[2017-11-15 19:30:22.726808] W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data01-client-0: remote operation failed. Path: <gfid:7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54> (7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54) [No such file or directory]
The message "W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data01-client-0: remote operation failed. Path: <gfid:7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54> (7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54) [No such file or directory]" repeated 4 times between [2017-11-15 19:30:22.726808] and [2017-11-15 19:30:22.827631]
[2017-11-16 15:04:34.102010] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-data01-replicate-0: performing metadata selfheal on 9612ecd2-106d-42f2-95eb-fef495c1d8ab
[2017-11-16 15:04:34.186781] I [MSGID: 108026] [afr-self-heal-common.c:1255:afr_log_selfheal] 0-data01-replicate-0: Completed metadata selfheal on 9612ecd2-106d-42f2-95eb-fef495c1d8ab. sources=[1] sinks=0
[2017-11-16 15:04:38.776070] I [MSGID: 108026] [afr-self-heal-common.c:1255:afr_log_selfheal] 0-data01-replicate-0: Completed data selfheal on 7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54. sources=[1] sinks=0
[2017-11-16 15:04:38.811744] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-data01-replicate-0: performing metadata selfheal on 7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54
[2017-11-16 15:04:38.867474] I [MSGID: 108026] [afr-self-heal-common.c:1255:afr_log_selfheal] 0-data01-replicate-0: Completed metadata selfheal on 7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54. sources=[1] sinks=0

On Thu, Nov 16, 2017 at 7:14 AM, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:
No, an explicit lookup should have healed the file on the missing brick, unless the lookup did not hit AFR and was served from the caching translators.
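(A note on reading the trusted.afr values in the getfattr output above, for anyone who finds this thread later: assuming the usual AFR changelog layout, the 24 hex digits split into three 32-bit counters of pending operations against the named client, in the order data / metadata / entry. So for the first file:

trusted.afr.data01-client-0 = 0x 00000000 00000001 00000000
                                  data=0   metadata=1  entry=0

i.e. one metadata operation is recorded as pending against data01-client-0 (Brick1), while the all-zero value on the second file means nothing was marked pending for it.)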
On 11/16/2017 04:12 PM, Nithya Balachandran wrote:
On 15 November 2017 at 19:57, Frederic Harmignies <frederic.harmignies@elementai.com> wrote:
Hello, we have 2 files that are missing from one of the bricks and no idea how to fix this.
Details:
# gluster volume info

Volume Name: data01
Type: Replicate
Volume ID: 39b4479c-31f0-4696-9435-5454e4f8d310
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 192.168.186.11:/mnt/AIDATA/data
Brick2: 192.168.186.12:/mnt/AIDATA/data
Options Reconfigured:
performance.cache-refresh-timeout: 30
client.event-threads: 16
server.event-threads: 32
performance.readdir-ahead: off
performance.io-thread-count: 32
performance.cache-size: 32GB
transport.address-family: inet
nfs.disable: on
features.trash: off
features.trash-max-filesize: 500MB
# gluster volume heal data01 info
Brick 192.168.186.11:/mnt/AIDATA/data
Status: Connected
Number of entries: 0

Brick 192.168.186.12:/mnt/AIDATA/data
<gfid:7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54>
<gfid:9612ecd2-106d-42f2-95eb-fef495c1d8ab>
Status: Connected
Number of entries: 2

# gluster volume heal data01 info split-brain
Brick 192.168.186.11:/mnt/AIDATA/data
Status: Connected
Number of entries in split-brain: 0

Brick 192.168.186.12:/mnt/AIDATA/data
Status: Connected
Number of entries in split-brain: 0
Both files are missing from the folder on Brick1, and the gfid files are also missing from the .glusterfs directory on that same Brick1. Brick2 has both the files and their gfid files in .glusterfs.
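(For anyone wanting to check the same thing on their own bricks, a rough sketch assuming the standard .glusterfs/<first two hex digits>/<next two>/<gfid> layout and the brick paths from this thread; the gfid entry for a regular file is a hard link to the real file, so it can also be mapped back to a path:

# ls -l /mnt/AIDATA/data/.glusterfs/7e/85/7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54
# find /mnt/AIDATA/data -samefile /mnt/AIDATA/data/.glusterfs/7e/85/7e8513f4-d4e2-4e66-b0ba-2dbe4c803c54

On the brick that is missing the file, both the real path and the .glusterfs entry are simply absent.)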
We already tried:
# gluster volume heal data01 full
Running a stat and an ls -l on both files from a mounted client to try and trigger a heal
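(The progress of that full crawl can be followed with the statistics sub-command, which prints the crawl start time, type and entry counts; just a sketch using this volume name:

# gluster volume heal data01 statistics
)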
Would a rebalance fix this? Any guidance would be greatly appreciated!
A rebalance would not help here as this is a replicate volume. Ravi, any idea what could be going wrong here?
Frederic, what version of gluster are you running? Can you launch 'gluster volume heal <volname>' and check the glustershd logs for possible warnings? Use the DEBUG client-log-level if you have to. Also, instead of stat, try a getfattr on the file from the mount.
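(For reference, a rough sketch of that suggestion; the mount point below is hypothetical, not something taken from this thread:

# gluster volume set data01 diagnostics.client-log-level DEBUG
# getfattr -d -m . -e hex /mnt/data01/<path-to-affected-file>
# gluster volume set data01 diagnostics.client-log-level INFO

The getfattr forces an explicit lookup of the file through the mount, and the client log level can be reverted to INFO once the client/glustershd logs have been captured.)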
-Ravi
Regards,
Nithya
Thank you in advance!
--
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users