We do need to consider this a bug and fix full self-heal to handle the case where it has to look at both bricks to see if any files are missing from either of them. We won't be doing this on the mounts, because it would slow down performance. Be very careful about deleting files directly from the brick, though. It is always recommended to take a backup of the good file before attempting the heal.
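For example, a minimal sketch of taking such a backup, using the brick paths and hostnames from the thread below and assuming the copy on srv02 is the good one (the destination path and checksum step are only illustrative):

# confirm which copy is good, e.g. by comparing checksums across the bricks
md5sum /R1/test01/passwd                       # on srv01
ssh srv02 md5sum /R1/test01/passwd             # on srv02
# keep a copy of the good one outside the brick before removing anything
ssh srv02 cp -a /R1/test01/passwd /root/passwd.backup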
On Wed, Aug 17, 2016 at 4:28 PM, Дмитрий Глушенок <glush@xxxxxxxxxx> wrote:
You are right, stat triggers self-heal. Thank you!

--
Dmitry Glushenok
Jet Infosystems

On Aug 17, 2016, at 13:38, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:

On 08/17/2016 03:48 PM, Дмитрий Глушенок wrote:
Unfortunately not:
Remount the FS, then access the test file from the second client:
[root@srv02 ~]# umount /mnt
[root@srv02 ~]# mount -t glusterfs srv01:/test01 /mnt
[root@srv02 ~]# ls -l /mnt/passwd
-rw-r--r--. 1 root root 1505 Aug 16 19:59 /mnt/passwd
[root@srv02 ~]# ls -l /R1/test01/
total 4
-rw-r--r--. 2 root root 1505 Aug 16 19:59 passwd
[root@srv02 ~]#
Then remount the FS and check whether accessing the file from the second node triggered self-heal on the first node:
[root@srv01 ~]# umount /mnt
[root@srv01 ~]# mount -t glusterfs srv01:/test01 /mnt
[root@srv01 ~]# ls -l /mnt
Can you try `stat /mnt/passwd` from this node after remounting? You need to explicitly look up the file; `ls -l /mnt` only triggers a readdir on the parent directory.
If that doesn't work, is this mount connected to both bricks? I.e. if you create a new file from here, does it get replicated to both bricks?
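A minimal sketch of that check, reusing the mount point and brick paths from this thread (the file name is only an example):

# from the srv01 mount that showed nothing:
touch /mnt/replication-check
ls -l /R1/test01/replication-check            # brick on srv01
ssh srv02 ls -l /R1/test01/replication-check  # brick on srv02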
-Ravi
total 0
[root@srv01 ~]# ls -l /R1/test01/
total 0
[root@srv01 ~]#
Nothing appeared.
[root@srv01 ~]# gluster volume info test01
Volume Name: test01
Type: Replicate
Volume ID: 2c227085-0b06-4804-805c-ea9c1bb11d8b
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: srv01:/R1/test01
Brick2: srv02:/R1/test01
Options Reconfigured:
features.scrub-freq: hourly
features.scrub: Active
features.bitrot: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
[root@srv01 ~]#
[root@srv01 ~]# gluster volume get test01 all | grep heal
cluster.background-self-heal-count      8
cluster.metadata-self-heal              on
cluster.data-self-heal                  on
cluster.entry-self-heal                 on
cluster.self-heal-daemon                on
cluster.heal-timeout                    600
cluster.self-heal-window-size           1
cluster.data-self-heal-algorithm        (null)
cluster.self-heal-readdir-size          1KB
cluster.heal-wait-queue-length          128
features.lock-heal                      off
features.lock-heal                      off
storage.health-check-interval           30
features.ctr_lookupheal_link_timeout    300
features.ctr_lookupheal_inode_timeout   300
cluster.disperse-self-heal-daemon       enable
disperse.background-heals               8
disperse.heal-wait-qlength              128
cluster.heal-timeout                    600
cluster.granular-entry-heal             no
[root@srv01 ~]#
--
Dmitry Glushenok
Jet Infosystems
On Aug 17, 2016, at 11:30, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:
On 08/17/2016 01:48 PM, Дмитрий Глушенок wrote:
Hello Ravi,
Thank you for the reply. I found the bug number (for those who will google this email): https://bugzilla.redhat.com/show_bug.cgi?id=1112158
Accessing the removed file from the mount point does not always work, because we have to find a particular client for which DHT points to the brick with the removed file. Otherwise the file is accessed from the good brick and self-healing does not happen (just verified). Or by accessing did you mean something like touch?
Sorry, I should have been more explicit. I meant triggering a lookup on that file with `stat filename`. I don't think you need a special client. DHT sends the lookup to AFR, which in turn sends it to all its children. When one of them returns ENOENT (because you removed the file from the brick), AFR will automatically trigger a heal. I'm guessing it is not always working in your case due to caching at various levels, with the lookup not reaching AFR. If you do it from a fresh mount, it should always work.
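To illustrate, a small sketch using the volume and file from this thread (/mnt/fresh is just a hypothetical scratch mount point):

mkdir -p /mnt/fresh
mount -t glusterfs srv01:/test01 /mnt/fresh
stat /mnt/fresh/passwd   # explicit lookup; the brick missing the file returns ENOENT and AFR queues a heal
umount /mnt/fresh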
-Ravi
Dmitry Glushenok
Jet Infosystems
On Aug 17, 2016, at 4:24, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:
On 08/16/2016 10:44 PM, Дмитрий Глушенок wrote:
Hello,
While testing healing after a bitrot error, it was found that self-heal cannot heal files which were manually deleted from a brick. Gluster 3.8.1:
- Create a volume, mount it locally and copy a test file to it
[root@srv01 ~]# gluster volume create test01 replica 2 srv01:/R1/test01 srv02:/R1/test01
volume create: test01: success: please start the volume to access data
[root@srv01 ~]# gluster volume start test01
volume start: test01: success
[root@srv01 ~]# mount -t glusterfs srv01:/test01 /mnt
[root@srv01 ~]# cp /etc/passwd /mnt
[root@srv01 ~]# ls -l /mnt
total 2
-rw-r--r--. 1 root root 1505 Aug 16 19:59 passwd
- Then remove the test file from the first brick, as we would have to do in case of a bitrot error in the file
You also need to remove all hard-links to the corrupted file from the brick, including the one in the .glusterfs folder.
There is a bug in heal-full that prevents it from crawling all bricks of the replica. The right way to heal the corrupted files as of now is to access them from the mount-point like you did after removing the hard-links. The list of files that are corrupted can be obtained with the scrub status command.
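As a rough sketch of that procedure on the bad brick (the gfid value and the resulting .glusterfs path below are placeholders; read the real value with getfattr before deleting anything, and keep a backup of the good copy):

# while the bad copy still exists on the brick, read its gfid
getfattr -n trusted.gfid -e hex /R1/test01/passwd
# say it prints trusted.gfid=0xabcd1234...; the .glusterfs hard-link then lives at
#   /R1/test01/.glusterfs/ab/cd/abcd1234-...   (first two bytes give the two directory levels)
rm /R1/test01/passwd
rm /R1/test01/.glusterfs/ab/cd/abcd1234-...    # placeholder path
# list of corrupted files as reported by the scrubber
gluster volume bitrot test01 scrub status
# finally, trigger the heal with a lookup from a client mount
stat /mnt/passwd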
Hope this helps,
Ravi
[root@srv01 ~]# rm /R1/test01/passwd
[root@srv01 ~]# ls -l /mnt
total 0
[root@srv01 ~]#
- Issue a full self-heal
[root@srv01 ~]# gluster volume heal test01 full
Launching heal operation to perform full self heal on volume test01 has been successful
Use heal info commands to check status
[root@srv01 ~]# tail -2 /var/log/glusterfs/glustershd.log
[2016-08-16 16:59:56.483767] I [MSGID: 108026] [afr-self-heald.c:611:afr_shd_full_healer] 0-test01-replicate-0: starting full sweep on subvol test01-client-0
[2016-08-16 16:59:56.486560] I [MSGID: 108026] [afr-self-heald.c:621:afr_shd_full_healer] 0-test01-replicate-0: finished full sweep on subvol test01-client-0
- Now we still see no files in the mount point (it became empty right after removing the file from the brick)
[root@srv01 ~]# ls -l /mnt
total 0
[root@srv01 ~]#
- Then try to access the file by its full name (lookup-optimize and readdir-optimize are turned off by default). Now glusterfs shows the file!
[root@srv01 ~]# ls -l /mnt/passwd
-rw-r--r--. 1 root root 1505 Aug 16 19:59 /mnt/passwd
- And it reappeared in the brick
[root@srv01 ~]# ls -l /R1/test01/
total 4
-rw-r--r--. 2 root root 1505 Aug 16 19:59 passwd
[root@srv01 ~]#
Is this a bug, or can we tell self-heal to scan all files on all bricks in the volume?
--
Dmitry Glushenok
Jet Infosystems
--
Pranith
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users