Re: self heal problem

"Tejas N. Bhise" <tejas@xxxxxxxxxxx> · Wed, 24 Mar 2010 08:46:37 -0600 (CST)

Hi Stephan,

GlusterFS tracks whether an operation was applied to one copy but not
to its replica, for example because a replica was not accessible. The
trusted.afr.remote1 and trusted.afr.remote2 attributes you posted show
that no operation is pending on the other replica.

From the attributes you have shown, it seems that you went to the
backend directly, bypassing GlusterFS, and hand-crafted this
situation. Given the way the code is written, we do not think it is
possible to reach the state shown in your example.

The remote1 and remote2 attributes are all zeroes, which means
that no operations were pending on any server.
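
To make the reading of those attributes concrete, here is a minimal sketch that decodes a trusted.afr.* value as printed by getfattr. It assumes the value is 12 bytes holding three big-endian 32-bit pending counters in the order data, metadata, entry; that layout is my understanding of the AFR changelog and should be checked against the source for your version. The function names are made up for illustration and are not part of GlusterFS.

```python
import struct


def decode_afr_changelog(hex_value: str) -> dict:
    """Split a trusted.afr.* hex value into its three pending counters.

    Assumption: 12 bytes = three big-endian uint32 values, in the order
    data, metadata, entry.
    """
    digits = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


def has_pending(hex_value: str) -> bool:
    """True if any counter is non-zero, i.e. an operation is still pending."""
    return any(decode_afr_changelog(hex_value).values())


# The all-zero value reported for both attributes on both servers
# (24 hex zeros after the 0x prefix).
all_zero = "0x" + "00" * 12

for name in ("trusted.afr.remote1", "trusted.afr.remote2"):
    state = "pending" if has_pending(all_zero) else "clean"
    print(name, decode_afr_changelog(all_zero), state)
```

With every counter at zero, neither copy records an outstanding operation against the other, which is why the attributes alone give self-heal no reason to prefer one copy.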

If it was not hand-crafted, please provide a detailed test case that
leads to this situation based on file size alone.

If this situation was hand-crafted, it is akin to overwriting the
section of a disk that holds a filesystem's metadata and then
claiming that the FS is getting corrupted.

Please look at the code surrounding the part you pointed to in the
other mail; you will see the other, higher-order checks that are
made.

Regards,
Tejas.

----- Original Message -----
From: "Stephan von Krawczynski" <skraw@xxxxxxxxxx>
To: gluster-devel@xxxxxxxxxx
Sent: Tuesday, March 23, 2010 7:33:17 PM GMT +05:30 Chennai, Kolkata, Mumbai, New Delhi
Subject: Re: self heal problem

Let me show you some further information for one file that was falsely self-healed:

server1:

# getfattr -d -m '.*' -e hex <filename>
getfattr: Removing leading '/' from absolute path names
# file: <filename>
trusted.afr.remote1=0x000000000000000000000000
trusted.afr.remote2=0x000000000000000000000000
trusted.posix.gen=0x4b9bb33c00001be6

# stat <filename>
  File: <filename>
  Size: 4509            Blocks: 16         IO Block: 4096   regular file
Device: 804h/2052d      Inode: 16560280    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2010-03-23 11:10:36.000000000 +0100
Modify: 2010-03-23 00:32:25.000000000 +0100
Change: 2010-03-23 12:36:40.000000000 +0100

server2:

# getfattr -d -m '.*' -e hex <filename>
getfattr: Removing leading '/' from absolute path names
# file: <filename>
trusted.afr.remote1=0x000000000000000000000000
trusted.afr.remote2=0x000000000000000000000000
trusted.posix.gen=0x4b9bb2f600001be6

# stat <filename>
  File: <filename>
  Size: 4024            Blocks: 8          IO Block: 4096   regular file
Device: 804h/2052d      Inode: 42762291    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2010-03-23 11:10:36.000000000 +0100
Modify: 2010-03-23 14:32:23.000000000 +0100
Change: 2010-03-23 14:32:23.000000000 +0100

As you can see, the latest file version is on server2 (by modify date), and it is _smaller_ in size.

Now, on client 2, an ls shows interesting values:

# ls -l <filename>
-rw-r--r--  1 root root 4509 Mar 23 14:37 <filename>

As you can see here, the file date has moved forward, and the size clearly shows that self-heal went wrong.

Consequently the server2 copy now looks like:

# stat <filename>
  File: <filename>
  Size: 4509            Blocks: 16         IO Block: 4096   regular file
Device: 804h/2052d      Inode: 42762291    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2010-03-23 11:10:36.000000000 +0100
Modify: 2010-03-23 00:32:25.000000000 +0100
Change: 2010-03-23 14:41:13.000000000 +0100

The modification date went back and the file size increased, so the older file version was chosen to overwrite the newer one.
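
The observation can be restated as a small sketch using only the size and modify time from the stat output above. The pick_source helper is hypothetical and is not GlusterFS's selection logic; it only shows that the healed result matches what a "largest file wins" rule would produce, and not what a "newest mtime wins" rule would produce.

```python
from datetime import datetime

# Size and modify time of each backend copy before the heal,
# taken from the stat output above (local time, +0100).
replicas = {
    "server1": {"size": 4509, "mtime": datetime(2010, 3, 23, 0, 32, 25)},
    "server2": {"size": 4024, "mtime": datetime(2010, 3, 23, 14, 32, 23)},
}


def pick_source(copies: dict, key: str) -> str:
    """Hypothetical rule: return the copy with the largest value for `key`."""
    return max(copies, key=lambda name: copies[name][key])


by_size = pick_source(replicas, "size")
by_mtime = pick_source(replicas, "mtime")

print("largest file:", by_size)
print("newest mtime:", by_mtime)
```

The two rules disagree for this file, and the server2 copy after the heal (4509 bytes, modify time 00:32:25) carries the values of the copy that the size rule selects.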

-- 
Regards,
Stephan
