Hi all,

While running a series of FFSB tests against my newly-created Gluster cluster, I caused glusterfsd to crash on one of the two storage nodes. The relevant lines from the log file are pastebin'd here: http://pastebin.ca/970831

Even more troubling is that when I restarted glusterfsd, the node did /not/ self-heal.

The mountpoint on the client:

[dfsA]# du -s /opt/gfs-mount/
2685304 /opt/gfs-mount/

The DS on the node which did not fail:

[dfsC]# du -s /opt/gfs-ds/
2685328 /opt/gfs-ds/

The DS on the node which failed, ~5 minutes after restarting glusterfsd:

[dfsD]# du -s /opt/gfs-ds/
27092   /opt/gfs-ds/

Even MORE troubling: I restarted glusterfsd on the node which did not fail, to see if that would help, and that created even more bizarre results.

The mountpoint on the client:

[dfsA]# du -s /opt/gfs-mount/
17520   /opt/gfs-mount/

The DS on the node which did not fail:

[dfsC]# du -s /opt/gfs-ds/
2685328 /opt/gfs-ds/

The DS on the node which failed:

[dfsD]# du -s /opt/gfs-ds/
27092   /opt/gfs-ds/

A simple visual inspection shows that the files and directories are clearly different between the client and the two storage nodes. For example:

(Client)

[dfsA]# ls fillfile*
fillfile0   fillfile11  fillfile14  fillfile2  fillfile5  fillfile8
fillfile1   fillfile12  fillfile15  fillfile3  fillfile6  fillfile9
fillfile10  fillfile13  fillfile16  fillfile4  fillfile7

[dfsA]# ls -l fillfile?
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile0
-rwx------ 1 root root 131072 2008-04-04 09:42 fillfile1
-rwx------ 1 root root 131072 2008-04-04 09:42 fillfile2
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile3
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile4
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile5
-rwx------ 1 root root      0 2008-04-04 09:42 fillfile6
-rwx------ 1 root root      0 2008-04-04 09:42 fillfile7
-rwx------ 1 root root 196608 2008-04-04 09:42 fillfile8
-rwx------ 1 root root      0 2008-04-04 09:42 fillfile9

(Node that didn't fail)

[dfsC]# ls fillfile*
fillfile0   fillfile13  fillfile18  fillfile22  fillfile4  fillfile9
fillfile1   fillfile14  fillfile19  fillfile23  fillfile5
fillfile10  fillfile15  fillfile2   fillfile24  fillfile6
fillfile11  fillfile16  fillfile20  fillfile25  fillfile7
fillfile12  fillfile17  fillfile21  fillfile3   fillfile8

[dfsC]# ls -l fillfile?
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile0
-rwx------ 1 root root 131072 2008-04-04 09:42 fillfile1
-rwx------ 1 root root 131072 2008-04-04 09:42 fillfile2
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile3
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile4
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile5
-rwx------ 1 root root      0 2008-04-04 09:42 fillfile6
-rwx------ 1 root root      0 2008-04-04 09:42 fillfile7
-rwx------ 1 root root 196608 2008-04-04 09:42 fillfile8
-rwx------ 1 root root      0 2008-04-04 09:42 fillfile9

(Node that failed)

[dfsD]# ls fillfile*
fillfile0   fillfile11  fillfile14  fillfile2  fillfile5  fillfile8
fillfile1   fillfile12  fillfile15  fillfile3  fillfile6  fillfile9
fillfile10  fillfile13  fillfile16  fillfile4  fillfile7

[dfsD]# ls -l fillfile?
-rwx------ 1 root root   65536 2008-04-04 09:08 fillfile0
-rwx------ 1 root root  131072 2008-04-04 09:08 fillfile1
-rwx------ 1 root root 4160139 2008-04-04 09:08 fillfile2
-rwx------ 1 root root  327680 2008-04-04 09:08 fillfile3
-rwx------ 1 root root  262144 2008-04-04 09:08 fillfile4
-rwx------ 1 root root   65536 2008-04-04 09:08 fillfile5
-rwx------ 1 root root 1196446 2008-04-04 09:08 fillfile6
-rwx------ 1 root root  131072 2008-04-04 09:08 fillfile7
-rwx------ 1 root root 3634506 2008-04-04 09:08 fillfile8
-rwx------ 1 root root  131072 2008-04-04 09:08 fillfile9

What the heck is going on here? Three wildly different results - that's really not a good thing. These results seem "permanent" as well: after waiting a good 10 minutes (and executing the same du command a few more times), the results are the same.

Finally, I edited "fillfile6" (0 bytes on dfsA and dfsC, 1196446 bytes on dfsD) via the mountpoint on dfsA, and the changes were immediately reflected on the storage nodes. Clearly the AFR translator is operational /now/, but the enormous discrepancy is not a good thing, to say the least.
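In case it helps anyone point me in the right direction, here is roughly what I intend to try next. This is based on my (possibly mistaken) understanding that AFR only self-heals a file when it is actually looked up / read through the mountpoint, and that AFR keeps its replication metadata in extended attributes on the backend files; the fillfile6 path below is just an example, not necessarily where that file actually lives under the DS.

# On the client: read one byte of every file through the mountpoint, which
# (as I understand it) should prompt AFR to self-heal each file it touches.
find /opt/gfs-mount -type f -exec head -c1 '{}' \; > /dev/null

# On each storage node: dump all extended attributes (hex-encoded) of a
# suspect backend copy, to compare whatever AFR metadata dfsC and dfsD hold.
getfattr -d -m . -e hex /opt/gfs-ds/fillfile6

If someone can tell me what those attributes are supposed to look like on a healthy pair of nodes, that would go a long way towards helping me understand what I'm seeing here.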
--
Daniel Maher <dma AT witbe.net>