1.4.0RC6 AFR problems

freedman at FreeFormIT.com (Keith Freedman) · Tue, 23 Dec 2008 02:32:11 -0800

So, I had a drive failure on one of my boxes today, and it led to the
discovery of numerous issues:

1) When a drive is failing and one of the AFR servers is dealing with
I/O errors, the other one freaks out and sometimes crashes, but never
seems to hit a network timeout.

2) When starting gluster on the server with the new, empty drive, it
gave me a bunch of errors about things being out of sync, telling me
to delete a file from all but the preferred server.
This struck me as odd, since the drive was empty.
So I used the favorite-child option, but this isn't a preferred
solution long term.

3) One of the directories had 20GB of data in it. I went to do an
ls of the directory and had to wait while it auto-healed all the
files. While this is helpful, it would be nice to have gotten back
the directory listing without having to wait for 20GB of data to get
sent over the network.

4) While the other server was down, the up server kept failing
(signal 11?) and I had to constantly remount the filesystem. It was
giving me messages about the other node being down, which was fine,
but then it'd just die after a while, consistently.
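
For reference, the favorite-child workaround from (2) looks roughly
like the fragment below in the client volfile. This is only a sketch:
the volume names (afr0, remote1, remote2) are placeholders rather than
an actual config, and it assumes remote1 is the server that still has
the good data.

```
# Hypothetical AFR volume; remote1 and remote2 are assumed to be
# protocol/client volumes defined earlier in the same volfile.
volume afr0
  type cluster/afr
  # When the copies disagree, treat remote1's copy as authoritative.
  option favorite-child remote1
  subvolumes remote1 remote2
end-volume
```

Note that with this set, the favorite's copy overwrites the others
whenever a discrepancy is found, which is why it's not something to
leave in place long term.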