2011/6/28 Darren Austin <darren-lists at widgit.com>:
> Also, when one of the servers disconnects, is it normal that the client "stalls" the write until the keepalive time expires and the online servers notice one has vanished?

You can lower the network.ping-timeout parameter from its default of 42 seconds to 5 or 10 seconds to reduce how long the client stalls.

> Finally, during my testing I encountered a replicable hard lock up of the client... here's the situation:
> Server1 and Server2 in the cluster, sharing 'data-volume' (which is /data on both servers).
> Client mounts server1:data-volume as /mnt.
> Client begins to write a large (1 or 2 GB) file to /mnt (I just used random data).
> Server1 goes down part way through the write (I simulated this by iptables -j DROP'ing everything from relevant IPs).
> Client "stalls" writes until the keepalive timeout, and then continues to send data to Server2.
> Server1 comes back online shortly after the keepalive timeout - but BEFORE the Client has written all the data to Server2.
> Server1 and Server2 reconnect and the writes on the Client completely hang.

I have a similar problem with a file I use as a KVM virtual disk.

> The mounted directory on the client becomes completely inaccessible when the two servers reconnect.

Unfortunately, that is normal :-|

> I had to kill -9 the dd process doing the write (along with the glusterfs process on the client) in order to release the mountpoint.

If you don't kill the process and instead wait until all nodes are synchronized, the whole system should become responsive again. To force a synchronization of the whole volume you can run this command on the client:

find <gluster-mount> -noleaf -print0 | xargs --null stat >/dev/null

... and wait.

http://www.gluster.com/community/documentation/index.php/Gluster_3.2:_Triggering_Self-Heal_on_Replicate

Craig Carl told me, three days ago:
------------------------------------------------------
 that happens because Gluster's self heal is a blocking operation. We are working on a non-blocking self heal, we are hoping to ship it in early September.
------------------------------------------------------

You can verify this directly in your client log, where you will see something like:

[2011-06-28 13:28:17.484646] I [client-lk.c:617:decrement_reopen_fd_count] 0-data-volume-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP

Marco
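
P.S. If it helps, this is how I lower the timeout on my own cluster - a sketch assuming the volume is named 'data-volume' as in your setup; run it on any server in the cluster:

gluster volume set data-volume network.ping-timeout 10

You can check the value afterwards with 'gluster volume info data-volume'.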
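P.P.S. For anyone who wants to reproduce the test, a minimal sketch of the outage simulation Darren described, assuming server1 answers at 192.168.1.101 (substitute the real address):

# on the client: drop all traffic coming from server1
iptables -A INPUT -s 192.168.1.101 -j DROP

# delete the rule to let server1 "come back online"
iptables -D INPUT -s 192.168.1.101 -j DROP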