Darren,

Can you get us the process state dumps of the client when it is hung? (Run kill -USR1 <pid> on the mount process, then gzip /tmp/glusterdump.<pid>.) That will help us figure out exactly what was happening.

Avati

On Tue, Jun 28, 2011 at 8:00 PM, Darren Austin <darren-lists at widgit.com> wrote:

> > Can you check the server (brick) logs to check the order of detected
> > disconnection and new/reconnection from the client?
>
> Hi,
> It seems this wasn't due to keepalives - the system time on both servers
> was a few seconds out. After a pointer from someone off-list, I synced the
> time and ran ntpd (which I wasn't doing as this was just a test system) and
> did some more tests.
>
> The partial-file syndrome I noted before seems to have gone away - at least
> in terms of the file not syncing back to the previously disconnected server
> after it finds its way back into the cluster. Once the keepalive timeout is
> reached, the client sends all the data to the second server.
>
> A quick question on that actually - when all servers are online, are the
> clients supposed to send the data to both at the same time? I see from
> monitoring the traffic that the client duplicates the writes - one to each
> server?
>
> Also, when one of the servers disconnects, is it normal that the client
> "stalls" the write until the keepalive time expires and the online servers
> notice one has vanished?
>
> Finally, during my testing I encountered a replicable hard lock up of the
> client... here's the situation:
> Server1 and Server2 in the cluster, sharing 'data-volume' (which is /data
> on both servers).
> Client mounts server1:data-volume as /mnt.
> Client begins to write a large (1 or 2 GB) file to /mnt (I just used
> random data).
> Server1 goes down part way through the write (I simulated this by iptables
> -j DROP'ing everything from relevant IPs).
> Client "stalls" writes until the keepalive timeout, and then continues to
> send data to Server2.
> Server1 comes back online shortly after the keepalive timeout - but BEFORE
> the Client has written all the data to Server2.
> Server1 and Server2 reconnect and the writes on the Client completely
> hang.
>
> The mounted directory on the client becomes completely inaccessible when
> the two servers reconnect.
> I had to kill -9 the dd process doing the write (along with the glusterfs
> process on the client) in order to release the mountpoint.
>
> I've reproduced this issue several times now and the result is always the
> same. If the client is writing data to a server when one of the others
> comes back online after an outage, the client will hang.
>
> I've attached logs for one of the times I tested this - I hope it helps in
> diagnosing the problem :)
>
> Let me know if you need any more info.
>
> --
> Darren Austin - Systems Administrator, Widgit Software.
> Tel: +44 (0)1926 333680. Web: http://www.widgit.com/
> 26 Queen Street, Cubbington, Warwickshire, CV32 7NA.
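
The state dump requested at the top of this message could be captured roughly as follows. This is only a sketch: the function names dump_path and capture_dump are hypothetical, the two-second wait is an assumption, and the dump location /tmp/glusterdump.<pid> is taken from the message itself rather than checked against any particular GlusterFS release.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical helpers, not part of GlusterFS.
# The dump location /tmp/glusterdump.<pid> is the one quoted in this thread.

# Print the path where the dump for the given PID is expected to appear.
dump_path() {
    printf '/tmp/glusterdump.%s\n' "$1"
}

# Send SIGUSR1 to the hung client mount process, wait briefly for the
# dump file to be written, then compress it for attaching to a reply.
capture_dump() {
    pid=$1
    kill -USR1 "$pid" || return 1
    sleep 2   # assumption: allow the process a moment to write the dump
    gzip "$(dump_path "$pid")"
}
```

With these defined, running capture_dump with the PID of the glusterfs client process for the hung mount should leave a gzipped dump next to the path printed by dump_path, ready to send to the list.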