Hi,
Thank you for your answer, and even more for the question! Until now, I was using FUSE. Today I changed all mounts to NFS, using the same 3.7.17 version. But the problem is still the same. Now the NFS logfile contains lines like this:

    [2016-12-06 15:12:29.006325] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-gv0-client-7: server X.X.18.62:49153 has not responded in the last 42 seconds, disconnecting.

Interestingly enough, the IP address X.X.18.62 belongs to the same machine! As I wrote earlier, each node serves as both a server and a client, since each node contributes bricks to the volume. Every server connects to itself via its own hostname. For example, the fstab on the node "giant2" looks like this:

    #giant2:/gv0 /shared_data glusterfs defaults,noauto 0 0
    #giant2:/gv2 /shared_slurm glusterfs defaults,noauto 0 0
    giant2:/gv0 /shared_data nfs defaults,_netdev,vers=3 0 0
    giant2:/gv2 /shared_slurm nfs defaults,_netdev,vers=3 0 0

So I understand the disconnects even less.

I don't know if it's possible to create a dummy cluster that exhibits the same behaviour, because the disconnects only happen when compute jobs are running on those nodes, and they are GPU compute jobs, which cannot easily be emulated in a VM.

As we have more clusters (which are running fine with an ancient 3.4 version :-)) and we are currently not dependent on this particular cluster (which may stay like this for this month, I think), I should be able to deploy the debug build on the "real" cluster, if you can provide one.

Regards and thanks,
Micha

On 06.12.2016 at 08:15, Mohammed Rafi K C wrote: