Problem with active / active NFS cluster

Dear all,

My apologies up-front if this is the wrong mailing list for this post.

To serve the needs of a fairly large (~ 1000 nodes) cluster for an
experiment at CERN we use the following setup:
5 servers running RHEL 5.3 (2.6.18-128.1.16.el5 x86_64) export two
filesystems rw via NFS v3 over UDP to the entire cluster. The clients
are predominantly RHEL4 (2.6.9-78.0.8.EL.cernsmp), both 32- and 64-bit.
The filesystems reside on a SAN and the servers mount them using
StorNext cvfs 3.5.1.
For load-sharing and high-availability we use the following setup:
there are as many virtual IPs as there are physical servers. The
clients mount from a DNS alias. When all servers are active each
server will export both file-systems via one of the virtual IPs. We
use heartbeat to handle failovers, which works very nicely.
/etc/exports on all servers looks like this:
<SNIP>
*lbdaq.cern.ch(rw,sync,root_squash,no_subtree_check,fsid=100)
</SNIP>

Recently users have reported the following strange problems:

1) A directory (with some subdirectories and files) was deleted on one
client. Immediately afterwards the directory was re-created from a CVS
checkout. On at least one other client this change was *not* visible.
The original contents (before the unlink) remained visible for more
than 1 hour (!). Unfortunately the problem "vanished" before we could
get traces (wireshark etc...). Fact is that the two client nodes in
question have mounted the fs from different servers.
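
Should the problem show up again, a check along the following lines could
be run on both clients to compare what each one sees for the directory.
This is only a minimal sketch: the function name snapshot and the example
path are hypothetical, it assumes a reasonably recent Python on the
clients, and it uses nothing beyond the standard os module.

```python
import os


def snapshot(path):
    """Return what this client currently sees for the directory `path`.

    The values come through this client's NFS attribute and lookup
    caches, so two clients may legitimately differ for a short time.
    """
    st = os.stat(path)
    return {
        "inode": st.st_ino,                    # changes if the directory was re-created
        "mtime": st.st_mtime,                  # last modification time as seen here
        "entries": sorted(os.listdir(path)),   # names visible on this client
    }
```

Calling snapshot("/some/shared/dir") on each client at about the same time
and comparing the results would show whether the inode number, the mtime
or the list of entries differs between the two.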

2) One user claims that a freshly checked out directory (on which he
also worked from another node) has "vanished".

After googling for a while I realized that, at least judging from a
superficial search, this specific configuration, where several
servers serve the same filesystem at the same time, is rare.
My question now is: is there something fundamentally flawed or risky
in this setup? My (admittedly) not very deep understanding of NFS is
that this should not be worse in terms of (potential) caching problems
than a setup with a single server.

Any hint / comment is greatly appreciated.

Regards,
Niko