I have implemented a cluster of a few Xen guests with a shared GFS
filesystem residing on a SAN built with Openfiler to provide iSCSI storage.
The physical servers are 3 machines forming a physical cluster, each
one equipped with a quad Xeon and 4 GB of RAM. The network interface on
each physical host uses channel bonding with LACP, giving an aggregate
of 2 Gbit/s Ethernet per host; the switch supports LACP and has been
configured accordingly.
The virtual servers are Xen guests running on top of the physical
servers, with shared storage on iSCSI and GFS.
The networking consists of a cluster private network (for cluster
heartbeat, cluster communication and iSCSI) and an Ethernet alias for
the LAN to which the users are connected.
One of the cluster Xen nodes runs a Samba PDC (no failover of the
service, plain Samba, single Samba server on the LAN) plus an LDAP
server; Samba uses LDAP for user authentication.
Storage for the Samba server is on the SAN.
I continue to receive complaints from my users because copying files
sometimes generates errors, and there are also problems with Office
(we still use the old Office 97 on some machines). The Samba
configuration is more or less the same as the one that worked correctly
on the previous physical machine, where these problems did not occur.
The problems generate these log entries in /var/log/samba/smbd:
[2008/04/02 19:00:50, 0] lib/util_sock.c:get_peer_addr(1232)
getpeername failed. Error was Transport endpoint is not connected
[2008/04/02 19:05:32, 0] lib/util_sock.c:get_peer_addr(1232)
getpeername failed. Error was Transport endpoint is not connected
[2008/04/02 19:05:32, 0] lib/util_sock.c:get_peer_addr(1232)
getpeername failed. Error was Transport endpoint is not connected
And in the client machine log, also in /var/log/samba:
[2008/04/02 19:04:34, 0] lib/util_sock.c:read_data(534)
read_data: read failure for 4 bytes to client 192.168.13.240. Error =
Connection reset by peer
[2008/04/02 19:04:34, 1] smbd/service.c:close_cnum(1230)
amhwq53p (192.168.13.240) closed connection to service tmp
[2008/04/02 19:04:34, 1] smbd/service.c:close_cnum(1230)
amhwq53p (192.168.13.240) closed connection to service stock
[2008/04/02 19:04:34, 0] lib/util_sock.c:write_data(562)
write_data: write failure in writing to client 192.168.13.240. Error
Broken pipe
[2008/04/02 19:04:34, 0] lib/util_sock.c:send_smb(769)
Error writing 75 bytes to client. -1. (Broken pipe)
[2008/04/02 19:04:34, 1] smbd/service.c:make_connection_snum(1033)
They look like the errors caused by poor connectivity or a network
problem; however, these problems are new and never appeared before
switching to the clustered architecture. Also, no problems have been
found so far on the other Xen nodes serving the same GFS filesystem
(different directories!) for NFS or other services.
Also, setting the option
posix locking = no
in the smb.conf file did not help.
Any ideas from anyone who has faced the same problems?
Thanks, Paolo