Hi all,
I have an HA configuration: 3 nodes with replication *:3 on the server
side. There are 3 servers, and 1 client on a host that is also running a
gluster server. AFR and unify are on the server side, exactly as
described in the wiki's tutorials (Kritical's). I launch my 3 servers,
and they communicate and establish all connections correctly with one
another. But when I execute the client, mounting the volume reports
inconsistencies on the server the client connects to, and self-heal
reports a mismatch. It seems that self-healing is started, or at least
requested, for these files on all servers, but the client enters an
infinite loop and the load average climbs above 60; this client/server
host becomes unavailable.
GlusterFS release 1.3.5 (gcc 3.3, glibc 2.3.6; Linux kernel 2.6.16.latest)
FUSE 2.7.0-glfs4
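For reference, the layout follows the wiki's server-side AFR + unify
pattern; a minimal sketch of one server's spec file (brick paths, volume
names, and the remote-peer volumes are placeholders, not my actual
config):

```
# local storage brick exported by this server (path is a placeholder)
volume posix-local
  type storage/posix
  option directory /export/brick
end-volume

# AFR mirrors the local brick with the bricks of the two other servers
# ("remote1"/"remote2" stand in for protocol/client volumes); the
# pattern *:3 keeps three copies of every file
volume afr
  type cluster/afr
  option replicate *:3
  subvolumes posix-local remote1 remote2
end-volume

# unify sits on top of the AFR volumes ("afr-ns" stands in for the
# required namespace volume)
volume unify
  type cluster/unify
  option namespace afr-ns
  option scheduler rr
  subvolumes afr
end-volume
```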
Server log (the server the client connects to):
2007-10-17 09:18:21 W [unify-self-heal.c:261:unify_sh_readdir_cbk]
tbs-clust-unify-or2: found mismatch in mode/uid/gid for .qmail-srv376cc
2007-10-17 09:18:21 W [unify-self-heal.c:261:unify_sh_readdir_cbk]
tbs-clust-unify-or2: found mismatch in mode/uid/gid for
.qmail-srv376cc-accept-default
Client log:
2007-10-17 09:17:54 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse:
LOOKUP 1/home (/home)
2007-10-17 09:17:54 D [fuse-bridge.c:346:fuse_entry_cbk] glusterfs-fuse:
ENTRY => 3424340
2007-10-17 09:17:54 D [inode.c:559:__create_inode] fuse/inode: create
inode(3424340)
2007-10-17 09:17:54 D [inode.c:351:__active_inode] fuse/inode:
activating inode(3424340), lru=0/1024
2007-10-17 09:17:54 D [inode.c:381:__passive_inode] fuse/inode:
passivating inode(3424340), lru=1/1024
2007-10-17 09:17:54 D [inode.c:308:__destroy_inode] fuse/inode: destroy
inode(0) [@0x8092c78]
2007-10-17 09:17:54 D [inode.c:351:__active_inode] fuse/inode:
activating inode(3424340), lru=0/1024
2007-10-17 09:17:54 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse:
LOOKUP 3424340/http2 (/home/http2)
2007-10-17 09:18:19 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse:
LOOKUP 1/home (/home)
2007-10-17 09:18:21 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse:
LOOKUP 1/home (/home)
2007-10-17 09:18:44 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse:
LOOKUP 1/home (/home)
[...] loops for ages
2007-10-17 09:41:16 W [client-protocol.c:210:call_bail]
tbs-clust-server-or2: activating bail-out. pending frames = 53. last
sent = 2007-10-17 09:38:20. last received = 2007-10-17 09:17:54
transport-timeout = 108
2007-10-17 09:41:16 C [client-protocol.c:218:call_bail]
tbs-clust-server-or2: bailing transport
2007-10-17 09:41:16 W [client-protocol.c:4324:client_protocol_cleanup]
tbs-clust-server-or2: cleaning up state in transport
object 0x8090e78
2007-10-17 09:41:16 D [tcp.c:131:cont_hand] tcp: forcing poll/read/write
to break on blocked socket (if any)
2007-10-17 09:41:16 W [client-protocol.c:4376:client_protocol_cleanup]
tbs-clust-server-or2: forced unwinding frame type(0)
op(34) reply=@0x80986b8
2007-10-17 09:41:16 D [fuse-bridge.c:378:fuse_entry_cbk] glusterfs-fuse:
ERR => -1 (107)
2007-10-17 09:41:16 D [inode.c:308:__destroy_inode] fuse/inode: destroy
inode(0) [@0x8093088]
[...]
2007-10-17 09:45:01 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse:
LOOKUP 1/home (/home)
2007-10-17 09:45:01 D [fuse-bridge.c:346:fuse_entry_cbk] glusterfs-fuse:
ENTRY => 3424340
2007-10-17 09:45:01 D [inode.c:559:__create_inode] fuse/inode: create
inode(3424340)
2007-10-17 09:45:01 D [inode.c:351:__active_inode] fuse/inode:
activating inode(3424340), lru=0/1024
2007-10-17 09:45:01 D [inode.c:381:__passive_inode] fuse/inode:
passivating inode(3424340), lru=1/1024
2007-10-17 09:45:01 D [inode.c:308:__destroy_inode] fuse/inode: destroy
inode(0) [@0x80930f0]
2007-10-17 09:45:01 D [inode.c:351:__active_inode] fuse/inode:
activating inode(3424340), lru=0/1024
2007-10-17 09:45:01 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse:
LOOKUP 3424340/http2 (/home/http2)
2007-10-17 09:45:08 D [client-protocol.c:4301:client_protocol_reconnect]
tbs-clust-server-or2: breaking reconnect chain
2007-10-17 09:45:26 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse:
LOOKUP 1/home (/home)
2007-10-17 09:45:27 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse:
LOOKUP 1/home (/home)
2007-10-17 09:45:51 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse:
LOOKUP 1/home (/home)
2007-10-17 09:46:16 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse:
LOOKUP 1/home (/home)
-> and it loops again as before
Logs from the other 2 servers:
2007-10-17 09:17:54 D [inode.c:351:__active_inode] tbs-clust-dor3/inode:
activating inode(9568336), lru=2/1024
2007-10-17 09:17:54 D [inode.c:381:__passive_inode]
tbs-clust-dor3/inode: passivating inode(9568336), lru=3/1024
2007-10-17 09:17:54 D [inode.c:308:__destroy_inode]
tbs-clust-dor3/inode: destroy inode(0) [@0xb7c00df0]
2007-10-17 09:17:54 D [inode.c:351:__active_inode]
tbs-clust-nsor3/inode: activating inode(9666831), lru=2/1024
2007-10-17 09:17:54 D [inode.c:381:__passive_inode]
tbs-clust-nsor3/inode: passivating inode(9666831), lru=3/1024
2007-10-17 09:17:54 D [inode.c:308:__destroy_inode]
tbs-clust-nsor3/inode: destroy inode(0) [@0xb7a01150]
NB: the directory where the mismatches are found holds a large number of
files (a few tens of thousands). When I ls that directory (on the
glusterfs mount point), it takes ages to return. There are also many
symlinks in that directory (about 20%).
I saw a stat-prefetch translator that I don't have at present; can it
help in this situation? Is it post-1.3.5 functionality? (1.3.6?)
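If it is available, my understanding is that it would be loaded on the
client, on top of the top-level volume, something like the sketch below
(whether the translator exists in my release is exactly my question;
"client-top" is a placeholder for my actual top-level client volume):

```
# hypothetical client-side placement, assuming performance/stat-prefetch
# is present in this build
volume sp
  type performance/stat-prefetch
  subvolumes client-top
end-volume
```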
--
Vincent Régnard
vregnard-at-tbs-internet-dot-com
TBS-internet.com
027 630 5902