Hi all,
First of all, GlusterFS is a nice idea and I like it a lot, but it needs to
be a rock-solid product, so I ran some tests. I hope this bug report
will help you.
Configuration:
3 identical servers and a client, connected with TCP/IP
Servers: Mandrake 9.x (compiled and installed without problems). Client:
CentOS-5 x86_64, with FUSE 2.6.5 compiled and installed from source;
everything went OK.
The 3 servers provide 3 simple bricks, which are joined at the client in a
full-mirror (AFR x 3) configuration, plus the read-ahead and write-behind
translators.
I created a PostgreSQL tablespace (zone) on the mounted /mnt/gfs, copied a
50 MB table there, and then stress-tested it with various operations.
10 reads and full updates of every row in the table succeeded.
After a while, a simple "vacuum full analyze" gave the error:
glu=# vacuum full analyze;
ERROR: could not read block 43155 of relation 527933664/527933665/527933666: Transport endpoint is not connected
I repeated the tests many times; after 2-3 minutes of operation I got
the same error each time, in different places but mostly during WRITE
operations.
I deleted the whole mounted client disk and rebuilt it using the STRIPE
translator instead of AFR.
The behaviour was the same: after a couple of successful operations, I
got a failure.
Those are the facts; now for the logs and configuration files.
The client debug log shows these errors:
[1:48:22] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
[1:48:22] [DEBUG/tcp.c:123/cont_hand()] tcp:forcing poll/read/write to break on blocked socket (if any)
[1:48:22] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
[1:48:22] [ERROR/common-utils.c:110/full_rwv()] libglusterfs:full_rwv: 6689 bytes r/w instead of 8539 (Broken pipe)
[1:48:22] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
[1:48:22] [DEBUG/tcp.c:123/cont_hand()] tcp:forcing poll/read/write to break on blocked socket (if any)
[1:48:22] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
...
...
[1:48:22] [DEBUG/tcp.c:123/cont_hand()] tcp:forcing poll/read/write to break on blocked socket (if any)
[1:48:22] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
[1:48:22] [DEBUG/tcp.c:123/cont_hand()] tcp:forcing poll/read/write to break on blocked socket (if any)
[1:48:22] [DEBUG/client-protocol.c:2708/client_protocol_interpret()] protocol/client:frame not found for blk with callid: 139893
[1:48:22] [DEBUG/client-protocol.c:2605/client_protocol_cleanup()] protocol/client:cleaning up state in transport object 0x16595730
[1:48:22] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:client1: connection to server disconnected
[1:48:22] [CRITICAL/common-utils.c:215/gf_print_trace()] debug-backtrace:Got signal (11), printing backtrace
[1:48:22] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:/usr/lib/libglusterfs.so.0(gf_print_trace+0x21) [0x2aaaaaccf4a1]
[1:48:22] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:/lib64/libc.so.6 [0x2aaaab53b070]
[1:48:22] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:/usr/lib/glusterfs/1.3.0-pre4/xlator/performance/read-ahead.so(ra_frame_return+0x142) [0x2aaaac4da2a2]
[1:48:22] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:/usr/lib/glusterfs/1.3.0-pre4/xlator/performance/read-ahead.so [0x2aaaac4d9daa]
[1:48:22] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:[glusterfs] [0x40910b]
[1:48:22] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:/usr/lib64/libfuse.so.2 [0x2aaaaaee3059]
[1:48:22] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:[glusterfs] [0x402f29]
[1:48:22] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:/usr/lib/libglusterfs.so.0(sys_epoll_iteration+0xd4) [0x2aaaaacd0ef4]
[1:48:22] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:[glusterfs] [0x402898]
[1:48:22] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:/lib64/libc.so.6(__libc_start_main+0xf4) [0x2aaaab5288a4]
[1:48:22] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:[glusterfs] [0x4025f9]
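For readers less familiar with the full_rwv message: it appears to come from a helper that tries to transfer a complete buffer and gives up when the socket breaks part-way (here 6689 of 8539 bytes got through before the Broken pipe). The sketch below is a hypothetical Python illustration of that general pattern, not GlusterFS code; the name full_write and the message format are my own assumptions. It keeps calling os.write until the whole buffer is sent and, like the log line, reports how many bytes made it if the peer goes away.

```python
import os


def full_write(fd: int, data: bytes) -> int:
    """Hypothetical helper: write all of `data` to `fd`, looping over short writes.

    Returns the number of bytes written. If the peer goes away part-way
    (e.g. EPIPE / Broken pipe), re-raises the error with a message recording
    how many bytes were transferred, in the style of the full_rwv log line.
    """
    view = memoryview(data)
    written = 0
    while written < len(data):
        try:
            # os.write may transfer fewer bytes than requested (a short write);
            # Python itself retries on EINTR (PEP 475).
            n = os.write(fd, view[written:])
        except OSError as exc:
            # OSError(errno, msg) yields the matching subclass, e.g. BrokenPipeError.
            raise OSError(
                exc.errno,
                f"{written} bytes r/w instead of {len(data)} "
                f"({os.strerror(exc.errno)})",
            ) from exc
        written += n
    return written
```

For example, writing to a pipe whose read end has already been closed raises BrokenPipeError carrying a message such as "0 bytes r/w instead of 10 (Broken pipe)".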
The server configuration files look like this:
-------------------
volume brick
type storage/posix
option directory /var/gldata
end-volume
volume server
type protocol/server
option transport-type tcp/server
option listen-port 6996
option bind-address 29.11.276.x
subvolumes brick
option auth.ip.brick.allow *
end-volume
-------------------
The client configuration file is:
-------------------
volume clientX #{1,2,3}
type protocol/client
option transport-type tcp/client
option remote-host A.B.C.X
option remote-port 6996
option remote-subvolume brick
end-volume
### Add AFR feature to brick
volume afr
type cluster/afr
subvolumes client1 client2 client3
option replicate *:3 # All files 3 copies
end-volume
#volume stripe
# type cluster/stripe
# subvolumes client1 client2 client3
# option block-size *:256kB
#end-volume
#volume trace
# type debug/trace
# subvolumes afr
# option debug on
#end-volume
volume writebehind
type performance/write-behind
option aggregate-size 131072 # aggregate block size in bytes
subvolumes afr
end-volume
volume readahead
type performance/read-ahead
option page-size 131072 ### size in bytes
 option page-count 16 ### page-size x page-count is the amount of read-ahead data per file
subvolumes writebehind
end-volume
-------------------
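As a quick sanity check on the read-ahead comment in the client spec, the sketch below works out the numbers implied by these settings. The values are copied from the volume spec; the helper name read_ahead_window is hypothetical, and the 8 KiB figure is an assumption (PostgreSQL's default block size, not verified on this build).

```python
# Values copied from the client volume spec above.
PAGE_SIZE = 131072       # read-ahead: option page-size (bytes)
PAGE_COUNT = 16          # read-ahead: option page-count
AGGREGATE_SIZE = 131072  # write-behind: option aggregate-size (bytes)
PG_BLOCK_SIZE = 8192     # assumption: PostgreSQL's default BLCKSZ (8 KiB)


def read_ahead_window(page_size: int, page_count: int) -> int:
    """Hypothetical helper: read-ahead data per file = page-size x page-count."""
    return page_size * page_count


window = read_ahead_window(PAGE_SIZE, PAGE_COUNT)
print(f"read-ahead per file: {window} bytes ({window // (1024 * 1024)} MiB)")
print(f"  = {window // PG_BLOCK_SIZE} PostgreSQL blocks of {PG_BLOCK_SIZE} bytes")
print(f"write-behind aggregate: {AGGREGATE_SIZE // 1024} KiB")
```

With these settings each open file can have up to 2 MiB of read-ahead data (16 pages of 128 KiB), and write-behind aggregates writes into 128 KiB blocks.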