Error report: glusterfs2.0rc4 abend -- "readv failed (Bad address) "

list2008 at lunch.za.net (Andrew McGill) · Sat, 14 Mar 2009 19:12:03 +0200

Hello,

I upgraded from glusterfs-1.3.12.tar.gz to glusterfs-2.0.0rc4.tar.gz because I 
could not complete a rdiff-backup without inexplicable errors.  My efforts 
have been rewarded with a crash, for which some logs are displayed below.

The backend is unify, with multiple afr subvolumes of two ext3 volumes each. 

Here is how the client side of glusterfs died:

2009-03-14 13:32:15 W [afr-self-heal-data.c:798:afr_sh_data_fix] afr4: Picking 
favorite child u100-rs1 as authentic source to resolve conflicting data 
of /backup5/robbie.foo.co.za/rdiff-backup-data/mirror_metadata.2009-03-14T08:00:17+02:00.snapshot.gz
2009-03-14 13:32:15 W [afr-self-heal-data.c:646:afr_sh_data_open_cbk] afr4: 
sourcing file /backup5/robbie.foo.co.za/rdiff-backup-data/mirror_meta
data.2009-03-14T08:00:17+02:00.snapshot.gz from u100-rs1 to other sinks
2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] u50-dcc1: readv failed (Bad 
address)
2009-03-14 13:32:22 E [socket.c:634:__socket_proto_state_machine] u50-dcc1: 
read (Bad address) in state 3 (192.168.227.65:6996)
2009-03-14 13:32:22 E [saved-frames.c:169:saved_frames_unwind] u50-dcc1: 
forced unwinding frame type(1) op(READ)
2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] u50-dr1: readv failed (Bad 
address)
2009-03-14 13:32:22 E [socket.c:634:__socket_proto_state_machine] u50-dr1: 
read (Bad address) in state 3 (192.168.227.31:6996)
2009-03-14 13:32:22 E [saved-frames.c:169:saved_frames_unwind] u50-dr1: forced 
unwinding frame type(1) op(READ)
2009-03-14 13:32:22 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: 
5998294: READ => -1 (Transport endpoint is not connected)
2009-03-14 13:33:03 E [socket.c:102:__socket_rwv] u50-dr2: readv failed (Bad 
address)
2009-03-14 13:33:03 E [socket.c:634:__socket_proto_state_machine] u50-dr2: 
read (Bad address) in state 3 (192.168.227.32:6996)
2009-03-14 13:33:03 E [saved-frames.c:169:saved_frames_unwind] u50-dr2: forced 
unwinding frame type(1) op(READ)
2009-03-14 13:33:03 E [socket.c:102:__socket_rwv] u50-rs3: readv failed (Bad 
address)
2009-03-14 13:33:03 E [socket.c:634:__socket_proto_state_machine] u50-rs3: 
read (Bad address) in state 3 (192.168.227.59:6996)
2009-03-14 13:33:03 E [saved-frames.c:169:saved_frames_unwind] u50-rs3: forced 
unwinding frame type(1) op(READ)
2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: 
6006118: READ => -1 (Transport endpoint is not connected)
2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: 
6006119: READ => -1 (Transport endpoint is not connected)
2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: 
6006120: READ => -1 (Transport endpoint is not connected)
2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: 
6006121: READ => -1 (Transport endpoint is not connected)
pending frames:
?
patchset: cb602a1d7d41587c24379cb2636961ab91446f86 +
signal received: 6
configuration details:argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 2.0.0rc4
[0x381420]
/lib/libc.so.6(abort+0x101)[0xb86451]
/usr/lib/glusterfs/2.0.0rc4/xlator/mount/fuse.so[0x54b9a8]
/lib/libpthread.so.0[0xd302db]
/lib/libc.so.6(clone+0x5e)[0xc2912e]
---------

On the server side, the following messages don't enlighten me, but do remind 
me that there was another client running version 1.13 still connecting.  It 
looks like the server just noticed that the client died.

2009-03-14 13:30:03 E [socket.c:583:__socket_proto_state_machine] server: 
socket header validate failed (192.168.227.167:1023). possible mismatch of
 transport-type between server and client volumes, or version mismatch
2009-03-14 13:30:03 N [server-protocol.c:8048:notify] server: 
192.168.227.167:1023 disconnected
2009-03-14 13:31:45 E [socket.c:463:__socket_proto_validate_header] server: 
socket header signature does not match :O (42.6c.6f)
2009-03-14 13:31:45 E [socket.c:583:__socket_proto_state_machine] server: 
socket header validate failed (192.168.227.167:1023). possible mismatch of
 transport-type between server and client volumes, or version mismatch
2009-03-14 13:31:45 N [server-protocol.c:8048:notify] server: 
192.168.227.167:1023 disconnected
2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] server: readv failed 
(Connection reset by peer)
2009-03-14 13:32:22 E [socket.c:561:__socket_proto_state_machine] server: read 
(Connection reset by peer) in state 1 (192.168.227.5:1020)
2009-03-14 13:32:22 N [server-protocol.c:8048:notify] server: 
192.168.227.5:1020 disconnected
2009-03-14 13:32:22 N [server-protocol.c:7295:mop_setvolume] server: accepted 
client from 192.168.227.5:1020
2009-03-14 13:35:48 N [server-protocol.c:8048:notify] server: 
192.168.227.5:1017 disconnected
2009-03-14 13:35:48 N [server-protocol.c:8048:notify] server: 
192.168.227.5:1020 disconnected
2009-03-14 13:35:48 N [server-helpers.c:515:server_connection_destroy] server: 
destroyed connection of backup5.foo.com-23205-2009/03/14-07:10:
52:777008-u50-dcc1

On another server brick, the 25Gb volume u50-dr1-raw was full (it should have 
been 50Gb like its peer).  As I recall, the free space of the second volume 
of AFR which does not get checked for disk space (a bug, IMHO).  

It said this, which could have led to the client-side failure a few minutes 
later (the clocks are in sync):

2009-03-14 13:30:23 W [posix.c:773:posix_mkdir] u50-dr1-raw: mkdir 
of /backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1: No 
space left on device
2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server: 
1109657: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
backup.tmp.1) on u50-dr1 returning error: -1 (2)
2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server: 
1109658: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
backup.tmp.1) on u50-dr1 returning error: -1 (2)
2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: 
3184942: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
backup.tmp.1) on u50-dr1 for key <nul> returning error: -1 (2)
2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: 
3184943: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
backup.tmp.1) on u50-dr1 for key <nul> returning error: -1 (2)
2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: 
3184947: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
backup.tmp.1) on u50-dr1 for key hl returning error: -1 (2)
2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: 
3184949: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
backup.tmp.1) on u50-dr1 for key hl returning error: -1 (2)
2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server: 
1109660: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
backup.tmp.1/hl) on u50-dr1 returning error: -1 (2)
2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server: 
1109661: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
backup.tmp.1/hl) on u50-dr1 returning error: -1 (2)
2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: 
3184952: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
backup.tmp.1/hl) on u50-dr1 for key <nul> returning error: -1 (2)
2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: 
3184953: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
backup.tmp.1/hl) on u50-dr1 for key <nul> returning error: -1 (2)
2009-03-14 13:30:23 E [server-protocol.c:2774:server_stub_resume] server: 
1109663: XATTROP (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
backup.tmp.1) on u50-dr1 returning error: -1 (2)
2009-03-14 13:30:23 E [server-protocol.c:2868:server_stub_resume] server: 
1109665: RMDIR (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-ba
ckup.tmp.1) on u50-dr1 returning error: -1 (2)
2009-03-14 13:31:45 E [socket.c:463:__socket_proto_validate_header] server: 
socket header signature does not match :O (42.6c.6f)
2009-03-14 13:31:45 E [socket.c:583:__socket_proto_state_machine] server: 
socket header validate failed (192.168.227.167:1022). possible mismatch of
 transport-type between server and client volumes, or version mismatch
2009-03-14 13:31:45 N [server-protocol.c:8048:notify] server: 
192.168.227.167:1022 disconnected
2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] server: readv failed 
(Connection reset by peer)
2009-03-14 13:32:22 E [socket.c:561:__socket_proto_state_machine] server: read 
(Connection reset by peer) in state 1 (192.168.227.5:1016)
2009-03-14 13:32:22 N [server-protocol.c:8048:notify] server: 
192.168.227.5:1016 disconnected
2009-03-14 13:32:23 N [server-protocol.c:7295:mop_setvolume] server: accepted 
client from 192.168.227.5:1016

I may have to move the backup in question off glusterfs (if I can just find 
the space somewhere), since it has taken 4 days to realise that the backing 
up is not just slow, but faulty.  (Of course, if I can't fix it, I'll win a 
trip to the data center to install a new machine to replace the system.)