Error report: glusterfs2.0rc4 abend -- "readv failed (Bad address) "

list2008 at lunch.za.net (Andrew McGill) · Wed, 18 Mar 2009 15:27:08 +0200

Replying to myself, since nobody else did:

I'm outta here for now.  Whether by network errors or software errors, or 
plain old stupidity, I can longer maintain a stable glusterfs mount.  Due to 
the batch nature of the application I'm running (rdiff-backup), this is a 
fatal failure.  I've now decommissioned my failing glusterfs installation.  
It's a pity though - although it was horribly slow because of inappropriate 
hardware, it did work (until it flaked out fatally).  

For decommissioning, it turns out that having your data stored without 
metadata is a good design choice.  

Currently glusterfs needs some work in terms of recovering from network and 
server errors.  (I think that the developers should be made to run it over 
10Mbps ethernet hubs for a month.)  Currently glusterfs is a high capacity 
(given appropriate hardware), but not a high availability solution. 

On Saturday 14 March 2009 19:12:03 Andrew McGill wrote:
> Hello,
>
> I upgraded from glusterfs-1.3.12.tar.gz to glusterfs-2.0.0rc4.tar.gz
> because I could not complete a rdiff-backup without inexplicable errors. 
> My efforts have been rewarded with a crash, for which some logs are
> displayed below.
>
> The backend is unify, with multiple afr subvolumes of two ext3 volumes
> each.
>
> Here is how the client side of glusterfs died:
>
> 2009-03-14 13:32:15 W [afr-self-heal-data.c:798:afr_sh_data_fix] afr4:
> Picking favorite child u100-rs1 as authentic source to resolve conflicting
> data of
> /backup5/robbie.foo.co.za/rdiff-backup-data/mirror_metadata.2009-03-14T08:0
>0:17+02:00.snapshot.gz 2009-03-14 13:32:15 W
> [afr-self-heal-data.c:646:afr_sh_data_open_cbk] afr4: sourcing file
> /backup5/robbie.foo.co.za/rdiff-backup-data/mirror_meta
> data.2009-03-14T08:00:17+02:00.snapshot.gz from u100-rs1 to other sinks
> 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] u50-dcc1: readv failed
> (Bad address)
> 2009-03-14 13:32:22 E [socket.c:634:__socket_proto_state_machine] u50-dcc1:
> read (Bad address) in state 3 (192.168.227.65:6996)
> 2009-03-14 13:32:22 E [saved-frames.c:169:saved_frames_unwind] u50-dcc1:
> forced unwinding frame type(1) op(READ)
> 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] u50-dr1: readv failed
> (Bad address)
> 2009-03-14 13:32:22 E [socket.c:634:__socket_proto_state_machine] u50-dr1:
> read (Bad address) in state 3 (192.168.227.31:6996)
> 2009-03-14 13:32:22 E [saved-frames.c:169:saved_frames_unwind] u50-dr1:
> forced unwinding frame type(1) op(READ)
> 2009-03-14 13:32:22 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse:
> 5998294: READ => -1 (Transport endpoint is not connected)
> 2009-03-14 13:33:03 E [socket.c:102:__socket_rwv] u50-dr2: readv failed
> (Bad address)
> 2009-03-14 13:33:03 E [socket.c:634:__socket_proto_state_machine] u50-dr2:
> read (Bad address) in state 3 (192.168.227.32:6996)
> 2009-03-14 13:33:03 E [saved-frames.c:169:saved_frames_unwind] u50-dr2:
> forced unwinding frame type(1) op(READ)
> 2009-03-14 13:33:03 E [socket.c:102:__socket_rwv] u50-rs3: readv failed
> (Bad address)
> 2009-03-14 13:33:03 E [socket.c:634:__socket_proto_state_machine] u50-rs3:
> read (Bad address) in state 3 (192.168.227.59:6996)
> 2009-03-14 13:33:03 E [saved-frames.c:169:saved_frames_unwind] u50-rs3:
> forced unwinding frame type(1) op(READ)
> 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse:
> 6006118: READ => -1 (Transport endpoint is not connected)
> 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse:
> 6006119: READ => -1 (Transport endpoint is not connected)
> 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse:
> 6006120: READ => -1 (Transport endpoint is not connected)
> 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse:
> 6006121: READ => -1 (Transport endpoint is not connected)
> pending frames:
> ?
> patchset: cb602a1d7d41587c24379cb2636961ab91446f86 +
> signal received: 6
> configuration details:argp 1
> backtrace 1
> dlfcn 1
> fdatasync 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 2.0.0rc4
> [0x381420]
> /lib/libc.so.6(abort+0x101)[0xb86451]
> /usr/lib/glusterfs/2.0.0rc4/xlator/mount/fuse.so[0x54b9a8]
> /lib/libpthread.so.0[0xd302db]
> /lib/libc.so.6(clone+0x5e)[0xc2912e]
> ---------
>
>
> On the server side, the following messages don't enlighten me, but do
> remind me that there was another client running version 1.13 still
> connecting.  It looks like the server just noticed that the client died.
>
> 2009-03-14 13:30:03 E [socket.c:583:__socket_proto_state_machine] server:
> socket header validate failed (192.168.227.167:1023). possible mismatch of
>  transport-type between server and client volumes, or version mismatch
> 2009-03-14 13:30:03 N [server-protocol.c:8048:notify] server:
> 192.168.227.167:1023 disconnected
> 2009-03-14 13:31:45 E [socket.c:463:__socket_proto_validate_header] server:
> socket header signature does not match :O (42.6c.6f)
> 2009-03-14 13:31:45 E [socket.c:583:__socket_proto_state_machine] server:
> socket header validate failed (192.168.227.167:1023). possible mismatch of
>  transport-type between server and client volumes, or version mismatch
> 2009-03-14 13:31:45 N [server-protocol.c:8048:notify] server:
> 192.168.227.167:1023 disconnected
> 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] server: readv failed
> (Connection reset by peer)
> 2009-03-14 13:32:22 E [socket.c:561:__socket_proto_state_machine] server:
> read (Connection reset by peer) in state 1 (192.168.227.5:1020)
> 2009-03-14 13:32:22 N [server-protocol.c:8048:notify] server:
> 192.168.227.5:1020 disconnected
> 2009-03-14 13:32:22 N [server-protocol.c:7295:mop_setvolume] server:
> accepted client from 192.168.227.5:1020
> 2009-03-14 13:35:48 N [server-protocol.c:8048:notify] server:
> 192.168.227.5:1017 disconnected
> 2009-03-14 13:35:48 N [server-protocol.c:8048:notify] server:
> 192.168.227.5:1020 disconnected
> 2009-03-14 13:35:48 N [server-helpers.c:515:server_connection_destroy]
> server: destroyed connection of backup5.foo.com-23205-2009/03/14-07:10:
> 52:777008-u50-dcc1
>
>
> On another server brick, the 25Gb volume u50-dr1-raw was full (it should
> have been 50Gb like its peer).  As I recall, the free space of the second
> volume of AFR which does not get checked for disk space (a bug, IMHO).
>
> It said this, which could have led to the client-side failure a few minutes
> later (the clocks are in sync):
>
> 2009-03-14 13:30:23 W [posix.c:773:posix_mkdir] u50-dr1-raw: mkdir
> of /backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1: No
> space left on device
> 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server:
> 1109657: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
> backup.tmp.1) on u50-dr1 returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server:
> 1109658: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
> backup.tmp.1) on u50-dr1 returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server:
> 3184942: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
> backup.tmp.1) on u50-dr1 for key <nul> returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server:
> 3184943: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
> backup.tmp.1) on u50-dr1 for key <nul> returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server:
> 3184947: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
> backup.tmp.1) on u50-dr1 for key hl returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server:
> 3184949: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
> backup.tmp.1) on u50-dr1 for key hl returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server:
> 1109660: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
> backup.tmp.1/hl) on u50-dr1 returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server:
> 1109661: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
> backup.tmp.1/hl) on u50-dr1 returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server:
> 3184952: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
> backup.tmp.1/hl) on u50-dr1 for key <nul> returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server:
> 3184953: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
> backup.tmp.1/hl) on u50-dr1 for key <nul> returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:2774:server_stub_resume] server:
> 1109663: XATTROP (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-
> backup.tmp.1) on u50-dr1 returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:2868:server_stub_resume] server:
> 1109665: RMDIR (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-ba
> ckup.tmp.1) on u50-dr1 returning error: -1 (2)
> 2009-03-14 13:31:45 E [socket.c:463:__socket_proto_validate_header] server:
> socket header signature does not match :O (42.6c.6f)
> 2009-03-14 13:31:45 E [socket.c:583:__socket_proto_state_machine] server:
> socket header validate failed (192.168.227.167:1022). possible mismatch of
>  transport-type between server and client volumes, or version mismatch
> 2009-03-14 13:31:45 N [server-protocol.c:8048:notify] server:
> 192.168.227.167:1022 disconnected
> 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] server: readv failed
> (Connection reset by peer)
> 2009-03-14 13:32:22 E [socket.c:561:__socket_proto_state_machine] server:
> read (Connection reset by peer) in state 1 (192.168.227.5:1016)
> 2009-03-14 13:32:22 N [server-protocol.c:8048:notify] server:
> 192.168.227.5:1016 disconnected
> 2009-03-14 13:32:23 N [server-protocol.c:7295:mop_setvolume] server:
> accepted client from 192.168.227.5:1016
>
>
> I may have to move the backup in question off glusterfs (if I can just find
> the space somewhere), since it has taken 4 days to realise that the backing
> up is not just slow, but faulty.  (Of course, if I can't fix it, I'll win a
> trip to the data center to install a new machine to replace the system.)
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://zresearch.com/cgi-bin/mailman/listinfo/gluster-users