do you have a core dump which can be inspected? do you still see the error after taking the 1.3 client off? Avati On Sat, Mar 14, 2009 at 10:42 PM, Andrew McGill <list2008 at lunch.za.net> wrote: > Hello, > > I upgraded from glusterfs-1.3.12.tar.gz to glusterfs-2.0.0rc4.tar.gz because I > could not complete a rdiff-backup without inexplicable errors. My efforts > have been rewarded with a crash, for which some logs are displayed below. > > The backend is unify, with multiple afr subvolumes of two ext3 volumes each. > > Here is how the client side of glusterfs died: > > 2009-03-14 13:32:15 W [afr-self-heal-data.c:798:afr_sh_data_fix] afr4: Picking > favorite child u100-rs1 as authentic source to resolve conflicting data > of /backup5/robbie.foo.co.za/rdiff-backup-data/mirror_metadata.2009-03-14T08:00:17+02:00.snapshot.gz > 2009-03-14 13:32:15 W [afr-self-heal-data.c:646:afr_sh_data_open_cbk] afr4: > sourcing file /backup5/robbie.foo.co.za/rdiff-backup-data/mirror_meta > data.2009-03-14T08:00:17+02:00.snapshot.gz from u100-rs1 to other sinks > 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] u50-dcc1: readv failed (Bad > address) > 2009-03-14 13:32:22 E [socket.c:634:__socket_proto_state_machine] u50-dcc1: > read (Bad address) in state 3 (192.168.227.65:6996) > 2009-03-14 13:32:22 E [saved-frames.c:169:saved_frames_unwind] u50-dcc1: > forced unwinding frame type(1) op(READ) > 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] u50-dr1: readv failed (Bad > address) > 2009-03-14 13:32:22 E [socket.c:634:__socket_proto_state_machine] u50-dr1: > read (Bad address) in state 3 (192.168.227.31:6996) > 2009-03-14 13:32:22 E [saved-frames.c:169:saved_frames_unwind] u50-dr1: forced > unwinding frame type(1) op(READ) > 2009-03-14 13:32:22 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: > 5998294: READ => -1 (Transport endpoint is not connected) > 2009-03-14 13:33:03 E [socket.c:102:__socket_rwv] u50-dr2: readv failed (Bad > address) > 2009-03-14 13:33:03 E [socket.c:634:__socket_proto_state_machine] u50-dr2: > read (Bad address) in state 3 (192.168.227.32:6996) > 2009-03-14 13:33:03 E [saved-frames.c:169:saved_frames_unwind] u50-dr2: forced > unwinding frame type(1) op(READ) > 2009-03-14 13:33:03 E [socket.c:102:__socket_rwv] u50-rs3: readv failed (Bad > address) > 2009-03-14 13:33:03 E [socket.c:634:__socket_proto_state_machine] u50-rs3: > read (Bad address) in state 3 (192.168.227.59:6996) > 2009-03-14 13:33:03 E [saved-frames.c:169:saved_frames_unwind] u50-rs3: forced > unwinding frame type(1) op(READ) > 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: > 6006118: READ => -1 (Transport endpoint is not connected) > 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: > 6006119: READ => -1 (Transport endpoint is not connected) > 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: > 6006120: READ => -1 (Transport endpoint is not connected) > 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: > 6006121: READ => -1 (Transport endpoint is not connected) > pending frames: > ?? > patchset: cb602a1d7d41587c24379cb2636961ab91446f86 + > signal received: 6 > configuration details:argp 1 > backtrace 1 > dlfcn 1 > fdatasync 1 > libpthread 1 > llistxattr 1 > setfsid 1 > spinlock 1 > epoll.h 1 > xattr.h 1 > st_atim.tv_nsec 1 > package-string: glusterfs 2.0.0rc4 > [0x381420] > /lib/libc.so.6(abort+0x101)[0xb86451] > /usr/lib/glusterfs/2.0.0rc4/xlator/mount/fuse.so[0x54b9a8] > /lib/libpthread.so.0[0xd302db] > /lib/libc.so.6(clone+0x5e)[0xc2912e] > --------- > > > On the server side, the following messages don't enlighten me, but do remind > me that there was another client running version 1.13 still connecting. It > looks like the server just noticed that the client died. > > 2009-03-14 13:30:03 E [socket.c:583:__socket_proto_state_machine] server: > socket header validate failed (192.168.227.167:1023). possible mismatch of > transport-type between server and client volumes, or version mismatch > 2009-03-14 13:30:03 N [server-protocol.c:8048:notify] server: > 192.168.227.167:1023 disconnected > 2009-03-14 13:31:45 E [socket.c:463:__socket_proto_validate_header] server: > socket header signature does not match :O (42.6c.6f) > 2009-03-14 13:31:45 E [socket.c:583:__socket_proto_state_machine] server: > socket header validate failed (192.168.227.167:1023). possible mismatch of > transport-type between server and client volumes, or version mismatch > 2009-03-14 13:31:45 N [server-protocol.c:8048:notify] server: > 192.168.227.167:1023 disconnected > 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] server: readv failed > (Connection reset by peer) > 2009-03-14 13:32:22 E [socket.c:561:__socket_proto_state_machine] server: read > (Connection reset by peer) in state 1 (192.168.227.5:1020) > 2009-03-14 13:32:22 N [server-protocol.c:8048:notify] server: > 192.168.227.5:1020 disconnected > 2009-03-14 13:32:22 N [server-protocol.c:7295:mop_setvolume] server: accepted > client from 192.168.227.5:1020 > 2009-03-14 13:35:48 N [server-protocol.c:8048:notify] server: > 192.168.227.5:1017 disconnected > 2009-03-14 13:35:48 N [server-protocol.c:8048:notify] server: > 192.168.227.5:1020 disconnected > 2009-03-14 13:35:48 N [server-helpers.c:515:server_connection_destroy] server: > destroyed connection of backup5.foo.com-23205-2009/03/14-07:10: > 52:777008-u50-dcc1 > > > On another server brick, the 25Gb volume u50-dr1-raw was full (it should have > been 50Gb like its peer). As I recall, the free space of the second volume > of AFR which does not get checked for disk space (a bug, IMHO). > > It said this, which could have led to the client-side failure a few minutes > later (the clocks are in sync): > > 2009-03-14 13:30:23 W [posix.c:773:posix_mkdir] u50-dr1-raw: mkdir > of /backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1: No > space left on device > 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server: > 1109657: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff- > backup.tmp.1) on u50-dr1 returning error: -1 (2) > 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server: > 1109658: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff- > backup.tmp.1) on u50-dr1 returning error: -1 (2) > 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: > 3184942: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff- > backup.tmp.1) on u50-dr1 for key <nul> returning error: -1 (2) > 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: > 3184943: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff- > backup.tmp.1) on u50-dr1 for key <nul> returning error: -1 (2) > 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: > 3184947: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff- > backup.tmp.1) on u50-dr1 for key hl returning error: -1 (2) > 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: > 3184949: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff- > backup.tmp.1) on u50-dr1 for key hl returning error: -1 (2) > 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server: > 1109660: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff- > backup.tmp.1/hl) on u50-dr1 returning error: -1 (2) > 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server: > 1109661: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff- > backup.tmp.1/hl) on u50-dr1 returning error: -1 (2) > 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: > 3184952: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff- > backup.tmp.1/hl) on u50-dr1 for key <nul> returning error: -1 (2) > 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: > 3184953: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff- > backup.tmp.1/hl) on u50-dr1 for key <nul> returning error: -1 (2) > 2009-03-14 13:30:23 E [server-protocol.c:2774:server_stub_resume] server: > 1109663: XATTROP (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff- > backup.tmp.1) on u50-dr1 returning error: -1 (2) > 2009-03-14 13:30:23 E [server-protocol.c:2868:server_stub_resume] server: > 1109665: RMDIR (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-ba > ckup.tmp.1) on u50-dr1 returning error: -1 (2) > 2009-03-14 13:31:45 E [socket.c:463:__socket_proto_validate_header] server: > socket header signature does not match :O (42.6c.6f) > 2009-03-14 13:31:45 E [socket.c:583:__socket_proto_state_machine] server: > socket header validate failed (192.168.227.167:1022). possible mismatch of > transport-type between server and client volumes, or version mismatch > 2009-03-14 13:31:45 N [server-protocol.c:8048:notify] server: > 192.168.227.167:1022 disconnected > 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] server: readv failed > (Connection reset by peer) > 2009-03-14 13:32:22 E [socket.c:561:__socket_proto_state_machine] server: read > (Connection reset by peer) in state 1 (192.168.227.5:1016) > 2009-03-14 13:32:22 N [server-protocol.c:8048:notify] server: > 192.168.227.5:1016 disconnected > 2009-03-14 13:32:23 N [server-protocol.c:7295:mop_setvolume] server: accepted > client from 192.168.227.5:1016 > > > I may have to move the backup in question off glusterfs (if I can just find > the space somewhere), since it has taken 4 days to realise that the backing > up is not just slow, but faulty. (Of course, if I can't fix it, I'll win a > trip to the data center to install a new machine to replace the system.) > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > http://zresearch.com/cgi-bin/mailman/listinfo/gluster-users >