Hi,

I found a somewhat frustrating test result after the weekend. I started a bonnie on four different clients (so a total of four bonnies in parallel). I have two servers, each with two partitions, which are unified and AFR'd "over cross", so each server holds one brick plus the mirror of the other server's brick, using tla patch-628.

For one thing, the results do not seem too promising, as it took more than 48 hours to complete. Running a bonnie on only one client took "only" about 12 hours (unfortunately, I don't have exact numbers for the runtime).

But even worse, two of the bonnies didn't finish at all. The first client dropped out after approx. 8 hours, claiming "Can't open file ./Bonnie.17791.001". However, the file is (partly) there, also on the AFR mirror, but with different sizes. The log suggests that it was a timeout problem (if I interpret it correctly):

2008-01-06 03:48:10 E [afr.c:3364:afr_close_setxattr_cbk] afr1: (path=/Bonnie.17791.027 child=fsc1) op_ret=-1 op_errno=28
2008-01-06 03:50:34 W [client-protocol.c:209:call_bail] ns1: activating bail-out. pending frames = 1. last sent = 2008-01-06 03:48:17 . last received = 2008-01-06 03:48:17 transport-timeout = 108
2008-01-06 03:50:34 C [client-protocol.c:217:call_bail] ns1: bailing transport
2008-01-06 03:50:34 W [client-protocol.c:4490:client_protocol_cleanup] ns1: cleaning up state in transport object 0x522e40
2008-01-06 03:50:34 E [client-protocol.c:4542:client_protocol_cleanup] ns1: forced unwinding frame type(1) op(5) reply=@0x2aaaab407a00
2008-01-06 03:50:34 E [afr.c:2573:afr_selfheal_lock_cbk] afrns: (path=/Bonnie.17791.001 child=ns1) op_ret=-1 op_errno=107
2008-01-06 03:50:34 E [afr.c:2744:afr_open] afrns: self heal failed, returning EIO
2008-01-06 03:50:34 C [tcp.c:81:tcp_disconnect] ns1: connection disconnected
2008-01-06 03:51:00 E [afr.c:1907:afr_selfheal_sync_file_writev_cbk] afr1: (path=/Bonnie.17791.001 child=fsc1) op_ret=-1 op_errno=28
2008-01-06 03:51:00 E [afr.c:1693:afr_error_during_sync] afr1: error during self-heal
2008-01-06 03:51:03 E [afr.c:2744:afr_open] afr1: self heal failed, returning EIO
2008-01-06 03:51:03 E [fuse-bridge.c:670:fuse_fd_cbk] glusterfs-fuse: 12276158: /Bonnie.17791.001 => -1 (5)
2008-01-07 04:40:17 E [fuse-bridge.c:431:fuse_entry_cbk] glusterfs-fuse: 15841600: /Bonnie.26672.026 => -1 (2)

The second had a problem creating/removing a directory:

Create files in sequential order...Can't make directory ./Bonnie.26672
Cleaning up test directory after error.
Bonnie: drastic I/O error (rmdir): No such file or directory

On this client, nothing is to be found in the logs. In both cases, there is nothing in the server logs either (neither servers nor clients had any special debug level enabled).

Now, the million-dollar question is: how would I debug this situation, preferably a bit quicker than 48 hours...

Thanks,
Sascha
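
P.S. In case the "over cross" description above is unclear, the client spec looks roughly like the sketch below. The volume names afr1, afrns, fsc1 and ns1 are the ones from the log; the hostnames, remote subvolume names and the second set of volume names are just placeholders for illustration, not a copy of my actual spec file:

  # protocol/client volumes to the four data bricks (two per server)
  volume fsc1                      # brick 1, primary copy on server A
    type protocol/client
    option transport-type tcp/client
    option remote-host serverA     # placeholder hostname
    option remote-subvolume brick1
  end-volume

  volume fsc2                      # mirror of brick 1, on server B
    type protocol/client
    option transport-type tcp/client
    option remote-host serverB     # placeholder hostname
    option remote-subvolume brick1-mirror
  end-volume

  # fsc3/fsc4 are defined the same way for the second brick,
  # with primary copy on server B and mirror on server A;
  # ns1/ns2 likewise point to the namespace bricks on both servers

  volume afr1                      # first brick, mirrored across both servers
    type cluster/afr
    subvolumes fsc1 fsc2
  end-volume

  volume afr2                      # second brick, mirrored the other way round
    type cluster/afr
    subvolumes fsc3 fsc4
  end-volume

  volume afrns                     # mirrored namespace for unify
    type cluster/afr
    subvolumes ns1 ns2
  end-volume

  volume unify0                    # unify over the two AFR pairs
    type cluster/unify
    option namespace afrns
    option scheduler rr
    subvolumes afr1 afr2
  end-volume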