Crash on HA config when restoring a server

Kevan Benson <kbenson@xxxxxxxxxxxxxxx> · Wed, 8 Aug 2007 11:02:27 -0700

When running an HA config with 2 servers and 2 clients, I can consistently 
crash the active server after failing the other.  This is on the TLA version 
patched to 440.

System configs at http://glusterfs.pastebin.com/m52564c56
Server A: 172.16.1.81
Server B: 172.16.1.82
Client A: 172.16.1.85
Client B: 172.16.1.86
Note: Client transport-timeout (on clients and servers) was set to 10 in the 
first two crashes, and set to 30 on Clients A and B in the last one (servers 
still had it set to 10).

For the first crash, I fail server B (ifdown eth1), and then try to ls the 
mount point (time ls -l /mnt/glusterfs) from both clients.  I generally get 
an "ls: /mnt/glusterfs/: Transport endpoint is not connected" error once or 
twice, and then the active server's (A) glusterfsd will either start 
responding or crash (about 50% chance).  In this case, I had restored 
network connectivity to server B and run a few more ls's from the clients.

The glusterfsd.log (including backtrace) is at 
http://glusterfs.pastebin.com/m15d7f914
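
For anyone trying to reproduce this, the client-side check described above 
(repeating "ls -l" on the mount point and noting errors and timing) could be 
scripted roughly as follows.  This is only a sketch, not what was actually 
run: the function name probe_mount, the attempt count and the timeout value 
are assumptions made for illustration.

```python
import subprocess
import time


def probe_mount(path, attempts=5, timeout=60):
    """Run `ls -l` on `path` several times; record exit status, time, stderr.

    Hypothetical helper for illustration only. Note that a process stuck in
    uninterruptible I/O on a hung mount may not honor the timeout.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            proc = subprocess.run(
                ["ls", "-l", path],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            status, detail = proc.returncode, proc.stderr.strip()
        except subprocess.TimeoutExpired:
            # No exit status is available when the command is killed.
            status, detail = None, "timed out"
        elapsed = round(time.monotonic() - start, 2)
        results.append((status, elapsed, detail))
    return results
```

Calling probe_mount("/mnt/glusterfs") on each client after failing a server 
would return one (status, seconds, stderr) tuple per attempt, so a 
"Transport endpoint is not connected" error shows up in the third field.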

Upon restarting glusterfs on server A and restoring the network connection to 
server B, I initiated the above ls from the clients and crashed server A's 
glusterfsd again.  Glusterfsd on Server B was never restarted; it was only 
failed by lack of connectivity.

The glusterfsd.log (including backtrace) for THIS crash is at 
http://glusterfs.pastebin.com/m28ee8e5a

Here's a crash from doing an ls with one server failed, after restarting one 
of the servers a few times.

The glusterfsd.log (including backtrace): 
http://glusterfs.pastebin.com/m2ee6c471

All logs shown are from the crashing server, Server A.  I can just as easily 
crash server B by failing A.  Let me know if you need more logs from other 
hosts and I'll re-run whichever scenarios you like.

-- 
- Kevan Benson
- A-1 Networks