Re: gluster 3.0.0 catastrophic crash during basic file creation test

Vijay Bellur <vijay@xxxxxxxxxxx> · Thu, 04 Feb 2010 22:12:07 +0530

Hello Daniel,

Do you see anything in dmesg when the servers freeze?

If you can salvage the dmesg output and /var/log/messages from the 
console, along with the glusterfsd core, that would help.
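In case it is useful, here is a rough sketch of how those pieces could be gathered into one directory while the console still responds. The function name collect_diag, the MESSAGES_LOG override and the output layout are only assumptions for illustration, and the core file path has to be supplied by hand since it depends on the local setup.

```shell
#!/bin/bash
# Hypothetical helper (not part of GlusterFS): copy whatever diagnostics
# are still reachable into a single directory.
collect_diag() {
    local out="$1"        # destination directory
    local core="${2:-}"   # optional path to a glusterfsd core file
    # Assumed override so a different syslog path can be used.
    local messages="${MESSAGES_LOG:-/var/log/messages}"

    mkdir -p "$out" || return 1

    # Kernel ring buffer. This may need root; any error text ends up in
    # the file instead of aborting the collection.
    dmesg > "$out/dmesg.txt" 2>&1 || true

    # System log, only if it exists and is readable.
    if [ -r "$messages" ]; then
        cp "$messages" "$out/messages.txt"
    fi

    # Core file, only if a path was given and the file exists.
    if [ -n "$core" ] && [ -f "$core" ]; then
        cp "$core" "$out/"
    fi
    return 0
}
```

It would be called as collect_diag DIRECTORY [COREFILE], ideally with a directory on local disk rather than on the Gluster mount.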

Regards,
Vijay

Daniel Maher wrote:

Hello,

I managed to crash Gluster 3.0.0 severely during a simple file 
creation test.  Not only did the crash result in the standard « 
transport endpoint not connected » problem, but the servers in 
question had to be hard-reset in order to make them operational again.

So, here goes...

Four nodes : two servers and two clients, with client-side replication. 
Clients are Fedora 8, servers are Fedora 9.  Stock FUSE used throughout. 
Configurations were generated with the volgen tool using the following 
command line :

# glusterfs-volgen --name replicated --raid 1 s01:/opt/gluster s02:/opt/gluster

Servers :
# service glusterfsd start

Clients :
# mount -t glusterfs /etc/glusterfs/replicated-tcp.vol /opt/gluster/

The following Python script was used to run the file creation test :
http://nfsv4.bullopensource.org/tools/tests_tools/test_files.py

The Python script was edited only to point the target directory to the 
Gluster mount.  Each client was told to use a different sub-directory 
within the Gluster mount point.

This script was used in the context of a bash looping script, which is 
as follows :
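#!/bin/bash
# Run the file creation test 1000 times, appending output to one log.
LOOP=0
while [ "$LOOP" -lt 1000 ]
do
    # Time one run; its stdout is appended to the log.
    time ./test_files.py | tee -a go_test_files.log
    # Append the results file, which test_files.py overwrites each run.
    cat ./test_files_orw | tee -a go_test_files.log
    LOOP=$((LOOP + 1))
done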

« test_files_orw » is the file that test_files.py outputs to.  It is 
over-written on each run (hence the cat into the log after each run).

The script made it through 20 or so iterations before Gluster crashed. 
The servers responded to ping requests, but no new SSH connections 
could be made.  Existing sessions open via SSH were frozen.  On the 
local console, keyboard interactions were still possible, but no new 
actions could be taken.  The servers were hard-reset at this point.
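For a future run, the exact iteration at which things stop could be pinned down with a loop that records a numbered, timestamped line before and after each run.  What follows is only a sketch : the function name run_logged_loop and the log format are made up for illustration, and the log should live on local disk rather than on the Gluster mount.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command up to MAX times, logging the
# iteration number, a timestamp and the exit status of every run.
run_logged_loop() {
    local max="$1" log="$2"
    shift 2               # the remaining arguments are the command to run
    local i=0 status

    while [ "$i" -lt "$max" ]; do
        echo "iteration $i start $(date '+%Y-%m-%d %H:%M:%S')" >> "$log"
        # Run the command, capturing its output and its exit status.
        if "$@" >> "$log" 2>&1; then
            status=0
        else
            status=$?
        fi
        echo "iteration $i exit $status" >> "$log"
        # Stop at the first failing run so the log ends at the problem.
        if [ "$status" -ne 0 ]; then
            return "$status"
        fi
        i=$((i + 1))
    done
    return 0
}
```

With the test above, this would be invoked as run_logged_loop 1000 LOGFILE ./test_files.py.  If the machine freezes mid-run, the last « start » line without a matching « exit » line marks the iteration that was in progress, provided that line reached the disk before the reset.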

I'll be happy to provide any further information that is needed - just 
let me know.