Complete machine lockup, v3.4.2

Laurent Chouinard <laurent.chouinard@xxxxxxxxxxx> · Tue, 18 Feb 2014 08:12:33 -0500

Hi everyone,

We’ve been using 3.3.2 for a while, and recently started migrating to 3.4.2. We run 3.4.2 on CentOS 6.5 (3.3.2 was installed on CentOS 6.4).

Recently, we had a very scary condition occur, and we do not know exactly what caused it.

We have a 3-node cluster with a replication factor of 3. Each node has one brick, which sits on a single RAID0 volume made up of multiple SSDs.

Following some read/write errors, nodes 2 and 3 locked up completely. Nothing could be done locally or remotely (nothing on the screen, no response over SSH), so a physical power cycle was required. Node 1 was still accessible, but its FUSE client rejected most, if not all, reads and writes.

Has anyone experienced something similar? 

Before the system froze, the last thing the kernel seemed to be doing was killing httpd threads (INFO: task httpd:7910 blocked for more than 120 seconds.). End users talk to Apache in order to read from and write to the Gluster volume, so it looks like a simple case of “something wrong” with Gluster that blocks reads and writes, and eventually the kernel kills the stuck threads.

At this point, we’re unsure where to look. We found nothing very specific in the logs, but if someone has pointers on what to look for, that could give us a new lead.
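
As a starting point for that search, here is a minimal sketch that scans a syslog file for messages in the same format as the "blocked for more than 120 seconds" line quoted above and counts them per task name. The function names (find_hung_tasks, summarize) are made up for illustration, and the log path is an assumption: on a stock CentOS 6 install the kernel messages normally end up in /var/log/messages, but that depends on the syslog configuration.

```python
import re
import sys
from collections import Counter

# Matches kernel messages of the form quoted in this post, e.g.
# "INFO: task httpd:7910 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (task_name, pid, seconds) for every matching log line."""
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (match.group("name"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits


def summarize(hits):
    """Count matching messages per task name."""
    return Counter(name for name, _pid, _secs in hits)


# Only runs when a log path is given on the command line (hypothetical usage):
#   python hung_tasks.py /var/log/messages
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as log_file:
        found = find_hung_tasks(log_file)
    for task_name, count in summarize(found).most_common():
        print(f"{task_name}: {count} message(s)")
```

If tasks other than httpd show up in the output (for example the Gluster brick or client processes), the timestamps of their first messages might help narrow down where the stall began.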

Thanks

Laurent Chouinard
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users