Re: Disconnections and Corruption Under High Load

Gordan Bobic <gordan@xxxxxxxxxx> · Thu, 07 Jan 2010 10:44:43 +0000

Anand Avati wrote:
On Tue, Jan 5, 2010 at 5:31 PM, Gordan Bobic <gordan@xxxxxxxxxx> wrote:
I've noticed a very high incidence of the problem I reported a while back,
that manifests itself in open files getting corrupted on commit, possibly
during conditions that involve server disconnections due to timeouts (very
high disk load). Specifically, I've noticed that my .viminfo file got
corrupted for the 3rd time today. Since this is root's .viminfo, and I'm
running glfs as root, I don't have the logs to verify the disconnections,
though. From what I can tell, a chunk of a dll somehow ends up in .viminfo,
but I'm not sure which one.
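
One way to work out which file the stray chunk came from is to take a slice of the binary data out of the damaged file and search for it in candidate libraries. What follows is a rough Python sketch, not something from the original report: the function names, the 64-byte window, and the use of the first NUL byte as a marker of binary content are all assumptions made for illustration.

```python
from pathlib import Path


def first_non_text_offset(data: bytes) -> int:
    """Return the offset of the first NUL byte, or -1 if there is none.

    Assumption: a NUL byte is a crude but usually adequate sign that
    binary data has landed in a file that should be plain text.
    """
    return data.find(b"\x00")


def files_containing(chunk: bytes, candidates):
    """Yield the candidate paths whose contents include the given chunk."""
    for path in candidates:
        try:
            # Reads the whole file into memory; fine for a one-off check.
            if chunk in Path(path).read_bytes():
                yield Path(path)
        except OSError:
            # Unreadable candidates are skipped rather than treated as fatal.
            continue


def locate_source(corrupted, candidates, window: int = 64):
    """Return the candidates containing the first binary slice of `corrupted`.

    The slice starts at the first NUL byte and is at most `window` bytes long.
    An empty list means no NUL byte was found or no candidate matched.
    """
    data = Path(corrupted).read_bytes()
    start = first_non_text_offset(data)
    if start < 0:
        return []
    chunk = data[start:start + window]
    return list(files_containing(chunk, candidates))
```

Calling `locate_source()` with the path of the damaged .viminfo and a list of candidate library paths returns the candidates that contain the same bytes. A short window can match more than one file, so a larger `window` may be needed to narrow the result down.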

Can you describe the sequence of events? What kind of IO was being
performed from all clients involved? Was vi opened (on the same file?)
from multiple clients? Was some other kind of IO (rsync?) being
performed on another client at the same time?

The client I/O was relatively lightweight: normal desktop use, web
browser and mail reader open, a bunch of gnome-terminal windows. vi
wasn't opened on any of the files that were open; the only things I was
editing at the time were fstab and the gluster volume spec files. There
is only one actual client machine (the one on my desk), if we don't 
count the AFR servers (which are also each other's clients).

The load/slowness on the system was caused purely by the disks being 
slow to respond due to the RAID check all the nodes were doing.

It's all very heisenbuggy: I've seen it happen multiple times, but there
doesn't appear to be a reliably reproducible set of circumstances that 
causes it.

On a different volume, I'm seeing other weirdness under the same high disk
load conditions (software RAID check/resync on all server nodes). This seems
to be specifically related to using writebehind+iocache on the client-side
on one of the servers, exported via unfsd (the one from the gluster ftp
site). What happens is that the /home volume simply seems to disappear
underneath unfsd! The attached log indicates a glusterfsd crash.

This doesn't happen if I remove the writebehind and io-cache translators.

Other notable things about the setup that might help figure out the cause of
this:

- The other two servers are idle - they are not serving any requests. They
are, however, also under the same high disk load.

- writebehind and io-cache are only applied on one server, the one being used
to export via unfsd. The other servers do not have those translators
applied. The volume config is attached. It is called home-cache.vol, but
this is the same file the log file refers to even though it is listed there
as home.vol.

The problem specifically occurs when servers are undergoing high load of the
described nature that causes disk latencies to go up massively. I have not
observed any instances of a similar crash happening without the writebehind
and io-cache translators.

Can you send us a backtrace of the core from gdb (command: "thread
apply all bt full")?
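
For reference, the backtrace requested above can also be collected non-interactively. This is a minimal sketch that assumes gdb is installed and that the paths to the glusterfsd binary and its core file are known. The function names are hypothetical; only the gdb command string itself comes from the message.

```python
import subprocess


def gdb_backtrace_command(binary: str, core: str) -> list:
    """Build a gdb invocation that prints a full backtrace of every thread.

    -batch makes gdb exit once the commands have run, and -ex supplies
    the command to execute after the core file has been loaded.
    """
    return ["gdb", "-batch", "-ex", "thread apply all bt full", binary, core]


def collect_backtrace(binary: str, core: str) -> str:
    """Run gdb and return whatever it wrote to standard output.

    Assumption: gdb is on PATH. A non-zero exit status is not treated
    as an error here, so the caller should inspect the returned text.
    """
    result = subprocess.run(
        gdb_backtrace_command(binary, core),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

The binary passed in should be the same build that produced the core, ideally with debugging symbols available, otherwise the "full" part of the backtrace will show little beyond addresses.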

Will do.

Gordan