Re: Disconnections and Corruption Under High Load

Anand Avati <anand.avati@xxxxxxxxx> · Thu, 7 Jan 2010 09:57:19 +0530

On Tue, Jan 5, 2010 at 5:31 PM, Gordan Bobic <gordan@xxxxxxxxxx> wrote:
> I've noticed a very high incidence of the problem I reported a while back,
> that manifests itself in open files getting corrupted on commit, possibly
> during conditions that involve server disconnections due to timeouts (very
> high disk load). Specifically, I've noticed that my .viminfo file got
> corrupted for the 3rd time today. Since this is root's .viminfo, and I'm
> running glfs as root, I don't have the logs to verify the disconnections,
> though. From what I can tell, a chunk of a dll somehow ends up in .viminfo,
> but I'm not sure which one.

Can you describe the sequence of events? What kind of IO was being
performed from all clients involved? Was vi opened (on the same file?)
from multiple clients? Was some other kind of IO (rsync?) being
performed on another client at the same time?

> On a different volume, I'm seeing other weirdness under the same high disk
> load conditions (software RAID check/resync on all server nodes). This seems
> to be specifically related to using writebehind+iocache on the client-side
> on one of the servers, exported via unfsd (the one from the gluster ftp
> site). What happens is that the /home volume simply seems to disappear
> underneath unfsd! The attached log indicates a glusterfsd crash.
>
> This doesn't happen if I remove the writebehind and io-cache translators.
>
> Other notable things about the setup that might help figure out the cause of
> this:
>
> - The other two servers are idle - they are not serving any requests. They
> are, however, also under the same high disk load.
>
> - writebehind and io-cache are only applied on one server, the one being used
> to export via unfsd. The other servers do not have those translators
> applied. The volume config is attached. It is called home-cache.vol, but
> this is the same file the log file refers to even though it is listed there
> as home.vol.
>
> The problem specifically occurs when servers are undergoing high load of the
> described nature that causes disk latencies to go up massively. I have not
> observed any instances of a similar crash happening without the writebehind
> and io-cache translators.

Can you send us a backtrace of the core from gdb (command: "thread
apply all bt full")?

Thanks,
Avati