Gareth,

See if frequent flushing by pdflush, rather than letting writes aggregate,
changes the situation. Here is a link with some interesting tips:

http://www.westnet.com/~gsmith/content/linux-pdflush.htm

avati

2008/1/9, Gareth Bult <gareth@xxxxxxxxxxxxx>:
>
> Ok, this looks like being a Xen/kernel issue, as I can reproduce it
> without actually "using" the glusterfs, even though it's there and
> mounted.
>
> I've included my Xen mailing list post here, as this problem could well
> affect anyone else using gluster and Xen. It's a bit nasty in that it
> becomes more frequent the less memory you have, so the more Xen
> instances you add, the more unstable your server becomes.
>
> (And I'm fairly convinced gluster is "the" FS to use with Xen,
> especially once the current feature requests are processed.)
>
> :)
>
> Regards,
> Gareth.
>
> -----------
>
> Posting to the Xen list:
>
> Ok, I've been chasing this for many days. I have a server running 10
> instances that periodically freezes, then sometimes "comes back".
>
> I tried many things to try to spot the problem and finally found it by
> accident. It's a little frustrating, as typically the Dom0 and one (or
> two) instances "go" while the rest carry on, and there is diddly-squat
> in the way of logging information or error messages.
>
> I'm now using 'watch "cat /proc/meminfo"' in the Dom0.
> I watch the Dirty figure increase, and occasionally decrease.
>
> In an instance (this is just an easy way to reproduce it quickly), do:
>
>   dd if=/dev/zero of=/tmp/bigfile bs=1M count=1000
>
> Watch "Dirty" rise, and at some point you'll see "Writeback" cut in.
> All looks good.
>
> Give it a few seconds and your "watch" of /proc/meminfo will freeze.
> On my system "Dirty" will at this point be reading about 500 MB, and
> "Writeback" will have gone down to zero.
> "xm list" in another session will confirm that you have a major problem
> (it will hang).
>
> For some reason pdflush is not working properly!
> Run "sync" in another shell and the machine instantly jumps back to life!
>
> I'm running a stock Ubuntu Xen 3.1 kernel.
> File-backed Xen instances, typically 5 GB with 1 GB swap.
> Dual dual-core 2.8 GHz Xeons (4 cores in total) with 6 GB RAM.
> Twin 500 GB SATA HDDs (software RAID1).
>
> To my way of thinking(!), when it runs out of memory it should force a
> sync (or similar), and it's not; it's just sitting there. If I wait for
> the dirty_expire_centisecs timer to expire I may get some life back, but
> some instances will survive and some will have hung.
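The dirty_expire_centisecs timer mentioned just above is one of the
vm.dirty_* sysctls that govern pdflush, and avati's "flush frequently
rather than aggregate" suggestion at the top of the thread usually
translates into lowering them. A minimal sketch, run as root in the Dom0;
the values are illustrative assumptions, not settings taken from this
thread or the linked article:

    # Start background writeback at 1% of memory dirty instead of the
    # typical 2.6-era default of 10%, and block writers at 5% instead
    # of 40%. Values are illustrative, not tested recommendations.
    sysctl -w vm.dirty_background_ratio=1
    sysctl -w vm.dirty_ratio=5

    # Treat dirty pages as expired after 10 s (default 30 s) and wake
    # pdflush every 1 s (default 5 s).
    sysctl -w vm.dirty_expire_centisecs=1000
    sysctl -w vm.dirty_writeback_centisecs=100

The same settings can be made persistent in /etc/sysctl.conf. Smaller
thresholds will not cure a wedged writeback path, but they shrink the pile
of dirty pages that has to drain when it stalls.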
>
> Here's a working "meminfo":
>
> MemTotal:       860160 kB
> MemFree:         22340 kB
> Buffers:         49372 kB
> Cached:         498416 kB
> SwapCached:      15096 kB
> Active:          92452 kB
> Inactive:       491840 kB
> SwapTotal:     4194288 kB
> SwapFree:      4136916 kB
> Dirty:            3684 kB
> Writeback:           0 kB
> AnonPages:       29104 kB
> Mapped:          13840 kB
> Slab:            45088 kB
> SReclaimable:    25304 kB
> SUnreclaim:      19784 kB
> PageTables:       2440 kB
> NFS_Unstable:        0 kB
> Bounce:              0 kB
> CommitLimit:   4624368 kB
> Committed_AS:   362012 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed:      3144 kB
> VmallocChunk: 34359735183 kB
>
> Here's one where "xm list" hangs, but my "watch" is still updating the
> /proc/meminfo display:
>
> MemTotal:       860160 kB
> MemFree:         13756 kB
> Buffers:         53656 kB
> Cached:         502420 kB
> SwapCached:      14812 kB
> Active:          84356 kB
> Inactive:       507624 kB
> SwapTotal:     4194288 kB
> SwapFree:      4136900 kB
> Dirty:          213096 kB
> Writeback:           0 kB
> AnonPages:       28832 kB
> Mapped:          13924 kB
> Slab:            45988 kB
> SReclaimable:    25728 kB
> SUnreclaim:      20260 kB
> PageTables:       2456 kB
> NFS_Unstable:        0 kB
> Bounce:              0 kB
> CommitLimit:   4624368 kB
> Committed_AS:   361796 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed:      3144 kB
> VmallocChunk: 34359735183 kB
>
> Here's a frozen one:
>
> MemTotal:       860160 kB
> MemFree:         15840 kB
> Buffers:          2208 kB
> Cached:         533048 kB
> SwapCached:       7956 kB
> Active:          49992 kB
> Inactive:       519916 kB
> SwapTotal:     4194288 kB
> SwapFree:      4136916 kB
> Dirty:          505112 kB
> Writeback:        3456 kB
> AnonPages:       34676 kB
> Mapped:          14436 kB
> Slab:            64508 kB
> SReclaimable:    18624 kB
> SUnreclaim:      45884 kB
> PageTables:       2588 kB
> NFS_Unstable:        0 kB
> Bounce:              0 kB
> CommitLimit:   4624368 kB
> Committed_AS:   368064 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed:      3144 kB
> VmallocChunk: 34359735183 kB
>
> Help!!!
>
> Gareth.
>
> --
> Managing Director, Encryptec Limited
> Tel: 0845 25 77033, Mob: 07853 305393, Int: 00 44 1443205756
> Email: gareth@xxxxxxxxxxxxx
> Statements made are at all times subject to Encryptec's Terms and
> Conditions of Business, which are available upon request.
>
> ----- Original Message -----
> From: "Gareth Bult" <gareth@xxxxxxxxxxxxx>
> To: "gluster-devel" <gluster-devel@xxxxxxxxxx>
> Sent: Wednesday, January 9, 2008 3:40:49 PM (GMT) Europe/London
> Subject: Major lock-up problem
>
> Hi,
>
> I've been developing a new system (which is now "live", hence the lack
> of debug information) and have been experiencing lots of inexplicable
> lock-up and pause problems with lots of different components; I've been
> working my way through the systems, removing and fixing problems as I go.
>
> I seem to have a problem with gluster that I can't nail down.
>
> When hitting the server with sustained (typically multi-file) writes,
> after a while the server goes into "D" state.
> If I have io-threads running on the server, only ONE process goes into
> "D" state.
>
> The trouble is, it stays in "D" state and starts to lock up other
> processes; a favourite is "vi".
>
> The funny thing is, the machine is a Xen server (glusterfsd runs in the
> Dom0), yet the Xen instances NOT using gluster are not affected.
> Some of the instances using the glusterfs are affected, depending on
> whether io-threads is used on the server.
>
> If I'm lucky, I kill the IO process and 5 minutes later the machine
> springs back to life.
> If I'm not, I reboot.
>
> Anyone have any ideas?
>
> glfs7 and tla.
>
> Gareth.
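The three dumps above differ mainly in two fields: roughly 3.6 MB Dirty
when healthy, about 208 MB once "xm list" hangs, and about 493 MB with
Writeback nearly idle when frozen. A minimal timestamped poll of just
those two fields (my own rephrasing of the 'watch "cat /proc/meminfo"'
technique from the first message, not a command from the thread) makes
the moment of the freeze obvious, because the timestamps stop advancing
when the box wedges:

    # Print the time plus the Dirty and Writeback counters once a second.
    while sleep 1; do
        printf '%s  ' "$(date +%T)"
        grep -E '^(Dirty|Writeback):' /proc/meminfo | tr '\n' ' '
        echo
    done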
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel

--
If I traveled to the end of the rainbow
As Dame Fortune did intend,
Murphy would be there to tell me
The pot's at the other end.
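Finally, since a manual "sync" reportedly brings the machine straight
back in both reports, a stop-gap (emphatically not a fix) would be a
watchdog in the Dom0 that forces writeback before Dirty grows dangerous.
A rough sketch; the 200 MB threshold and 5-second poll interval are
arbitrary illustrative choices:

    #!/bin/sh
    # Hypothetical watchdog: force writeback when dirty pages pile up.
    # The threshold (200 MB) and interval (5 s) are illustrative only.
    THRESHOLD_KB=204800
    while sleep 5; do
        dirty_kb=$(awk '/^Dirty:/ { print $2 }' /proc/meminfo)
        if [ "$dirty_kb" -gt "$THRESHOLD_KB" ]; then
            logger "dirty=${dirty_kb} kB exceeds ${THRESHOLD_KB} kB, syncing"
            sync
        fi
    done

One caveat: if the writeback path is already wedged, the sync itself can
block in "D" state, so this only helps in the window where a flush still
succeeds.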