Re: stuck processes on GFS partition?

Matt Brookover <mbrookov@xxxxxxxxx> · Thu, 15 Dec 2005 14:54:14 -0700

I was the first to get a process stuck in a device wait.  I created a directory in the root of the file system and then tried to do an ls.  The ls got stuck.  From the looks of the logs, the problems had started the day before, but went unnoticed until I did an ls.  The new directory worked from other nodes that had mounted that GFS file system.  

Unfortunately, I do not believe that the server was doing much of anything at the time.  There were a few users, mostly reading email, and not using the file system that had the problem.  The partition in question is used for mail lists and as a dumping ground for backups for 6 other servers.  The backups were not running at the time the first gfs_releasepage() message was logged.  The mail lists are just test lists and not in use yet.  The backups transfer about 10GB of data in 12 to 15 files between 3am and 5am every day.  The backups are transferred by scp (the only path through a firewall).  The backups that night ran without any problems, both the copy from the remote servers and the copy of that file system to tape.

If/when it happens again, I will try to have a better idea of what was going on at the time.
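
For catching it sooner next time, something along these lines could list processes sitting in device wait (state D) on a node. This is only a rough sketch, assuming a Linux /proc layout; the function names parse_stat and stuck_processes are made up for illustration and are not part of GFS or any cluster tool.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state


def stuck_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D (device wait)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while scanning, or stat was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in stuck_processes():
        print(pid, comm)
```

Run from cron every few minutes with the output appended to a log, it would give a timestamped record of when the first process got stuck. A process that is in state D only briefly is normal, so only entries that persist across several runs would be of interest.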

The server in question had been up for over 30 days when the problem started.

Thank you

Matt

On Thu, 2005-12-15 at 14:24, Andrew C. Dingman wrote:

On Mon, 2005-12-12 at 14:51 -0700, Matt Brookover wrote:
> This looks like a similar problem to the one described in bugzilla
> 160409.  It does not look like there ever was a solution. 
> 

It does look similar, and there was no solution. We were never able to
reproduce the problem by any method other than putting it into
production. I think the theory we ended up with was that there was some
sort of lock contention problem, possibly having to do with the network
here. It was just a theory, though. We never managed to prove anything.

Do you know what you did to trigger it? I assume something other than
300 people running jBase applications?

--

Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster