On Jun 26, 2003 06:46 -0700, Dale wrote:
> I have a problem with an NFS server for my network. It has run kernels
> 2.4.18-ac4 - 2.4.21-ac1, all with problems. The -ac patches are used
> to provide the new style quota support. The system seems to have
> gotten even less stable with the new kernel versions.
>
> This morning around 5 am, I got a page that the system was not
> responding to NFS requests. I ssh'd in, and found the loadavg at ~50.
> Below are some snippets from ps at the time:
>
> root 3414 0.8 0.1 3904 3048 ? DN 04:02 1:45
> /usr/bin/updatedb -f NFS,SMBFS,NCPFS,PROC,DEVPTS -e /tmp,/var/tmp,/us
> root 3979 0.0 0.0 2588 1192 ? DN 04:14 0:00
> /usr/bin/rsync -aH --delete /home/puser1 /home/puser2 /home/puser3
>
> The rsync command is backing up across the network to a backup NFS
> server. updatedb starts at 4:02 am, and the rsync had been running
> since 3:30 and was half-way completed (estimated by the 'p' in the
> username).
>
> Also there were 32 nfsd's just like this:
> root 851 0.0 0.0 0 0 ? DW Jun19 4:35 [nfsd]
>
> and these, the other 4 kjournald's were in SW.
> root 7 0.1 0.0 0 0 ? DW Jun19 17:04 [kswapd]
> root 144 0.0 0.0 0 0 ? DW Jun19 6:53 [kjournald]
>
> I'm wondering what my options are. This has happened ~10 times in the
> last 6 months, although the system went a period of ~120 days without a
> hiccup. This last time on 2.4.21-ac1 was only 6 days.
> It wouldn't be so bad if a `shutdown -r now` would restart it, but it
> hangs while shutting down nfs and during killall and needs a hard
> reboot.

This is almost certainly a lock deadlock of some sort. I've had pretty
good luck in debugging such problems just by running "sysrq-T" on the
console and/or using "crash" to examine the running kernel. This needs
a fair amount of knowledge of the various locks in ext3.

The most common problems are lock-ordering problems: one process starts
a journal transaction and then blocks on a lock (e.g. directory or inode
semaphore, or superblock lock), while another process holds that lock
and tries to start a new transaction when the journal is full.

The journal being full is the crucial issue. If it isn't full, you can
start a new transaction without problems. When it is full, you need to
flush the journal and wait for all existing users to free up their
handles, which will never happen if the first process has a transaction
handle and is blocked waiting for a lock the second process is holding.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

_______________________________________________
Ext3-users@redhat.com
https://www.redhat.com/mailman/listinfo/ext3-users
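
The lock-ordering deadlock described in the reply can be pictured as a
tiny wait-for graph. What follows is only an illustrative sketch, not
ext3 code: the task names "A" and "B" and the helpers wait_graph() and
find_cycle() are hypothetical, and the model assumes each task waits on
at most one other task. A is the process that holds an open transaction
handle and blocks on a lock; B is the process that holds that lock and
wants to start a transaction.

```python
def wait_graph(journal_full):
    """Build a toy wait-for graph: task -> the task it is blocked on."""
    # A has an open transaction handle and is blocked on a lock
    # (directory/inode semaphore or superblock lock) that B holds.
    waits_for = {"A": "B"}
    if journal_full:
        # With a full journal, B cannot start its transaction until
        # every open handle is released, and A holds one of them.
        waits_for["B"] = "A"
    # With free journal space, B starts its transaction at once,
    # finishes, and releases the lock, so B waits on nobody.
    return waits_for


def find_cycle(waits_for):
    """Return the tasks forming a wait cycle, or None if there is none."""
    for start in waits_for:
        seen = [start]
        node = waits_for.get(start)
        # Follow the chain of "is blocked on" edges from this task.
        while node is not None and node not in seen:
            seen.append(node)
            node = waits_for.get(node)
        if node is not None:
            # We came back to a task already on the chain: a deadlock.
            return seen[seen.index(node):]
    return None


if __name__ == "__main__":
    for full in (False, True):
        print("journal full:", full, "cycle:", find_cycle(wait_graph(full)))
```

In this toy model the cycle only appears when the journal is full, which
matches the point made above: the same lock ordering is harmless as long
as a new transaction can start without waiting for existing handles.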