Andreas Dilger wrote:
On Jun 16, 2007 08:17 -0500, John Marconi wrote:
I am running into a situation in which one of my ext3 filesystems is
getting hung during normal usage. There are three ext3 filesystems on a
CompactFLASH. One is mounted as / and one as /tmp. In my test, I am
copying a 100 MB file from /root to /tmp repeatedly. While doing this
test, I eventually see the copying stop, and any attempt to access /tmp
fails - even a simple "ls /tmp" hangs.
I suspect kjournald because of the following ps output:
PID PPID WCHAN:20 PCPU %MEM PSR COMM
8847 99 start_this_handle 1.1 0.0 28 pdflush
8853 99 schedule_timeout 0.2 0.0 7 pdflush
188 1 kswapd 0.0 0.0 19 kswapd0
8051 1 mtd_blktrans_thread 0.0 0.0 22 mtdblockd
8243 1 kjournald 0.0 0.0 0 kjournald
8305 1 schedule_timeout 0.0 0.0 2 udevd
8378 1 kjournald 0.0 0.0 0 kjournald
8379 1 journal_commit_trans 16.6 0.0 0 kjournald
8437 1 schedule_timeout 0.0 0.0 0 evlogd
8527 1 syslog 0.0 0.0 1 klogd
8534 1 schedule_timeout 0.0 0.0 0 portmap
8569 1 schedule_timeout 0.0 0.0 0 rngd
8639 1 schedule_timeout 0.1 0.0 24 sshd
8741 8639 schedule_timeout 0.0 0.0 0 sshd
8743 8741 wait 0.0 0.0 9 bash
8857 8743 schedule_timeout 4.9 0.0 7 cp
8664 1 schedule_timeout 0.0 0.0 0 xinetd
8679 1 schedule_timeout 0.0 0.0 0 evlnotifyd
8689 1 schedule_timeout 0.0 0.0 0 evlactiond
8704 1 wait 0.0 0.0 1 bash
8882 8704 - 0.0 0.0 2 ps
If I run ps repeatedly, I always see process 8379 in
journal_commit_transaction, consuming between 12% and 20% of processor 0.
This process never completes. Process 8847 is likewise permanently stuck
in start_this_handle, so I believe the two are related.
This system is using a 2.6.14 kernel.
Please try to reproduce with a newer kernel, as this kind of problem
might have been fixed already.
A few tips for debugging this kind of issue:
- you need to have detailed stack traces (e.g. sysrq-t) of all the
interesting processes
- if a process is stuck inside a large function (e.g. 8379 in this example),
you need to provide the exact line number. This can be found by compiling
the kernel with CONFIG_DEBUG_INFO (-g flag to gcc), running "gdb vmlinux",
and using "list *(journal_commit_transaction+{offset})", where the byte
offset is printed in the sysrq-t output. Then include the code surrounding
that line from the source file.
- a process stuck in "start_this_handle()" is often just an innocent
bystander. It is waiting for the currently committing transaction to
complete before it can start a new filesystem-modifying operation (handle).
That said, the journal handle acts like a lock and has been the cause of
many deadlock problems (e.g. process 1 holds lock, waits for handle;
process 2 holds transaction open waiting for lock). pdflush might be a
"process 1" type of task here, with some other process holding the
transaction open and preventing it from completing.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Andreas,
Thanks for the information.
I am not able to update the entire kernel to a new version for a variety
of reasons, however I can update certain parts in my system (such as the
filesystem). I did a diff of the 2.6.16 kernel against my kernel, and
the changes to jbd were minimal. I plan on looking at the latest
versions of the kernel to determine if anything has changed since 2.6.16.
I took a look at the place that kjournald was stuck - it is in the
journal_commit_transaction "while (commit_transaction->t_updates)" loop,
and it is trying to spin_lock(&journal->j_state_lock). When I look at
pdflush, it is also trying to take the journal->j_state_lock. Do you
have any tips on finding out which process might own journal->j_state_lock?
Thanks again,
John
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html