On 10/26/2014 08:25 AM, Thomas Gleixner wrote:
> On Thu, 23 Oct 2014, Chris Friesen wrote:
>> On 10/17/2014 12:55 PM, Austin Schuh wrote:
>>> Use the 121 patch. This sounds very similar to the issue that I helped
>>> debug with XFS. There ended up being a deadlock due to a bug in the
>>> kernel work queues. You can search the RT archives for more info.
>>
>> I can confirm that the problem still shows up with the rt121 patch. (And
>> also with Paul Gortmaker's proposed 3.4.103-rt127 patch.)
>>
>> We added some instrumentation and it looks like we've tracked down the problem.
>> Figuring out how to fix it is proving to be tricky.
>>
>> Basically it looks like we have a circular dependency involving the
>> inode->i_data_sem rt_mutex, the PG_writeback bit, and the BJ_Shadow list. It
>> goes something like this:
>>
>> jbd2_journal_commit_transaction:
>> 1) set page for writeback (set PG_writeback bit)
>> 2) put jbd2 journal head on BJ_Shadow list
>> 3) sleep on PG_writeback bit waiting for page writeback complete
>>
>> ext4_da_writepages:
>> 1) ext4_map_blocks() acquires inode->i_data_sem for writing
>> 2) do_get_write_access() sleeps waiting for jbd2 journal head to come off
>>    the BJ_Shadow list
>>
>> At this point the flush code can't run because it can't acquire
>> inode->i_data_sem for reading, so the page will never get written out.
>> Deadlock.
>
> Sorry, I really cannot map that sparse description to any code
> flow. Proper callchains for the involved parts might help to actually
> understand what you are looking for.

There are details (stack traces, etc.) in the first message in the thread:
http://www.spinics.net/lists/linux-rt-users/msg12261.html

Originally we had thought that nfsd might have been implicated somehow,
but it seems like it was just a trigger (possibly by increasing the rate
of sync I/O).

In the interest of full disclosure I should point out that we're using a
modified kernel so there is a chance that we have introduced the problem
ourselves. That said, we have not made significant changes to either
ext4 or jbd2. (Just a couple of minor cherry-picked bugfixes.)

The relevant code paths are:

Journal commit. The important thing here is that we set PG_writeback
on a page, put the jbd2 journal head on the BJ_Shadow list, and then
sleep waiting for page writeback to complete. If page writeback never
completes, the journal head never comes off the BJ_Shadow list.

jbd2_journal_commit_transaction
  journal_submit_data_buffers
    journal_submit_inode_data_buffers
      generic_writepages
        set_page_writeback(page) [PG_writeback]
  jbd2_journal_write_metadata_buffer
    __jbd2_journal_file_buffer(jh_in, transaction, BJ_Shadow);
  journal_finish_inode_data_buffers
    filemap_fdatawait
      filemap_fdatawait_range
        wait_on_page_writeback(page)
          wait_on_page_bit(page, PG_writeback) <--stuck here
  jbd2_journal_unfile_buffer(journal, jh) [delete from BJ_Shadow list]
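
As a rough illustration of the ordering in the trace above, here is a toy Python model. It is not kernel code: the function name, the state dictionary and the writeback_completes flag are made up for this sketch and only mirror the steps listed. It shows that the journal head stays on BJ_Shadow for as long as the writeback wait does not finish.

```python
# Toy model of the commit ordering described above. Not kernel code:
# the names here are hypothetical and only mirror the steps in the trace.

def commit_transaction(state, writeback_completes):
    """Run the commit steps in order and return where the commit ends up."""
    state["PG_writeback"] = True           # set_page_writeback(page)
    state["on_BJ_Shadow"] = True           # journal head filed on BJ_Shadow
    if writeback_completes:
        state["PG_writeback"] = False      # I/O completion clears the bit
    if state["PG_writeback"]:
        # The real code sleeps here; the unfile step below is never reached.
        return "stuck in wait_on_page_bit"
    state["on_BJ_Shadow"] = False          # jbd2_journal_unfile_buffer()
    return "commit finished"


stuck = {}
print(commit_transaction(stuck, writeback_completes=False), stuck)
done = {}
print(commit_transaction(done, writeback_completes=True), done)
```

In the stuck case the model ends with on_BJ_Shadow still set, which is the condition the second code path ends up waiting on.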

We can get to the code path below a couple of different ways (see
further down). The important points here are:

1) There is a code path that takes i_data_sem for writing and then goes
   to sleep waiting for the jbd2 journal head to be removed from the
   BJ_Shadow list. If the journal head never comes off the list, the
   semaphore will never be released.
2) ext4_map_blocks() always takes a read lock on i_data_sem. If the
   semaphore is held by a task waiting for the journal head to come off
   the list, the reader will block.

ext4_da_writepages
  write_cache_pages_da
    mpage_da_map_and_submit
      ext4_map_blocks
        down_read((&EXT4_I(inode)->i_data_sem))
        up_read((&EXT4_I(inode)->i_data_sem))
        down_write((&EXT4_I(inode)->i_data_sem))
        ext4_ext_map_blocks
          ext4_mb_new_blocks
            ext4_mb_mark_diskspace_used
              __ext4_journal_get_write_access
                jbd2_journal_get_write_access
                  do_get_write_access
                    wait on BJ_Shadow list
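
To make the reader/writer interaction concrete, here is a small Python sketch. ToyRWSem is a hypothetical class written for this mail, with non-blocking "try" operations instead of sleeping; it is not the kernel's rw_semaphore or the rt_mutex-based RT variant. It only shows that a second task entering ext4_map_blocks() cannot take the read lock while the first task still holds the write lock.

```python
# Toy reader/writer semaphore with non-blocking acquires. Hypothetical
# class for illustration only; a False return stands for "would block".

class ToyRWSem:
    def __init__(self):
        self.readers = 0
        self.writer = None

    def try_down_read(self):
        if self.writer is not None:
            return False              # would block behind the writer
        self.readers += 1
        return True

    def up_read(self):
        self.readers -= 1

    def try_down_write(self, who):
        if self.writer is not None or self.readers:
            return False              # would block
        self.writer = who
        return True

    def up_write(self):
        self.writer = None


i_data_sem = ToyRWSem()

# Task A: lookup under the read lock, then the write lock for allocation.
i_data_sem.try_down_read()
i_data_sem.up_read()
i_data_sem.try_down_write("A")
# Task A now sleeps in do_get_write_access() without releasing the lock.

# Task B enters ext4_map_blocks() and tries the initial read lock.
print("B gets read lock:", i_data_sem.try_down_read())
```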

One of the ways we end up at ext4_da_writepages() is via the page
writeback thread. If i_data_sem is already held by a task that is
sleeping, this can result in pages not getting written out.

bdi_writeback_thread
  wb_do_writeback
    wb_check_old_data_flush
      wb_writeback
        __writeback_inodes_wb
          writeback_sb_inodes
            writeback_single_inode
              do_writepages
                ext4_da_writepages

Another way to end up at ext4_da_writepages() is via sync writev()
calls. In the traces from my original report this path ended up taking
the semaphore and then going to sleep waiting for the journal head to
be removed from the BJ_Shadow list.

sys_writev
  vfs_writev
    do_readv_writev
      do_sync_readv_writev
        ext4_file_write
          generic_file_aio_write
            generic_write_sync
              ext4_sync_file
                filemap_write_and_wait_range
                  __filemap_fdatawrite_range
                    do_writepages
                      ext4_da_writepages
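
Putting the pieces together, the dependency described in this mail can be written as a wait-for graph. The sketch below is only a summary of the description above, not something derived from the kernel source. The node names are informal labels picked for this example, and find_cycle is a hypothetical helper that assumes each task waits on at most one other.

```python
# Wait-for graph for the scenario described in this mail. Each entry
# means "key waits for value"; the labels are informal, not kernel names.

waits_for = {
    "jbd2 commit thread": "flusher thread",  # PG_writeback must clear
    "flusher thread": "writev task",         # needs i_data_sem for reading
    "writev task": "jbd2 commit thread",     # journal head off BJ_Shadow
}


def find_cycle(graph, start):
    """Follow the single outgoing edge from each node; return the cycle or None."""
    path = []
    node = start
    while node is not None and node not in path:
        path.append(node)
        node = graph.get(node)
    if node is None:
        return None                   # chain ended, nobody waits in a loop
    return path[path.index(node):] + [node]


print(" -> ".join(find_cycle(waits_for, "writev task")))
```

Breaking any one of the three edges makes find_cycle return None, which is one way to think about where a fix could go.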
Chris