Hello,

When there's heavy metadata operation traffic on ext4, the journal fills up quickly and the majority of filesystem users end up blocking on journal->j_checkpoint_mutex with a stack trace similar to the following.

 [<ffffffff8c32e758>] __jbd2_log_wait_for_space+0xb8/0x1d0
 [<ffffffff8c3285f6>] add_transaction_credits+0x286/0x2a0
 [<ffffffff8c32876c>] start_this_handle+0x10c/0x400
 [<ffffffff8c328c5b>] jbd2__journal_start+0xdb/0x1e0
 [<ffffffff8c30ee5d>] __ext4_journal_start_sb+0x6d/0x120
 [<ffffffff8c2d713e>] __ext4_new_inode+0x64e/0x1330
 [<ffffffff8c2e9bf0>] ext4_create+0xc0/0x1c0
 [<ffffffff8c2570fd>] path_openat+0x124d/0x1380
 [<ffffffff8c258501>] do_filp_open+0x91/0x100
 [<ffffffff8c2462d0>] do_sys_open+0x130/0x220
 [<ffffffff8c2463de>] SyS_open+0x1e/0x20
 [<ffffffff8c7ec5b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
 [<ffffffffffffffff>] 0xffffffffffffffff

Because the sleeps on the mutex aren't accounted as iowait, the system doesn't show the usual signs of being bogged down by IOs - both iowait and /proc/stat:procs_blocked stay misleadingly low. While propagation of iowait through locking constructs is far from being strict, heavy contention on j_checkpoint_mutex is easy to trigger and is obviously iowait, and getting it right can help users in tracking down the issue quite a bit.

Due to the way io_schedule() is implemented, it currently is hairy to add an io variant to an existing interface - the schedule() call itself, which is usually buried deep, has to be replaced with io_schedule(). As we already have current->in_iowait to mark the task as sleeping for iowait, this can be made easy by breaking up io_schedule() into multiple steps so that the preparation and marking can be done before calling an existing interface and the actual iowait accounting can be done from inside the scheduler.

What do you think?

This patchset contains the following four patches.

 0001-sched-move-IO-scheduling-accounting-from-io_schedule.patch
 0002-sched-separate-out-io_schedule_prepare-and-io_schedu.patch
 0003-mutex-add-mutex_lock_io.patch
 0004-jbd2-use-mutex_lock_io-for-journal-j_checkpoint_mute.patch

0001-0002 implement io_schedule_prepare/finish(). 0003 implements mutex_lock_io() using io_schedule_prepare/finish(). 0004 uses mutex_lock_io() on journal->j_checkpoint_mutex.

This patchset is also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-mutex_lock_io

Thanks, diffstat follows.

 fs/jbd2/commit.c       |    2 -
 fs/jbd2/journal.c      |   14 ++++++-------
 include/linux/mutex.h  |    4 +++
 include/linux/sched.h  |    8 ++-----
 kernel/locking/mutex.c |   24 ++++++++++++++++++++++
 kernel/sched/core.c    |   52 +++++++++++++++++++++++++++++++++++++-------------
 6 files changed, 79 insertions(+), 25 deletions(-)

--
tejun
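
For illustration only, the prepare/finish split described above can be modelled in a few lines of userspace C. This is a rough sketch, not the code from the patches: in_iowait is a plain variable standing in for current->in_iowait, struct mutex and mutex_lock() are toy stand-ins that only record what the scheduler would observe, and the idea of returning the previous flag value as a token is an assumption of this sketch.

```c
/* Userspace model only; none of this is kernel code. */

/* Hypothetical stand-in for current->in_iowait. */
static int in_iowait;

/* Toy lock: records the iowait flag that was visible while "sleeping". */
struct mutex {
	int locked;
	int iowait_seen;
};

/* Existing interface; the place where schedule() would be buried deep. */
static void mutex_lock(struct mutex *lock)
{
	lock->iowait_seen = in_iowait;	/* what the scheduler would account */
	lock->locked = 1;
}

/* Step 1: mark the task as sleeping for IO, hand back the old state. */
static int io_schedule_prepare(void)
{
	int old_iowait = in_iowait;

	in_iowait = 1;
	return old_iowait;
}

/* Step 2: restore the state saved by io_schedule_prepare(). */
static void io_schedule_finish(int token)
{
	in_iowait = token;
}

/* io variant built by wrapping the existing interface, not rewriting it. */
static void mutex_lock_io(struct mutex *lock)
{
	int token = io_schedule_prepare();

	mutex_lock(lock);
	io_schedule_finish(token);
}
```

In this model a caller of mutex_lock_io() has the flag set for the duration of the lock acquisition and gets its previous value back afterwards, so nested use keeps working, while plain mutex_lock() callers are unaffected.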