On Wed 11-07-12 12:05:51, Jeff Moyer wrote:
> Jan Kara <jack@xxxxxxx> writes:
>
> >   Hello,
> >
> >   we've recently hit a deadlock in our QA runs which is caused by the
> > per-process plugging code. The problem is as follows:
> >   process A                                     process B (kjournald)
> >   generic_file_aio_write()
> >     blk_start_plug(&plug);
> >     ...
> >     somewhere in here we allocate memory and
> >     direct reclaim submits buffer X for IO
> >     ...
> >     ext3_write_begin()
> >       ext3_journal_start()
> >         we need more space in a journal
> >         so we want to checkpoint old transactions,
> >         we block waiting for kjournald to commit
> >         a currently running transaction.
> >                                                 journal_commit_transaction()
> >                                                   wait for IO on buffer X
> >                                                   to complete as it is part
> >                                                   of the current transaction
> >
> >   => deadlock since A waits for B and B waits for A to do unplug.
> > BTW: I don't think this is really ext3/ext4 specific. I think other
> > filesystems can get into problems as well when direct reclaim submits some
> > IO and the process subsequently blocks without submitting the IO.
>
> So, I thought schedule would do the flush.  Checking the code:
>
> asmlinkage void __sched schedule(void)
> {
>         struct task_struct *tsk = current;
>
>         sched_submit_work(tsk);
>         __schedule();
> }
>
> And sched_submit_work looks like this:
>
> static inline void sched_submit_work(struct task_struct *tsk)
> {
>         if (!tsk->state || tsk_is_pi_blocked(tsk))
>                 return;
>         /*
>          * If we are going to sleep and we have plugged IO queued,
>          * make sure to submit it to avoid deadlocks.
>          */
>         if (blk_needs_flush_plug(tsk))
>                 blk_schedule_flush_plug(tsk);
> }
>
> This eventually ends in a call to blk_run_queue_async(q) after
> submitting the I/O from the plug list.  Right?  So is the question
> really why doesn't the kblockd workqueue get scheduled?
  Ah, I didn't know this. Thanks for the hint. So in the kdump I have, I can
see requests queued in tsk->plug even though the process is sleeping in
TASK_UNINTERRUPTIBLE state.
So the only way the unplug could have been omitted is if
tsk_is_pi_blocked() was true. Rummaging through the dump... indeed, the task
has pi_blocked_on = 0xffff8802717d79c8. The dump is from an -rt kernel (I
just didn't originally think that made any difference), so actually any
mutex is an rtmutex and thus tsk_is_pi_blocked() is true whenever we are
sleeping on a mutex. So this seems like a bug in the rtmutex code. Thomas,
you seem to have added that condition... Any idea how to avoid the deadlock?

                                                                Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
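
For illustration only, the early return discussed above can be modelled in
plain userspace C. This is a rough sketch, not kernel code: every model_*
name below is a hypothetical stand-in invented for this example, and the
"flush" merely moves a counter. It only mirrors the control flow of the
quoted sched_submit_work(): a task that is going to sleep flushes its plug
unless it is marked as PI-blocked, in which case the queued requests stay
in the plug.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-process plug: just a request count. */
struct model_plug {
	int queued;
};

/* Hypothetical stand-in for the few task fields the check looks at. */
struct model_task {
	long state;              /* 0 = runnable, nonzero = going to sleep */
	void *pi_blocked_on;     /* non-NULL models "blocked on an rtmutex" */
	struct model_plug *plug; /* may be NULL if nothing was plugged */
	int submitted;           /* requests that actually reached the "device" */
};

static bool model_is_pi_blocked(const struct model_task *t)
{
	return t->pi_blocked_on != NULL;
}

static bool model_needs_flush_plug(const struct model_task *t)
{
	return t->plug != NULL && t->plug->queued > 0;
}

/*
 * Same branch structure as the quoted sched_submit_work().
 * Returns true if the plug was flushed, false otherwise.
 */
static bool model_sched_submit_work(struct model_task *t)
{
	if (!t->state || model_is_pi_blocked(t))
		return false;	/* early return: plug is left untouched */

	if (model_needs_flush_plug(t)) {
		t->submitted += t->plug->queued;
		t->plug->queued = 0;
		return true;
	}
	return false;
}

/* Walks through both cases: ordinary sleep vs. sleep while PI-blocked. */
static void model_demo(void)
{
	int lock_waiter = 0;	/* only its address is used, as a dummy waiter */

	/* Sleeping, not PI-blocked: the plug is flushed before sleeping. */
	struct model_plug plug_a = { .queued = 3 };
	struct model_task a = { .state = 2, .pi_blocked_on = NULL,
				.plug = &plug_a, .submitted = 0 };
	assert(model_sched_submit_work(&a));
	assert(plug_a.queued == 0 && a.submitted == 3);

	/* Sleeping and PI-blocked: the flush is skipped, requests stay queued. */
	struct model_plug plug_b = { .queued = 3 };
	struct model_task b = { .state = 2, .pi_blocked_on = &lock_waiter,
				.plug = &plug_b, .submitted = 0 };
	assert(!model_sched_submit_work(&b));
	assert(plug_b.queued == 3 && b.submitted == 0);
}
```

Calling model_demo() from a test program exercises both paths. The second
case corresponds to what the dump shows: requests still sitting in
tsk->plug while the task sleeps with pi_blocked_on set.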