On Mon 13-09-10 12:41:28, Dave Chinner wrote:
> ping?
Pong ;) I finally had a look at this. Thanks for reporting it.

> > I just had an umount take a very long time, burning a CPU the
> > entire time. It wasn't the unmount thread, either; it was the bdi
> > flusher thread for the filesystem being unmounted. It was spinning
> > with this perf top trace:
> >
> >    553144.00 76.9% writeback_inodes_wb    [kernel.kallsyms]
> >    106434.00 14.8% __ticket_spin_lock     [kernel.kallsyms]
> >     25646.00  3.6% __ticket_spin_unlock   [kernel.kallsyms]
> >     10512.00  1.5% _raw_spin_lock         [kernel.kallsyms]
> >      9606.00  1.3% put_super              [kernel.kallsyms]
> >      7920.00  1.1% __put_super            [kernel.kallsyms]
> >      5592.00  0.8% down_read_trylock      [kernel.kallsyms]
> >        46.00  0.0% kfree                  [kernel.kallsyms]
> >        22.00  0.0% __do_softirq           [kernel.kallsyms]
> >        19.00  0.0% wb_writeback           [kernel.kallsyms]
> >        16.00  0.0% wb_do_writeback        [kernel.kallsyms]
> >         8.00  0.0% queue_io               [kernel.kallsyms]
> >         6.00  0.0% run_timer_softirq      [kernel.kallsyms]
> >         6.00  0.0% local_bh_enable_ip     [kernel.kallsyms]
> >
> > This went on for ~7m25s (according to the pmchart trace I had on
> > screen) before something broke the livelock by writing the inodes
> > to disk (maybe the xfssyncd), and the unmount then completed a
> > couple of seconds later.
> >
> > From the above profile, I'm assuming that writeback_inodes_wb() was
> > seeing pin_sb_for_writeback(sb) fail and moving dirty inodes from
> > b_io to b_more_io, then being called again, splicing the inodes on
> > b_more_io back to b_io, then failing pin_sb_for_writeback() again
> > for each inode and moving them back to b_more_io....
> >
> > This is on 2.6.36-rc1 + the radix tree fixes for writeback.

Indeed, your analysis looks correct. The trouble is the following:

    Flusher thread                      Umount
    - starts processing background
      writeback
                                        - gets s_umount for writing
                                        - queues syncing work for the
                                          flusher
                                        - waits until the flusher
                                          thread gets to it
    - loops infinitely, trying to get
      s_umount for reading

In principle a classic ABBA deadlock (a self-contained userspace
analogue is sketched in the PS below). Actually, there are more
complicated (and harder to hit) cases like:

    Flusher thread              Sync                    Remount
    - processes background
      writeback
                                - gets s_umount for
                                  reading
                                - queues syncing work
                                - waits for the syncing
                                  work
                                                        - tries to get
                                                          s_umount for
                                                          writing and
                                                          blocks
    - now loops infinitely since it
      cannot get s_umount for reading
      anymore (a queued writer makes
      down_read_trylock() fail)

The question is how to resolve this properly. Cases like the second one
above show that it is not enough to just somehow work around writeback
during umount. Also, it is not only background writeback that can get
deadlocked like this, but generally anything submitted via
__bdi_start_writeback() (as these kinds of writeback do not have a
superblock specified).

I think the best resolution of this problem would be to change the work
that is submitted via bdi_start_writeback() (i.e., the work without a
superblock = the work which needs to do the locking) to a "target based
scheme" like Christoph already wanted some time ago (a toy model of
what I mean is in the PPS below). I actually have a patch doing this
for background writeback, so I will just modify it to apply to a wider
range of writeback as well. Or Christoph, do you already have some
patches in this direction?

								Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
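
PS: To make the first livelock concrete, here is a self-contained
userspace analogue (an illustration only, not kernel code; the pthread
rwlock stands in for s_umount and the comments map the pieces back to
the kernel names). The "flusher" spins on a read-trylock the way
writeback_inodes_wb() spins on pin_sb_for_writeback(), while "umount"
holds the write lock and waits for its queued work item. Build with
gcc -pthread; the program burns a CPU forever and never prints
"umount done".

/* deadlock-demo.c: userspace analogue of the flusher/umount deadlock.
 * Illustration only -- the names mirror the kernel ones but nothing
 * here is actual kernel code.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_rwlock_t s_umount = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_done_cond = PTHREAD_COND_INITIALIZER;
static bool sync_work_done;	/* the work item umount waits for */

static void *flusher_thread(void *arg)
{
	(void)arg;
	/*
	 * Background writeback: like pin_sb_for_writeback(), try to
	 * take s_umount for reading. Umount already holds it for
	 * writing, so the trylock fails for every inode, the inode is
	 * "requeued", and we retry forever -- matching the profile
	 * above (writeback_inodes_wb + down_read_trylock burning CPU).
	 */
	while (pthread_rwlock_tryrdlock(&s_umount) != 0)
		;	/* requeue_io(inode); splice b_more_io back; retry */
	pthread_rwlock_unlock(&s_umount);

	/* Only after background writeback finishes would the queued
	 * sync work run -- and that is what would wake umount up. */
	pthread_mutex_lock(&work_lock);
	sync_work_done = true;
	pthread_cond_signal(&work_done_cond);
	pthread_mutex_unlock(&work_lock);
	return NULL;
}

int main(void)
{
	pthread_t flusher;

	/* Umount: take s_umount for writing... */
	pthread_rwlock_wrlock(&s_umount);

	/* ...while the flusher is doing background writeback... */
	pthread_create(&flusher, NULL, flusher_thread, NULL);

	/* ...then queue a sync work item and wait for the flusher to
	 * get to it. It never will: classic ABBA. */
	pthread_mutex_lock(&work_lock);
	while (!sync_work_done)
		pthread_cond_wait(&work_done_cond, &work_lock);
	pthread_mutex_unlock(&work_lock);

	pthread_rwlock_unlock(&s_umount);
	printf("umount done\n");	/* never reached */
	pthread_join(flusher, NULL);
	return 0;
}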
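
PPS: And a toy model of what I mean by the "target based scheme"
(hypothetical illustration, not the actual patch; all names here are
made up). The work item carries a page target instead of an implicit
superblock, and the loop gives up as soon as a pass makes no progress,
so a superblock that cannot be pinned (e.g. one in the middle of
umount) can no longer livelock the flusher:

#include <stdbool.h>
#include <stdio.h>

struct toy_inode {
	bool pinnable;		/* would down_read_trylock(s_umount) succeed? */
	long dirty_pages;
};

/* One pass over b_io; returns pages written (0 = no progress made). */
static long writeback_pass(struct toy_inode *inodes, int n, long quota)
{
	long written = 0;

	for (int i = 0; i < n && written < quota; i++) {
		if (!inodes[i].pinnable)
			continue;	/* skip it instead of retrying forever */
		written += inodes[i].dirty_pages;
		inodes[i].dirty_pages = 0;
	}
	return written;
}

int main(void)
{
	struct toy_inode inodes[] = {
		{ .pinnable = false, .dirty_pages = 100 }, /* sb being unmounted */
		{ .pinnable = true,  .dirty_pages = 30 },
		{ .pinnable = true,  .dirty_pages = 50 },
	};
	long target = 1000, written = 0;

	while (written < target) {
		long progress = writeback_pass(inodes, 3, target - written);

		if (progress == 0)
			break;	/* target unmet but no progress: stop, don't spin */
		written += progress;
	}
	printf("wrote %ld pages, stopped cleanly\n", written);
	return 0;
}

Running it prints "wrote 80 pages, stopped cleanly": the unpinnable
superblock's pages are simply left behind instead of being retried
forever.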