1. Introduction This lengthy investigation was caused by a deadlock problem in LogFS, but uncovered a more general problem. It affects, at the least, all filesystems that need to read inodes in their write path. To my knowledge, that includes LogFS and NTFS, possibly also JFS and XFS. Deadlock happens when two processes A and B (that may be identical) have these two call chains: Process A: Process B: inode_wait [filesystem locking write path] __wait_on_bit __writeback_single_inode out_of_line_wait_on_bit ifind_fast [filesystem calling iget()] [filesystem locking write path] Process A will wait_on_inode() and block until process B proceeds through __sync_single_inode(), which is called from __writeback_single_inode() in Process B. Process B will block on the lock of the filesystem write path, held by Process A. After investigating several quick solutions - none of them worked - I started asking the question of why ifind_fast() should wait for __sync_single_inode() to finish. Does that make sense? So I tried to describe what I_LOCK is used for. If I made any mistakes in my investigations, please correct me. 2. The usage of inode_lock and I_LOCK Almost all modifications of inodes are protected by the inode_lock, a global spinlock. Some modifications, however, can block for various reasons and require the inode_lock to get dropped temporarily. In the meantime, the individual inode needs to get protected somehow. Usually this happens through the use of I_LOCK. But I_LOCK is not a simple mutex. It is a Janus-faced bit in the inode that is used for several things, including mutual exclusion and completion notification. Most users are open-coded, so it is not easy to follow, but can be summarized in the table below. In this table columns indicate events when I_LOCK is either set or reset (or not reset but all waiters are notified anyway). Rows indicate code that either checks for I_LOCK and changes behaviour depending on its state or is waiting until I_LOCK gets reset (or is waiting even if I_LOCK is not set). __sync_single_inode | get_new_inode[_fast] | | unlock_new_inode | | | dispose_list | | | | generic_delete_inode | | | | | generic_forget_inode lock v v | | | | unlock/complete v v v v v comment ------------------------------------------------------------------------------- __writeback_single_inodeX O O O O sync write_inode_now X O O O O sync clear_inode X O O O O sync __mark_inode_dirty X O O O O lists generic_osync_inode ? ? ? ? ? ? get_new_inode[_fast] O X O O O mutex find_inode[_fast] O O X X X I_FREEING ifind[_fast] O X O O O read jfs txCommit ? ? ? ? ? ? xfs_ichgtime[_fast] ? ? ? ? ? ? Comments: sync - wait for writeout to finish lists - move inode to dirty list without racing against __sync_single_inode mutex - protect against two concurrent get_new_inode[_fast] creating two inodes I_FREEING - wait for inode to get freed, then repeat read - don't return inode until it is read from medium Now, the "X"s mark combinations where columns and rows are related. "O"s mark combinations where afaiks columns and rows share no relationship whatsoever except that both use either I_LOCK or wake_up_inode()/wait_on_inode() or any other of the partially open-coded variants. Three rows have not exposed their meaning to me yet, so I'd gladly receive some insight here. 3. Seperating out sync notification One of the results from the investigations in 2 appears to be that one class of users in fs/fs-writeback.c is completely unrelated to another class of users in fs/inode.c. In particular, __sync_single_inode(), __writeback_single_inode(), write_inode_now(), clear_inode(), __mark_inode_dirty() and (possibly?) generic_osync_inode() seem to only need a completion event to synchronize with. There is no reason why this group should share a lock with iget() and any of its many permutations. Now, if the group in fs/fs-writeback.c had a completion event that is independent of anything in fs/inode.c, the deadlock scenario described in section 1 goes away. As a further result, ilookup5_nowait() can get removed as its only user, NTFS, only introduced it as a workaround for the deadlock scenario. Potentially the checks in JFS and XFS can also get removed, but I don't claim to understand their meaning yet. Comments? Jörn -- Prosperity makes friends, adversity tries them. -- Publilius Syrus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html