On Tue, Sep 25, 2012 at 04:59:55PM +0800, Guo Chao wrote:
> On Mon, Sep 24, 2012 at 06:26:54PM +1000, Dave Chinner wrote:
> > @@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
> >  static struct inode *find_inode(struct super_block *sb,
> >  				struct hlist_head *head,
> >  				int (*test)(struct inode *, void *),
> > -				void *data)
> > +				void *data, bool locked)
> >  {
> >  	struct hlist_node *node;
> >  	struct inode *inode = NULL;
> > 
> >  repeat:
> > -	hlist_for_each_entry(inode, node, head, i_hash) {
> > +	rcu_read_lock();
> > +	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
> >  		spin_lock(&inode->i_lock);
> > +		if (inode_unhashed(inode)) {
> > +			spin_unlock(&inode->i_lock);
> > +			continue;
> > +		}
> 
> Is this check too early? If the unhashed inode happened to be the target
> inode, we are wasting our time to continue the traversal and we do not wait
> on it.

If the inode is unhashed, then it is already passing through evict()
or has already passed through. If it has already passed through
evict() then it is too late to call __wait_on_freeing_inode(), as the
wakeup occurs in evict() immediately after the inode is removed from
the hash. i.e.:

	remove_inode_hash(inode);

	spin_lock(&inode->i_lock);
	wake_up_bit(&inode->i_state, __I_NEW);
	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
	spin_unlock(&inode->i_lock);

i.e. if we get the case:

Thread 1, RCU hash traversal		Thread 2, evicting foo

rcu_read_lock()
found inode foo
					remove_inode_hash(inode);
					spin_lock(&foo->i_lock);
					wake_up(I_NEW)
					spin_unlock(&foo->i_lock);
					destroy_inode()
					......
spin_lock(foo->i_lock)
match sb, ino
I_FREEING
rcu_read_unlock()

<rcu grace period can expire at any time now, so use after free is
 guaranteed at some point>

wait_on_freeing_inode
  wait_on_bit(I_NEW)
  <will never get woken>

Hence if the inode is unhashed, it doesn't matter what inode it is,
it is never valid to use it any further because it may already have
been freed. The only reason we can safely access it here at all is
that the RCU grace period will not expire until we call
rcu_read_unlock().

> > @@ -1078,8 +1098,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
> >  		struct inode *old;
> > 
> >  		spin_lock(&inode_hash_lock);
> > -		/* We released the lock, so.. */
> > -		old = find_inode_fast(sb, head, ino);
> > +		old = find_inode_fast(sb, head, ino, true);
> >  		if (!old) {
> >  			inode->i_ino = ino;
> >  			spin_lock(&inode->i_lock);
> 
> Emmmm ... couldn't we use memory barrier API instead of irrelevant spin
> lock on newly allocated inode to publish I_NEW?

Yes, we could. However, having multiple synchronisation methods for a
single variable that should only be used in certain circumstances is
something that is easy to misunderstand and get wrong. Memory barriers
are much more subtle and harder to understand than spin locks, and
every memory barrier needs to be commented to explain what the barrier
is actually protecting against.

In the case where a spin lock is guaranteed to be uncontended and the
cache line hot in the CPU cache, it makes no sense to replace the spin
lock with a memory barrier, especially when every other place we
modify the i_state/i_hash fields we have to wrap them with i_lock....
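For reference, the rest of that iget_locked() hunk looks roughly like
this (sketched from memory, with the find_inode_fast() call as per the
patch above, so don't take it as the exact patched code): the I_NEW
state and the hash insertion are both published under i_lock, which is
exactly the lock a lookup takes before it reads either of them:

	spin_lock(&inode_hash_lock);
	old = find_inode_fast(sb, head, ino, true);
	if (!old) {
		inode->i_ino = ino;
		spin_lock(&inode->i_lock);
		inode->i_state = I_NEW;
		/*
		 * Presumably this becomes hlist_add_head_rcu() once the
		 * lookup side goes RCU, but either way the i_state store
		 * and the hash insert happen under i_lock.
		 */
		hlist_add_head(&inode->i_hash, head);
		spin_unlock(&inode->i_lock);
		inode_sb_list_add(inode);
		spin_unlock(&inode_hash_lock);

		/* Return the locked inode with I_NEW set, the caller
		 * is responsible for filling in the contents
		 */
		return inode;
	}

Anyone who finds the inode in the hash and then takes i_lock is
guaranteed to see I_NEW set - there's no separate ordering to reason
about or document.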
Simple code is good code - save the complexity for something that
needs it.

> I go through many mails of the last trend of scaling VFS. Many patches
> seem quite natural, say RCU inode lookup

Sure, but the implementation in those RCU lookup patches sucked.

> or per-bucket inode hash lock or

It was a bad idea. At minimum, you can't use lockdep on it. Worse for
the realtime guys is the fact it can't be converted to a sleeping
lock. Worst was the refusal to change it in any way to address
concerns.

And realistically, the fundamental problem is not with the
inode_hash_lock, it's with the fact that the cache is based on a hash
table rather than a more scalable structure like a radix tree or
btree. This is a primary reason for XFS having its own inode cache -
hashes can only hold a certain number of entries before performance
collapses catastrophically and so don't scale well to tens or
hundreds of millions of entries.....

> per-superblock inode list lock,

Because it isn't a particularly hot lock, and given that most
workloads hit on a single filesystem, scalability is not improved by
making this change. As such, as long as there is a single linked list
used to iterate all inodes in the superblock, a single lock is as
good as scalability will get....

> did not get merged. I wonder what stopped them back then and what
> has changed that (part of) them can be considered again.

A decent RCU inode hash walk implementation is needed, mainly for NFS
servers: filehandle decode is really the only application that can
drive serious lookup concurrency on the inode hash, as the dentry
cache handles all other lookup concurrency paths.

Modifying any of the other inode/dentry locking requires profiles
that show the lock is a scalability problem and that the change makes
it go away. I know that the per-sb inode LRU lock is currently the
hottest of the inode cache locks (performance limiting at somewhere
in the range of 8-16 way workloads on XFS), and I've got work in
(slow) progress to address that. That work will also address the
per-sb dentry LRU locks, which are the hottest dentry cache locks as
well.

The only other heavily contended lock that I remember for filesystems
other than XFS is the per-bdi writeback list_lock. It is hot on ext4
under heavy metadata workloads. XFS avoids that problem by completely
avoiding the bdi inode writeback lists for inodes that are only
metadata dirty....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx