On Tue, Nov 26, 2013 at 10:17:16AM +1100, Dave Chinner wrote: > On Sun, Nov 24, 2013 at 06:38:25PM -0500, Johannes Weiner wrote: > > Reclaim will be leaving shadow entries in the page cache radix tree > > upon evicting the real page. As those pages are found from the LRU, > > an iput() can lead to the inode being freed concurrently. At this > > point, reclaim must no longer install shadow pages because the inode > > freeing code needs to ensure the page tree is really empty. > > > > Add an address_space flag, AS_EXITING, that the inode freeing code > > sets under the tree lock before doing the final truncate. Reclaim > > will check for this flag before installing shadow pages. > > > > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx> > .... > > @@ -545,10 +546,25 @@ static void evict(struct inode *inode) > > */ > > inode_wait_for_writeback(inode); > > > > + /* > > + * Page reclaim can not do iput() and thus can race with the > > + * inode teardown. Tell it when the address space is exiting, > > + * so that it does not install eviction information after the > > + * final truncate has begun. > > + * > > + * As truncation uses a lockless tree lookup, acquire the > > + * spinlock to make sure any ongoing tree modification that > > + * does not see AS_EXITING is completed before starting the > > + * final truncate. > > + */ > > + spin_lock_irq(&inode->i_data.tree_lock); > > + mapping_set_exiting(&inode->i_data); > > + spin_unlock_irq(&inode->i_data.tree_lock); > > + > > if (op->evict_inode) { > > op->evict_inode(inode); > > } else { > > - if (inode->i_data.nrpages) > > + if (inode->i_data.nrpages || inode->i_data.nrshadows) > > truncate_inode_pages(&inode->i_data, 0); > > clear_inode(inode); > > } > > Ok, so what I see here is that we need a wrapper function that > handles setting the AS_EXITING flag and doing the "final" > truncate_inode_pages() call, and the locking for the AS_EXITING flag > moved into mapping_set_exiting() > > That is, because this AS_EXITING flag and it's locking constraints > are directly related to the upcoming truncate_inode_pages() call, > I'd prefer to see a helper that captures that relationship used > in all the filesystem code. e.g: > > void truncate_inode_pages_final(struct address_space *mapping) > { > spin_lock_irq(&mapping->tree_lock); > mapping_set_exiting(mapping); > spin_unlock_irq(&mapping->tree_lock); > if (inode->i_data.nrpages || inode->i_data.nrshadows) > truncate_inode_pages_range(mapping, 0, (loff_t)-1); > } > > And document it in Documentation/filesystems/porting as a mandatory > function to be called from ->evict_inode() implementations before > calling clear_inode(). You can then replace all the direct calls to > truncate_inode_pages() in the evict_inode() path with a call to > truncate_inode_pages_final(). Ok, fair enough. I'll add a BUG_ON(!mapping_exiting(&inode->i_data)) to the inode sanity checks on final teardown to make sure filesystems don't miss the change to truncate_inode_pages_final(). > As it is, I'd really like to see that unconditional irq disable go > away from this code - disabling and enabling interrupts for every > single inode we reclaim is going to add significant overhead to this > hot code path. And given that: > > > +static inline void mapping_set_exiting(struct address_space *mapping) > > +{ > > + set_bit(AS_EXITING, &mapping->flags); > > +} > > + > > +static inline int mapping_exiting(struct address_space *mapping) > > +{ > > + return test_bit(AS_EXITING, &mapping->flags); > > +} > > these atomic bit ops, why do we need to take the tree_lock and > disable irqs in evict() to set this bit if there's nothing to > truncate on the inode? i.e. something like this: > > void truncate_inode_pages_final(struct address_space *mapping) > { > mapping_set_exiting(mapping); > if (inode->i_data.nrpages || inode->i_data.nrshadows) { > /* > * spinlock barrier to ensure all modifications are > * complete before we do the final truncate > */ > spin_lock_irq(&mapping->tree_lock); > spin_unlock_irq(&mapping->tree_lock); > truncate_inode_pages_range(mapping, 0, (loff_t)-1); > } That would almost work, but we need to enforce ordering of the counter reads and updates or truncation might read 0 on both while racing with reclaim. Reclaim would have to do: spin_lock_irq(&mapping->tree_lock) if !mapping_exiting(): swap shadow entry mapping->nrshadows++ smp_wmb() mapping->nrpages-- spin_unlock_irq(&mapping->tree_lock) and the final truncate side would have to do mapping_set_exiting() nrpages = mapping->nrpages smp_rmb() nrshadows = mapping->nrshadows if (nrpages || nrshadows) spin_lock_irq(&mapping->tree_lock) spin_unlock_irq(&mapping->tree_lock) truncate -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html