On Fri, May 22, 2020 at 04:48:59PM -0700, Darrick J. Wong wrote:
> On Fri, May 22, 2020 at 01:50:24PM +1000, Dave Chinner wrote:
> > @@ -2649,53 +2650,12 @@ xfs_ifree_cluster(
> >  		bp->b_ops = &xfs_inode_buf_ops;
> >  
> >  		/*
> > -		 * Walk the inodes already attached to the buffer and mark them
> > -		 * stale. These will all have the flush locks held, so an
> > -		 * in-memory inode walk can't lock them. By marking them all
> > -		 * stale first, we will not attempt to lock them in the loop
> > -		 * below as the XFS_ISTALE flag will be set.
> > -		 */
> > -		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> > -			if (lip->li_type == XFS_LI_INODE) {
> > -				iip = (struct xfs_inode_log_item *)lip;
> > -				xfs_trans_ail_copy_lsn(mp->m_ail,
> > -						&iip->ili_flush_lsn,
> > -						&iip->ili_item.li_lsn);
> > -				xfs_iflags_set(iip->ili_inode, XFS_ISTALE);
> > -			}
> > -		}
>
> Hm. I think I'm a little confused here. I think the consequence of
> attaching inode items to the buffer whenever we dirty the inode is
> that we no longer need to travel the inode list to set ISTALE because
> we know that the lookup loop below is sufficient to catch all of the
> inodes that are still hanging around in memory?

Yes. The issue here is that we now have inodes on this list that are
not flush locked, so we can't just walk it and change the flush state
without first taking the ILOCK to ensure the inode is not racing with
reclaim, etc.

> We don't call xfs_ifree_cluster if any of the inodes in it are
> allocated, which means that all the on-disk inodes are either (a) not
> represented in memory, in which case we don't have to stale them; or
> (b) they're in memory and dirty (because they've recently been freed).
> But if that's true, then surely you could find them all via b_li_list?

No, we can have inodes that are free but clean in memory when we free
the cluster. Do an unlink of an inode, commit it, push the AIL, and
the inode gets written back; it is now clean in memory with mode == 0
and in the XFS_IRECLAIMABLE state. That inode is not reclaimed until
memory reclaim or the background reclaimer finds it and reclaims it.
Those are the inodes we want to mark XFS_ISTALE so that lookup treats
them correctly if the cluster is reallocated and the inode
reinstantiated before it is reclaimed from memory.

> On the other hand, it seems redundant to iterate the list /and/ do
> the lookups, and we no longer need to "set it up for being staled",
> so the second loop survives?

Yeah, the second loop is used because we have to look up the cache
anyway, and there's no point in iterating the buffer list because it
now has to do all the same "is it clean" checks, flush locking, etc.
that the cache lookup has to do. A future patchset can make the cache
lookup use a gang lookup on the radix tree to optimise it if
necessary...
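
To make the "same checks as the cache lookup" point concrete, the
per-inode lookup ends up shaped roughly like the code below. This is
a simplified sketch for illustration, not the patch code: the flush
state handling, attaching the inode to the cluster buffer, and the
retry-on-lock-failure logic are all elided.

/*
 * Sketch only: look up the cached inode for @inum and mark it
 * XFS_ISTALE if it is still present in memory. The real code also
 * sorts out the flush state, attaches the inode to the cluster
 * buffer and retries if it cannot get the ILOCK.
 */
static void
xfs_ifree_mark_inode_stale_sketch(
	struct xfs_perag	*pag,
	struct xfs_inode	*free_ip,
	xfs_ino_t		inum)
{
	struct xfs_mount	*mp = pag->pag_mount;
	struct xfs_inode	*ip;

	/* RCU protects the cache lookup against a racing inode free. */
	rcu_read_lock();
	ip = radix_tree_lookup(&pag->pag_ici_root,
			XFS_INO_TO_AGINO(mp, inum));
	if (!ip) {
		/* Inode is not in memory, nothing to mark stale. */
		rcu_read_unlock();
		return;
	}

	/*
	 * Revalidate under the i_flags_lock - an RCU lookup can find a
	 * recently freed or reallocated inode. Skip it if it is the
	 * wrong inode or has already been marked stale.
	 */
	spin_lock(&ip->i_flags_lock);
	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE))
		goto out_unlock;

	/*
	 * Take the ILOCK to serialise against reclaim and inode
	 * flushing before changing any state. free_ip is already
	 * locked by the caller.
	 */
	if (ip != free_ip && !xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
		goto out_unlock;

	__xfs_iflags_set(ip, XFS_ISTALE);

	if (ip != free_ip)
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
out_unlock:
	spin_unlock(&ip->i_flags_lock);
	rcu_read_unlock();
}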
> > -
> > -
> > -		/*
> > -		 * For each inode in memory attempt to add it to the inode
> > -		 * buffer and set it up for being staled on buffer IO
> > -		 * completion. This is safe as we've locked out tail pushing
> > -		 * and flushing by locking the buffer.
> > -		 *
> > -		 * We have already marked every inode that was part of a
> > -		 * transaction stale above, which means there is no point in
> > -		 * even trying to lock them.
> > +		 * Now we need to set all the cached clean inodes as XFS_ISTALE,
> > +		 * too. This requires lookups, and will skip inodes that we've
> > +		 * already marked XFS_ISTALE.
> >  		 */
> > -		for (i = 0; i < igeo->inodes_per_cluster; i++) {
> > -			ip = xfs_ifree_get_one_inode(pag, free_ip, inum + i);
> > -			if (!ip)
> > -				continue;
> > -
> > -			iip = ip->i_itemp;
> > -			spin_lock(&iip->ili_lock);
> > -			iip->ili_last_fields = iip->ili_fields;
> > -			iip->ili_fields = 0;
> > -			iip->ili_fsync_fields = 0;
> > -			spin_unlock(&iip->ili_lock);
> > -			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
> > -					&iip->ili_item.li_lsn);
> > -
> > -			list_add_tail(&iip->ili_item.li_bio_list,
> > -					&bp->b_li_list);
> > -
> > -			if (ip != free_ip)
> > -				xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > -		}
> > +		for (i = 0; i < igeo->inodes_per_cluster; i++)
> > +			xfs_ifree_mark_inode_stale(pag, free_ip, inum + i);
>
> I get that we're mostly just hoisting everything in the loop body to
> the end of xfs_ifree_get_one_inode, but can that be part of a
> separate hoist patch?

Ok. Not really happy about breaking it up into even more fine-grained
patches as this patchset is already a nightmare to keep up to date
against all the random cleanups going into for-next.....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx