On Thu, Oct 02, 2014 at 08:27:55AM -0500, Mark Tinguely wrote: > On 10/01/14 17:34, Dave Chinner wrote: > >On Wed, Oct 01, 2014 at 04:18:02PM -0500, Mark Tinguely wrote: > >>Commit "xfs: remove all the inodes on a buffer from the AIL in bulk" > >>made the xfs inode flush callback more efficient by combining all > >>the inode writes on the buffer and the deletions of the inode log > >>item from AIL. > >> > >>The initial loop in this patch should be looping through all > >>the log items on the buffer to see which items have > >>xfs_iflush_done as their callback function. But currently, > >>only the log item passed to the function has its callback > >>compared to xfs_iflush_done. If the log item pointer passed to > >>the function does have the xfs_iflush_done callback function, > >>then all the log items on the buffer are removed from the > >>li_bio_list on the buffer b_fspriv and could be removed from > >>the AIL eventhough they may have not been written yet. > > > >Looks like a bug, but what I don't know from this description is > >the symptoms and impact of this bug being hit? Is there a risk of > >filesystem corruption on crash or power loss? Perhaps it's a data > >loss issue? I can't tell, and for anyone scanning the commit logs to > >determine if they need to backport the fix will be asking the same > >questions. > > > >Also, is there a reproducable test case for it? > > I was looking in this code for a way an inode could be removed from > the AIL but not written to disk. > > I have a metadata dump that shows a truncate on two inodes just > before a clean unmount. The free space btrees are updated but > neither inode has been updated. A clean shutdown will wait until the > AIL is empty, And until all the inodes are reclaimed and buffers written. If the inode is dirty in reclaim (i.e. has outstanding dirty fields in the log item) then it will be written - the presence in the AIL is not necessary for this to occur. These dirty flags are only cleared when the inode is flushed to the buffer and attached to it, so the presence of these fields assumes the inode is in the AIL. i.e. if reclaim didn't write the inodes out, it means the last operation that happened on these inodes was xfs_iflush() and they were cleaned and written to disk correctly. > so something removed the inode from the AIL but did > not write the latest changes to disk. Yes, there were earlier > changes to inode/chunk in previous pushes. Let's look at this from a different perspective: how does the inode get attached to the buffer in the first place so it can be removed from the AIL in xfs_iflush_done()? The only way inodes get attached to the buffer through xfs_buf_attach_iodone() via either xfs_iflush() or xfs_ifree_cluster(). xfs_iflush() is only called when an inode is either being pushed from the AIL or being reclaimed and it is dirty. xfs_ifree_cluster() is only called when an inode chunk is being deallocated. Hence, any inode on the buffer iodone list is either currently being written or the extent it lies in has just been freed and it will never get written. When an inode chunk is being freed via xfs_ifree_cluster(), all the inodes attached to the buffer have their iodone functions changed to xfs_istale_done(). Hence we have the situation where all the inodes on a buffer have the same callbacks: either xfs_istale_done() or xfs_iflush_done(). And on buffer IO completion, all the inodes on the buffer should be cleaned at the same time, and hence all should have their completions run. If the inode is modified by another transaction while the push is in progress, the inode will be moved forward in the AIL. It will have a more recent LSN than the flush in progress. Hence this check in xfs_iflush_done() when determining if the inode should be removed from the AIL: if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) { log_items[i++] = blip; } i.e. it only removes the item from the AIL if the flush LSN is unchanged from when the inode was flushed to the buffer. Hence I can't see how the existing code can result in removing a dirty inode from the AIL without it being written to the buffer first (and hence cleaned), nor if such a problem occurred how such a dirty inode can make it through reclaim during unmount without being seen as dirty and written to disk. I agree that we should fix the code to be exactly correct, but I can't see how the code as is stands behaves any differently from the fixed version, nor how it would be responsible for the problem in the metadump you analysed.... I'll rework the commit message appropriately. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs