Hi Jan,

I just had a tester report to me that he'd bisected an XFS assert
failure on unmount to this commit:

# git bisect bad
c4a391b53a72d2df4ee97f96f78c1d5971b47489 is the first bad commit
commit c4a391b53a72d2df4ee97f96f78c1d5971b47489
Author: Jan Kara <jack@xxxxxxx>
Date:   Tue Nov 12 15:07:51 2013 -0800

    writeback: do not sync data dirtied after sync start

    When there are processes heavily creating small files while sync(2)
    is running, it can easily happen that quite some new files are
    created between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That
    can happen especially if there are several busy filesystems
    (remember that sync traverses filesystems sequentially and waits in
    WB_SYNC_ALL phase on one fs before starting it on another fs).
    Because WB_SYNC_ALL pass is slow (e.g. causes a transaction commit
    and cache flush for each inode in ext3), resulting sync(2) times
    are rather large.

The XFS assert failure was that on unmount there were delayed
allocation blocks accounted to an inode when it was being evicted,
i.e. during the evict_inodes() call in generic_shutdown_super() after
the filesystem had been synced. There should be no dirty data at this
point in time....

The writeback code is a maze of twisty passages, so I'm not 100% sure
of my reasoning - the call chain analysis I did is below so you can
confirm I didn't miss anything. Assuming I got it right, however....

What it looks like is that there's a problem with requeue_inode()
behaviour and redirty_page_for_writepage(). If write_cache_pages()
runs out of its writeback chunk, the inode is still dirty when it
returns to writeback_sb_inodes(). This leads to requeue_inode()
updating the dirtied_when field in the inode for WB_SYNC_NONE +
wbc->tagged_writepages, then calling redirty_tail() because
wbc->pages_skipped is set. That puts the inode back on b_dirty with a
timestamp after the sync started, so the WB_SYNC_ALL pass skips it and
hence it doesn't get synced to disk even though it should be. IOWs,
sync_filesystem() can silently fail to write dirty data to disk.

In this case, XFS is skipping pages because it can't get the inode
metadata lock without blocking on it, and we are in non-blocking mode
because we are doing WB_SYNC_NONE writeback. We could also check for
wbc->tagged_writepages, but nobody else does, nor does it fix the
problem of calling redirty_page_for_writepage() in WB_SYNC_ALL
conditions. None of the filesystems put writeback-control-specific
constraints on their calls to redirty_page_for_writepage(), so it
seems to me like it's a generic issue, not just an XFS issue.

Digging deeper, it looks to me like our sync code simply does not
handle redirty_page_for_writepage() being called under WB_SYNC_ALL
properly. requeue_inode() should never rewrite the timestamp of the
inode if we skipped pages, because that means we didn't write
everything we were supposed to for WB_SYNC_ALL or
wbc->tagged_writepages writeback. Further, if there are skipped pages
we should be pushing the inode to b_more_io, not b_dirty, so that we
do another pass on the inode and write back the skipped pages in this
writeback pass regardless of the WB_SYNC flags or the
wbc->tagged_writepages field.
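To make that concrete, here's an untested sketch of the shape of
change I have in mind for requeue_inode() - illustration only, not a
tested patch, and it assumes requeue_io() (the existing helper that
moves the inode onto b_more_io) is the right way to requeue it:

	/*
	 * Sync livelock prevention. Each inode is tagged and synced in
	 * one shot. If still dirty, it will be redirty_tail()'ed below.
	 * Update the dirty time to prevent enqueue and sync it again,
	 * but only if we wrote everything we were asked to write -
	 * skipped pages mean we still owe this inode writeback in the
	 * current pass.
	 */
	if ((inode->i_state & I_DIRTY) && !wbc->pages_skipped &&
	    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
		inode->dirtied_when = jiffies;

	if (wbc->pages_skipped) {
		/*
		 * Writeback is not making progress due to locked
		 * buffers. Put the inode on b_more_io so this writeback
		 * pass retries it, rather than on b_dirty with a new
		 * timestamp that the WB_SYNC_ALL pass will then skip.
		 */
		requeue_io(inode, wb);
		return;
	}

That would keep the sync livelock protection for inodes we actually
wrote in full, but guarantee that an inode with skipped pages gets
revisited before the flush work completes rather than being silently
deferred past the sync.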
What do you think, Jan?

Cheers,

Dave.

Call chain analysis:

unmount
  generic_drop_super()
    sync_filesystem
      __sync_filesystem(sb, 0, start_time)
        writeback_inodes_sb(sb, WB_REASON_SYNC);
          writeback_inodes_sb_nr(sb, (all dirty pages), WB_REASON_SYNC);
            struct wb_writeback_work work = {
                    .sb                = sb,
                    .sync_mode         = WB_SYNC_NONE,
                    .tagged_writepages = 1,
                    ....
            bdi_queue_work(sb->s_bdi, &work);
            ....
            wait_for_completion()
      __sync_filesystem(sb, 1, start_time)

bdi flusher:

bdi_writeback_workfn
  wb_do_writeback(wb);
    wb_writeback(wb, work);
      writeback_sb_inodes(work->sb, wb, work);
        struct writeback_control wbc = {
                .sync_mode         = work->sync_mode,
                .tagged_writepages = work->tagged_writepages,
                .....
        wbc.nr_to_write = write_chunk;
        .....
        __writeback_single_inode(inode, &wbc); {
          do_writepages(mapping, wbc);
            generic_writepages(mapping, wbc);
              write_cache_pages(mapping, wbc, __writepage, mapping) {
                tag_pages_for_writeback(PAGECACHE_TAG_TOWRITE);
                xfs_vm_writepage(page, wbc, mapping);
                  xfs_map_blocks(nonblocking = 1); {
                    if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
                            if (nonblocking)
                                    return -XFS_ERROR(EAGAIN);
                    }
                    if (err)
                            goto error;
                    ......
                    if (err == EAGAIN)
                            goto redirty;
                    .....
                    redirty_page_for_writepage(wbc, page);
                      wbc->pages_skipped++;
                  }
                ....
                if (wbc->nr_to_write-- <= 0 && SYNC_NONE) {
                        done = 1;
                        break;
                }
              }
              // exit write_cache_pages with wbc->pages_skipped > 0 and
              // inode is still dirty
        }
        // back up to writeback_sb_inodes() now, and it does:
        requeue_inode(inode, wb, &wbc);
          ....
          /*
           * Sync livelock prevention. Each inode is tagged and synced in one
           * shot. If still dirty, it will be redirty_tail()'ed below. Update
           * the dirty time to prevent enqueue and sync it again.
           */
          if ((inode->i_state & I_DIRTY) &&
              (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
                  inode->dirtied_when = jiffies;

          if (wbc->pages_skipped) {
                  /*
                   * writeback is not making progress due to locked
                   * buffers. Skip this inode for now.
                   */
                  redirty_tail(inode, wb);
                  return;
          }
            redirty_tail(inode, wb);
              list_move(&inode->i_wb_list, &wb->b_dirty);

-- 
Dave Chinner
david@xxxxxxxxxxxxx