Hi Jan,

I just had a tester report to me that he'd bisected an XFS assert
failure on unmount to this commit:

# git bisect bad
c4a391b53a72d2df4ee97f96f78c1d5971b47489 is the first bad commit
commit c4a391b53a72d2df4ee97f96f78c1d5971b47489
Author: Jan Kara <jack@xxxxxxx>
Date:   Tue Nov 12 15:07:51 2013 -0800

    writeback: do not sync data dirtied after sync start

    When there are processes heavily creating small files while sync(2)
    is running, it can easily happen that quite some new files are
    created between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That
    can happen especially if there are several busy filesystems
    (remember that sync traverses filesystems sequentially and waits in
    WB_SYNC_ALL phase on one fs before starting it on another fs).
    Because WB_SYNC_ALL pass is slow (e.g. causes a transaction commit
    and cache flush for each inode in ext3), resulting sync(2) times
    are rather large.

The XFS assert failure was that on unmount there were delayed
allocation blocks accounted to an inode when it was being evicted,
i.e. during the evict_inodes() call in generic_shutdown_super() after
the filesystem had been synced. There should be no dirty data at this
point in time....

The writeback code is a maze of twisty passages, so I'm not 100% sure
of my reasoning - the call chain analysis I did is below so you can
confirm I didn't miss anything. Assuming I got it right, however....

What it looks like is that there's a problem with requeue_inode()
behaviour and redirty_page_for_writepage(). If write_cache_pages()
runs out of its writeback chunk, the inode is still dirty when it
returns to writeback_sb_inodes(). This leads to requeue_inode()
updating the dirtied_when field in the inode for WB_SYNC_NONE +
wbc->tagged_writepages, then calling redirty_tail() because
wbc->pages_skipped is set. That puts the inode back on b_dirty with a
timestamp after the sync started, so the WB_SYNC_ALL pass skips it and
hence it doesn't get synced to disk even though it should be. IOWs,
sync_filesystem() can silently fail to write dirty data to disk.

In this case, XFS is skipping pages because it can't get the inode
metadata lock without blocking on it, and we are in non-blocking mode
because we are doing WB_SYNC_NONE writeback. We could also check for
wbc->tagged_writepages, but nobody else does, nor does it fix the
problem of calling redirty_page_for_writepage() in WB_SYNC_ALL
conditions. None of the filesystems put writeback-control-specific
constraints on their calls to redirty_page_for_writepage(), so it
seems to me like it's a generic issue, not just an XFS issue.

Digging deeper, it looks to me like our sync code simply does not
handle redirty_page_for_writepage() being called under WB_SYNC_ALL
properly. requeue_inode() should never rewrite the timestamp of the
inode if we skipped pages, because that means we didn't write
everything we were supposed to for WB_SYNC_ALL or
wbc->tagged_writepages writeback. Further, if there are skipped pages
we should be pushing the inode to b_more_io, not b_dirty, so that we
do another pass on the inode and write back the skipped pages in this
writeback pass regardless of the WB_SYNC flags or the
wbc->tagged_writepages field.
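To make that concrete, here's an untested sketch of the shape of
change I have in mind for requeue_inode() - illustration only, not a
tested patch, and it assumes requeue_io() (the existing helper that
moves the inode onto b_more_io) is the right way to requeue it:

	/*
	 * Sync livelock prevention. Each inode is tagged and synced in
	 * one shot. If still dirty, it will be redirty_tail()'ed below.
	 * Update the dirty time to prevent enqueue and sync it again,
	 * but only if we wrote everything we were asked to write -
	 * skipped pages mean we still owe this inode writeback in the
	 * current pass.
	 */
	if ((inode->i_state & I_DIRTY) && !wbc->pages_skipped &&
	    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
		inode->dirtied_when = jiffies;

	if (wbc->pages_skipped) {
		/*
		 * Writeback is not making progress due to locked
		 * buffers. Put the inode on b_more_io so this writeback
		 * pass retries it, rather than on b_dirty with a new
		 * timestamp that the WB_SYNC_ALL pass will then skip.
		 */
		requeue_io(inode, wb);
		return;
	}

That would keep the sync livelock protection for inodes we actually
wrote in full, but guarantee that an inode with skipped pages gets
revisited before the flush work completes rather than being silently
deferred past the sync.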
What do you think, Jan?

Cheers,

Dave.

Call chain analysis:

unmount
  generic_drop_super()
    sync_filesystem
      __sync_filesystem(sb, 0, start_time)
        writeback_inodes_sb(sb, WB_REASON_SYNC);
          writeback_inodes_sb_nr(sb, (all dirty pages), WB_REASON_SYNC);
            struct wb_writeback_work work = {
                    .sb                = sb,
                    .sync_mode         = WB_SYNC_NONE,
                    .tagged_writepages = 1,
                    ....
            bdi_queue_work(sb->s_bdi, &work);
            ....
            wait_for_completion()
      __sync_filesystem(sb, 1, start_time)

bdi flusher:

bdi_writeback_workfn
  wb_do_writeback(wb);
    wb_writeback(wb, work);
      writeback_sb_inodes(work->sb, wb, work);
        struct writeback_control wbc = {
                .sync_mode         = work->sync_mode,
                .tagged_writepages = work->tagged_writepages,
                .....
        wbc.nr_to_write = write_chunk;
        .....
        __writeback_single_inode(inode, &wbc); {
          do_writepages(mapping, wbc);
            generic_writepages(mapping, wbc);
              write_cache_pages(mapping, wbc, __writepage, mapping) {
                tag_pages_for_writeback(PAGECACHE_TAG_TOWRITE);
                xfs_vm_writepage(page, wbc, mapping);
                  xfs_map_blocks(nonblocking = 1); {
                    if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
                            if (nonblocking)
                                    return -XFS_ERROR(EAGAIN);
                    }
                    if (err)
                            goto error;
                    ......
                    if (err == EAGAIN)
                            goto redirty;
                    .....
                    redirty_page_for_writepage(wbc, page);
                      wbc->pages_skipped++;
                  }
                ....
                if (wbc->nr_to_write-- <= 0 && SYNC_NONE) {
                        done = 1;
                        break;
                }
              }
              // exit write_cache_pages with wbc->pages_skipped > 0 and
              // inode is still dirty
        }
        // back up to writeback_sb_inodes() now, and it does:
        requeue_inode(inode, wb, &wbc);
          ....
          /*
           * Sync livelock prevention. Each inode is tagged and synced in one
           * shot. If still dirty, it will be redirty_tail()'ed below. Update
           * the dirty time to prevent enqueue and sync it again.
           */
          if ((inode->i_state & I_DIRTY) &&
              (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
                  inode->dirtied_when = jiffies;

          if (wbc->pages_skipped) {
                  /*
                   * writeback is not making progress due to locked
                   * buffers. Skip this inode for now.
                   */
                  redirty_tail(inode, wb);
                  return;
          }
            redirty_tail(inode, wb);
              list_move(&inode->i_wb_list, &wb->b_dirty);

-- 
Dave Chinner
david@xxxxxxxxxxxxx