On Fri, 27 Sep 2013 11:44:40 +0200 Jan Kara <jack@xxxxxxx> wrote:

> When there are processes heavily creating small files while sync(2) is
> running, it can easily happen that quite some new files are created
> between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
> especially if there are several busy filesystems (remember that sync
> traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
> fs before starting it on another fs). Because WB_SYNC_ALL pass is slow
> (e.g. causes a transaction commit and cache flush for each inode in
> ext3), resulting sync(2) times are rather large.
>
> The following script reproduces the problem:
>
> function run_writers
> {
>   for (( i = 0; i < 10; i++ )); do
>     mkdir $1/dir$i
>     for (( j = 0; j < 40000; j++ )); do
>       dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
>     done &
>   done
> }
>
> for dir in "$@"; do
>   run_writers $dir
> done
>
> sleep 40
> time sync
> ======
>
> Fix the problem by disregarding inodes dirtied after sync(2) was called
> in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes a
> time stamp when sync has started which is used for setting up work for
> flusher threads.
>
> To give some numbers, when above script is run on two ext4 filesystems on
> simple SATA drive, the average sync time from 10 runs is 267.549 seconds
> with standard deviation 104.799426. With the patched kernel, the average
> sync time from 10 runs is 2.995 seconds with standard deviation 0.096.

We need to be really careful about this - it's easy to make mistakes
and the consequences are nasty.
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -39,7 +39,7 @@
>  struct wb_writeback_work {
>  	long nr_pages;
>  	struct super_block *sb;
> -	unsigned long *older_than_this;
> +	unsigned long older_than_this;
>  	enum writeback_sync_modes sync_mode;
>  	unsigned int tagged_writepages:1;
>  	unsigned int for_kupdate:1;
> @@ -248,8 +248,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
>
>  	while (!list_empty(delaying_queue)) {
>  		inode = wb_inode(delaying_queue->prev);
> -		if (work->older_than_this &&
> -		    inode_dirtied_after(inode, *work->older_than_this))
> +		if (inode_dirtied_after(inode, work->older_than_this))
>  			break;
>  		list_move(&inode->i_wb_list, &tmp);
>  		moved++;
> @@ -791,12 +790,11 @@ static long wb_writeback(struct bdi_writeback *wb,
>  {
>  	unsigned long wb_start = jiffies;
>  	long nr_pages = work->nr_pages;
> -	unsigned long oldest_jif;
>  	struct inode *inode;
>  	long progress;
>
> -	oldest_jif = jiffies;
> -	work->older_than_this = &oldest_jif;
> +	if (!work->older_than_this)
> +		work->older_than_this = jiffies;

So wb_writeback_work.older_than_this==0 has special (and undocumented!)
meaning. But 0 is a valid jiffies value (it occurs 5 minutes after
boot, too). What happens?

If the caller passed in "jiffies" at that time, things will presumably
work, by luck, because we'll overwrite the caller's zero with another
zero. Most of the time - things might go wrong if jiffies increments
to 1.

But what happens if the caller was kupdate, exactly 330 seconds after
boot? Won't we overwrite the caller's "older than 330 seconds" with
"older than 300 seconds" (or something like that)?

If this has all been thought through then let's explain how it works,
please. Perhaps it would be better to just stop using the
wb_writeback_work.older_than_this==0 magic sentinel and add a new
older_than_this_is_set:1 to the wb_writeback_work.
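To make the sentinel concern concrete, here is a minimal userspace sketch, not kernel code. It assumes a 32-bit jiffies counter and an illustrative tick rate of 1000 per second. The names jiffies_t, after(), work_sentinel, work_flagged, cutoff_sentinel(), cutoff_flagged() and run_example() are hypothetical; only older_than_this and older_than_this_is_set come from the discussion above. The sketch shows that a caller-supplied cutoff of exactly 0 is silently replaced under the "0 means unset" scheme, while an explicit flag keeps it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit jiffies counter. */
typedef uint32_t jiffies_t;

/* Wrap-safe "a is later than b", in the style of the kernel's time_after(). */
static bool after(jiffies_t a, jiffies_t b)
{
	return (int32_t)(uint32_t)(b - a) < 0;
}

/* Scheme from the patch: older_than_this == 0 means "not set by caller". */
struct work_sentinel {
	jiffies_t older_than_this;
};

static jiffies_t cutoff_sentinel(const struct work_sentinel *w, jiffies_t now)
{
	/* A genuine cutoff of 0 cannot be told apart from "unset". */
	return w->older_than_this ? w->older_than_this : now;
}

/* Suggested alternative: carry an explicit "is set" bit. */
struct work_flagged {
	jiffies_t older_than_this;
	unsigned int older_than_this_is_set:1;
};

static jiffies_t cutoff_flagged(const struct work_flagged *w, jiffies_t now)
{
	return w->older_than_this_is_set ? w->older_than_this : now;
}

/* Caller asks for cutoff 0; writeback starts 30000 ticks later. */
static void run_example(void)
{
	struct work_sentinel s = { .older_than_this = 0 };
	struct work_flagged f = { .older_than_this = 0,
				  .older_than_this_is_set = 1 };
	jiffies_t now = 30000;
	jiffies_t dirtied_when = 10000;	/* dirtied after the requested cutoff */

	/* Sentinel scheme: cutoff becomes "now", so the inode is written. */
	assert(!after(dirtied_when, cutoff_sentinel(&s, now)));
	/* Flag scheme: cutoff stays 0, so the inode is correctly skipped. */
	assert(after(dirtied_when, cutoff_flagged(&f, now)));
}
```

The flag costs one bit in a structure that already has several one-bit fields, and it removes the need to reason about which jiffies values a caller can legitimately pass.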