On Tue 08-10-13 15:14:09, Andrew Morton wrote:
> On Fri, 27 Sep 2013 11:44:40 +0200 Jan Kara <jack@xxxxxxx> wrote:
> 
> > When there are processes heavily creating small files while sync(2) is
> > running, it can easily happen that quite a few new files are created
> > between the WB_SYNC_NONE and WB_SYNC_ALL passes of sync(2). That can
> > happen especially if there are several busy filesystems (remember that
> > sync traverses filesystems sequentially and, in the WB_SYNC_ALL phase,
> > waits on one fs before starting on another). Because the WB_SYNC_ALL
> > pass is slow (e.g. it causes a transaction commit and cache flush for
> > each inode in ext3), the resulting sync(2) times are rather large.
> > 
> > The following script reproduces the problem:
> > 
> > function run_writers
> > {
> > 	for (( i = 0; i < 10; i++ )); do
> > 		mkdir $1/dir$i
> > 		for (( j = 0; j < 40000; j++ )); do
> > 			dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
> > 		done &
> > 	done
> > }
> > 
> > for dir in "$@"; do
> > 	run_writers $dir
> > done
> > 
> > sleep 40
> > time sync
> > ======
> > 
> > Fix the problem by disregarding inodes dirtied after sync(2) was called
> > in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes
> > a time stamp of when sync started, which is used for setting up work
> > for the flusher threads.
> > 
> > To give some numbers: when the above script is run on two ext4
> > filesystems on a simple SATA drive, the average sync time from 10 runs
> > is 267.549 seconds with standard deviation 104.799. With the patched
> > kernel, the average sync time from 10 runs is 2.995 seconds with
> > standard deviation 0.096.
> 
> We need to be really careful about this - it's easy to make mistakes
> and the consequences are nasty.
  Agreed. Nasty and hard to notice.
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -39,7 +39,7 @@
> >  struct wb_writeback_work {
> >  	long nr_pages;
> >  	struct super_block *sb;
> > -	unsigned long *older_than_this;
> > +	unsigned long older_than_this;
> >  	enum writeback_sync_modes sync_mode;
> >  	unsigned int tagged_writepages:1;
> >  	unsigned int for_kupdate:1;
> > @@ -248,8 +248,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
> > 
> >  	while (!list_empty(delaying_queue)) {
> >  		inode = wb_inode(delaying_queue->prev);
> > -		if (work->older_than_this &&
> > -		    inode_dirtied_after(inode, *work->older_than_this))
> > +		if (inode_dirtied_after(inode, work->older_than_this))
> >  			break;
> >  		list_move(&inode->i_wb_list, &tmp);
> >  		moved++;
> > @@ -791,12 +790,11 @@ static long wb_writeback(struct bdi_writeback *wb,
> >  {
> >  	unsigned long wb_start = jiffies;
> >  	long nr_pages = work->nr_pages;
> > -	unsigned long oldest_jif;
> >  	struct inode *inode;
> >  	long progress;
> > 
> > -	oldest_jif = jiffies;
> > -	work->older_than_this = &oldest_jif;
> > +	if (!work->older_than_this)
> > +		work->older_than_this = jiffies;
> 
> So wb_writeback_work.older_than_this==0 has special (and undocumented!)
> meaning. But 0 is a valid jiffies value (it occurs 5 minutes after
> boot, too). What happens?
> 
> If the caller passed in "jiffies" at that time, things will presumably
> work, by luck, because we'll overwrite the caller's zero with another
> zero. Most of the time - things might go wrong if jiffies increments
> to 1.
> 
> But what happens if the caller was kupdate, exactly 330 seconds after
> boot? Won't we overwrite the caller's "older than 330 seconds" with
> "older than 300 seconds" (or something like that)?
> 
> If this has all been thought through then let's explain how it works,
> please.
  Yes, I was thinking about this and the consequences seemed harmless to me.
If the submitter of 'work' sets older_than_this but by coincidence it ends
up being 0, we reset older_than_this to 'jiffies' in wb_writeback(). That
can result in writing more than we strictly need, but that doesn't cause
any harm (except for possibly slower execution of that work item).

> Perhaps it would be better to just stop using the
> wb_writeback_work.older_than_this==0 magic sentinel and add a new
> older_than_this_is_set:1 to the wb_writeback_work.
  You are right, that would be a cleaner solution. OTOH it creates the
possibility of someone setting older_than_this but forgetting to set
older_than_this_is_set. But all users are localized in fs/fs-writeback.c
and we don't create new users that often, so I guess it's OK.

Attached is a patch which implements what you've asked for. Either fold it
into my previous patch or carry it separately, I don't really mind.
Thanks for having a look at my patch.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
>From c1c60bd8e655a7db872473ad7436d957e3a7d5fd Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@xxxxxxx>
Date: Wed, 9 Oct 2013 15:41:50 +0200
Subject: [PATCH] writeback: Use older_than_this_is_set instead of magic
 older_than_this == 0

Currently we use 0 as a special value of work->older_than_this to
indicate that wb_writeback() should set work->older_than_this to the
current time. This works but it is a bit magic. So use a special flag
in the work structure for that instead. Also fix up writeback from the
workqueue rescuer to include all inodes.

Signed-off-by: Jan Kara <jack@xxxxxxx>
---
 fs/fs-writeback.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 70837dadad72..f3871e5c61aa 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -46,6 +46,7 @@ struct wb_writeback_work {
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
+	unsigned int older_than_this_is_set:1;
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -732,6 +733,8 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
 		.sync_mode	= WB_SYNC_NONE,
 		.range_cyclic	= 1,
 		.reason		= reason,
+		.older_than_this = jiffies,
+		.older_than_this_is_set = 1,
 	};
 
 	spin_lock(&wb->list_lock);
@@ -793,7 +796,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 	struct inode *inode;
 	long progress;
 
-	if (!work->older_than_this)
+	if (!work->older_than_this_is_set)
 		work->older_than_this = jiffies;
 
 	spin_lock(&wb->list_lock);
@@ -1356,6 +1359,7 @@ void sync_inodes_sb(struct super_block *sb, unsigned long older_than_this)
 		.sync_mode	= WB_SYNC_ALL,
 		.nr_pages	= LONG_MAX,
 		.older_than_this = older_than_this,
+		.older_than_this_is_set = 1,
 		.range_cyclic	= 0,
 		.done		= &done,
 		.reason		= WB_REASON_SYNC,
-- 
1.8.1.4