On Wed 09-10-13 16:02:11, Jan Kara wrote:
> On Tue 08-10-13 15:14:09, Andrew Morton wrote:
> > On Fri, 27 Sep 2013 11:44:40 +0200 Jan Kara <jack@xxxxxxx> wrote:
> > 
> > > When there are processes heavily creating small files while sync(2) is
> > > running, it can easily happen that quite some new files are created
> > > between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
> > > especially if there are several busy filesystems (remember that sync
> > > traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
> > > fs before starting it on another fs). Because WB_SYNC_ALL pass is slow
> > > (e.g. causes a transaction commit and cache flush for each inode in
> > > ext3), resulting sync(2) times are rather large.
> > > 
> > > The following script reproduces the problem:
> > > 
> > > function run_writers
> > > {
> > >   for (( i = 0; i < 10; i++ )); do
> > >     mkdir $1/dir$i
> > >     for (( j = 0; j < 40000; j++ )); do
> > >       dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
> > >     done &
> > >   done
> > > }
> > > 
> > > for dir in "$@"; do
> > >   run_writers $dir
> > > done
> > > 
> > > sleep 40
> > > time sync
> > > ======
> > > 
> > > Fix the problem by disregarding inodes dirtied after sync(2) was called
> > > in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes a
> > > time stamp when sync has started which is used for setting up work for
> > > flusher threads.
> > > 
> > > To give some numbers, when above script is run on two ext4 filesystems on
> > > simple SATA drive, the average sync time from 10 runs is 267.549 seconds
> > > with standard deviation 104.799426. With the patched kernel, the average
> > > sync time from 10 runs is 2.995 seconds with standard deviation 0.096.
> > 
> > We need to be really careful about this - it's easy to make mistakes
> > and the consequences are nasty.
>   Agreed. Nasty and hard to notice.
> 
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -39,7 +39,7 @@
> > >  struct wb_writeback_work {
> > >  	long nr_pages;
> > >  	struct super_block *sb;
> > > -	unsigned long *older_than_this;
> > > +	unsigned long older_than_this;
> > >  	enum writeback_sync_modes sync_mode;
> > >  	unsigned int tagged_writepages:1;
> > >  	unsigned int for_kupdate:1;
> > > @@ -248,8 +248,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
> > >  
> > >  	while (!list_empty(delaying_queue)) {
> > >  		inode = wb_inode(delaying_queue->prev);
> > > -		if (work->older_than_this &&
> > > -		    inode_dirtied_after(inode, *work->older_than_this))
> > > +		if (inode_dirtied_after(inode, work->older_than_this))
> > >  			break;
> > >  		list_move(&inode->i_wb_list, &tmp);
> > >  		moved++;
> > > @@ -791,12 +790,11 @@ static long wb_writeback(struct bdi_writeback *wb,
> > >  {
> > >  	unsigned long wb_start = jiffies;
> > >  	long nr_pages = work->nr_pages;
> > > -	unsigned long oldest_jif;
> > >  	struct inode *inode;
> > >  	long progress;
> > >  
> > > -	oldest_jif = jiffies;
> > > -	work->older_than_this = &oldest_jif;
> > > +	if (!work->older_than_this)
> > > +		work->older_than_this = jiffies;
> > 
> > So wb_writeback_work.older_than_this==0 has special (and undocumented!)
> > meaning.  But 0 is a valid jiffies value (it occurs 5 minutes after
> > boot, too).  What happens?
> > 
> > If the caller passed in "jiffies" at that time, things will presumably
> > work, by luck, because we'll overwrite the caller's zero with another
> > zero.  Most of the time - things might go wrong if jiffies increments
> > to 1.
> > 
> > But what happens if the caller was kupdate, exactly 330 seconds after
> > boot?  Won't we overwrite the caller's "older than 330 seconds" with
> > "older than 300 seconds" (or something like that)?
> > 
> > If this has all been thought through then let's explain how it works,
> > please.
>   Yes, I was thinking about this and consequences seemed harmless to me.
> If the submitter of 'work' sets older_than_this but by coincidence it ends
> up being 0, we reset older_than_this to 'jiffies' in wb_writeback(). That
> can result in writing more than we strictly need but that doesn't cause
> any harm (except for possibly slower execution of that work item).
> 
> > Perhaps it would be better to just stop using the
> > wb_writeback_work.older_than_this==0 magic sentinel and add a new
> > older_than_this_is_set:1 to the wb_writeback_work.
>   You are right it would be a cleaner solution. OTOH it creates a
> possibility for someone to set older_than_this but forget to set
> older_than_this_is_set. But all users are localized in fs/fs-writeback.c
> and we don't create new users that often. So I guess it's OK.
> 
> Attached is a patch which implements what you've asked for. Either fold it
> into my previous patch or carry it separately. I don't really mind. Thanks
> for having a look at my patch.
  I found out I've sent out an older version of the patch without a comment
in struct wb_writeback_work. So here's a newer version.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
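For reference, the ambiguity Andrew points out is real because 0 lies inside
the normal range of jiffies: the counter is initialized to INITIAL_JIFFIES
(-300*HZ cast to unsigned long) precisely so that wrap bugs show up early,
which is why it passes through 0 about five minutes after boot. The explicit
flag removes that ambiguity. A minimal sketch of the intended semantics
follows; the helper name is made up for illustration, the real check sits
inline in wb_writeback() in the attached patch:

	/*
	 * Sketch only: fall back to "expire everything dirtied before now"
	 * only when the submitter really did not choose a cut-off, even if
	 * the chosen cut-off happens to be the value 0.
	 */
	static void pick_older_than_this(struct wb_writeback_work *work)
	{
		if (!work->older_than_this_is_set)
			work->older_than_this = jiffies;
	}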
From 27d1017e2f7dd0e4a40f9ff38926a6a11cdd5cfb Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@xxxxxxx>
Date: Wed, 9 Oct 2013 15:41:50 +0200
Subject: [PATCH] writeback: Use older_than_this_is_set instead of magic
 older_than_this == 0

Currently we use 0 as a special value of work->older_than_this to indicate
that wb_writeback() should set work->older_than_this to the current time.
This works but it is a bit magic. So use a special flag in work_struct for
that. Also fixup writeback from workqueue rescuer to include all inodes.

Signed-off-by: Jan Kara <jack@xxxxxxx>
---
 fs/fs-writeback.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 70837dadad72..65e66caec76f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -39,6 +39,10 @@
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
+	/*
+	 * Write only inodes dirtied before this time. Don't forget to set
+	 * older_than_this_is_set when you set this.
+	 */
 	unsigned long older_than_this;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
@@ -46,6 +50,7 @@ struct wb_writeback_work {
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
+	unsigned int older_than_this_is_set:1;
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -732,6 +737,8 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
 		.sync_mode	= WB_SYNC_NONE,
 		.range_cyclic	= 1,
 		.reason		= reason,
+		.older_than_this = jiffies,
+		.older_than_this_is_set = 1,
 	};
 
 	spin_lock(&wb->list_lock);
@@ -793,7 +800,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 	struct inode *inode;
 	long progress;
 
-	if (!work->older_than_this)
+	if (!work->older_than_this_is_set)
 		work->older_than_this = jiffies;
 
 	spin_lock(&wb->list_lock);
@@ -1356,6 +1363,7 @@ void sync_inodes_sb(struct super_block *sb, unsigned long older_than_this)
 		.sync_mode	= WB_SYNC_ALL,
 		.nr_pages	= LONG_MAX,
 		.older_than_this = older_than_this,
+		.older_than_this_is_set = 1,
 		.range_cyclic	= 0,
 		.done		= &done,
 		.reason		= WB_REASON_SYNC,
-- 
1.8.1.4
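For completeness, the timestamp passed to sync_inodes_sb() comes from the
sync(2) path, which is converted by the earlier patch in this series (not
quoted here). A rough sketch of what that caller side is expected to look
like, assuming it keeps the existing sync_inodes_one_sb()/iterate_supers()
structure of fs/sync.c; details may differ from the actual patch:

	/*
	 * Sketch, not the real fs/sync.c change: sync(2) samples jiffies once
	 * and hands the same cut-off to every super block, so the WB_SYNC_ALL
	 * pass can skip inodes dirtied after the syscall started.
	 */
	static void sync_inodes_one_sb(struct super_block *sb, void *arg)
	{
		if (!(sb->s_flags & MS_RDONLY))
			sync_inodes_sb(sb, *(unsigned long *)arg);
	}

	SYSCALL_DEFINE0(sync)
	{
		unsigned long start = jiffies;	/* common cut-off for all filesystems */

		wakeup_flusher_threads(0, WB_REASON_SYNC);
		iterate_supers(sync_inodes_one_sb, &start);
		/* remainder of sys_sync() (sync_fs, block device flush) unchanged */
		return 0;
	}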