Re: [PATCH 1/2] writeback: Improve busyloop prevention

On Wed, Oct 19, 2011 at 07:56:30PM +0800, Jan Kara wrote:
> On Tue 18-10-11 22:35:04, Wu Fengguang wrote:
> > On Tue, Oct 18, 2011 at 08:51:28AM +0800, Jan Kara wrote:
> > > On Sat 15-10-11 00:28:07, Wu Fengguang wrote:
> > > > On Sat, Oct 15, 2011 at 12:00:47AM +0800, Wu Fengguang wrote:
> > > > > On Fri, Oct 14, 2011 at 04:18:35AM +0800, Jan Kara wrote:
> > > > > > On Thu 13-10-11 22:39:39, Wu Fengguang wrote:
> > > > > > > > > +       long pause = 1;
> > > > > > > > > +       long max_pause = dirty_writeback_interval ?
> > > > > > > > > +                          msecs_to_jiffies(dirty_writeback_interval * 10) :
> > > > > > > > > +                          HZ;
> > > > > > > >
> > > > > > > > It's better not to put the flusher to sleep for more than 10ms, so that
> > > > > > > > when the condition changes, we don't risk leaving the storage idle for
> > > > > > > > too long.
> > > > > > >
> > > > > > > Yeah, the one big regression case
> > > > > > >
> > > > > > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue6+
> > > > > > > ------------------------  ------------------------
> > > > > > >                    47.07       -15.5%        39.78  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
> > > > > > >
> > > > > > > is exactly caused by the large sleep: the attached graphs are showing
> > > > > > > one period of no-progress on the number of written pages.
> > > > > >   Thanks for the tests! Interesting. Do you have trace file from that run?
> > > > > > I see the writeback stalled for 20s or so which is more than
> > > > > > dirty_writeback_centisecs so I think something more complicated must have
> > > > > > happened.
> > > > >
> > > > > I noticed that
> > > > >
> > > > > 1) the global dirty limit is exceeded (dirty=286, limit=256), hence
> > > > >    the dd tasks are hard blocked in balance_dirty_pages().
> > > > >
> > > > >        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > > >
> > > > > 2) the flusher thread is not woken up because we test writeback_in_progress()
> > > > >    in balance_dirty_pages().
> > > > >
> > > > >                 if (unlikely(!writeback_in_progress(bdi)))
> > > > >                         bdi_start_background_writeback(bdi);
> > > > >
> > > > > Thus the flusher thread waits and waits, as in the trace below.
> > > > >
> > > > >        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > > >        flush-8:0-1170  [004]   211.068428: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > > > >        flush-8:0-1170  [004]   211.068428: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > >        flush-8:0-1170  [004]   211.068440: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=9 index=0 to_write=1024 wrote=0
> > > > >        flush-8:0-1170  [004]   211.068442: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > >        flush-8:0-1170  [004]   211.068443: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > >
> > > > >        flush-8:0-1170  [004]   213.110122: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > > >        flush-8:0-1170  [004]   213.110126: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > > > >        flush-8:0-1170  [004]   213.110126: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > >        flush-8:0-1170  [004]   213.110134: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=11 index=0 to_write=1024 wrote=0
> > > > >        flush-8:0-1170  [004]   213.110135: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > >        flush-8:0-1170  [004]   213.110135: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > >
> > > > >        flush-8:0-1170  [004]   217.193470: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > > >        flush-8:0-1170  [004]   217.193471: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > > > >        flush-8:0-1170  [004]   217.193471: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > >        flush-8:0-1170  [004]   217.193483: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=15 index=0 to_write=1024 wrote=0
> > > > >        flush-8:0-1170  [004]   217.193485: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > >
> > > > It's still puzzling why the dirty pages remain at 286 and do not get
> > > > cleaned by either flusher thread (local XFS or NFSROOT) for so
> > > > long..
> > >   I was looking at this as well. So the reason why pages were not cleaned
> > > by the flusher thread is that there were 2 dirty inodes and the inode with
> > > dirty pages had i_dirtied_when newer than the time when we started this
> > > background writeback. Thus the running background writeback work always
> > > included only the other inode which has no dirty pages but I_DIRTY_SYNC set.
> >
> > Yes, that's very likely, given that the background work can run for a very
> > long time and there are still code paths that redirty_tail() the inode.
> >
> > This sounds horrible -- as time goes by, more and more inodes could be
> > excluded from the background writeback due to inode->dirtied_when
> > being touched undesirably.
>   Yes, but it's not only about touching inode->i_dirtied_when. It can also
> be the case that the inode started to be dirty only after we started the
> writeback. For example in your run, I'm pretty confident from the traces
> that that's what happened.

Yeah, that's possible because the dirty threshold is merely 1MB, so
one inode could be cleaned completely (while the dd is throttled)
and then made dirty again.

> This behavior has been possible for a long time (I
> think I introduced it in 7624ee72aa09334af072853457a5d46d9901c3f8) and is a
> result of how we do livelock avoidance using older_than_this (or wb_start
> previously).

I'd say the commit itself is a step in the right direction.
Unfortunately the writeback logic is a bit more twisted than we envisioned..

> It's just that when you don't have a long-running work that cannot
> progress, it is mostly hidden (it might only be observable as a small
> inefficiency of background writeback).

Yes.

> > > Apparently XFS is stubborn and refuses to write the inode although we try
> > > rather hard. That is probably because the dd writing to this inode is stuck
> > > in balance_dirty_pages() while holding the ilock - which is unfortunate behavior
> > > but what can we do...
> >
> > Yeah, and there may well be other known or unknown long-lived blocking
> > cases. What we can do in the VFS is to be less picky about these conditions...
>   True.
> 
> > To be frank I still like the requeue_io_wait() approach. It can
> > trivially replace _all_ redirty_tail() calls and hence avoid touching
> > inode->dirtied_when -- keeping that time stamp sane is the one thing I
> > would really like to do.
>   As I pointed out above, it's not only about redirty_tail(), so I believe we
> should solve this problem regardless of whether we use requeue_io_wait() or

Good point. Yeah, we need to make the background work accept newly
dirtied inodes. This could make the flusher less focused on the
older pages, but there seems to be no obviously right solution...

> other busyloop prevention - do you agree with the attached patch? If yes,
> add it to your patch queue for the next merge window please.

OK, I'll think about it.

>   BTW: I think you cannot avoid _all_ redirty_tail() calls - in situations
> where the inode was really redirtied you do need to redirty_tail() it. But
> I agree you can remove all redirty_tail() calls which are there because we
> failed to make any progress with the inode.

That's right. In general we only convert redirty_tail(possibly blocked
inode) to requeue_io_wait(). In the case of a failed
grab_super_passive(), it's not even the current inode that is blocked,
but _something_ is blocked. We still have to requeue_io_wait() the
inode to avoid busy retrying.
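To make the intent concrete, here is a hypothetical userspace sketch of the requeue_io_wait() idea. The structs and the singly linked list are simplified stand-ins for the kernel's inode/list_head machinery, not the real fs/fs-writeback.c code; the point is only that the inode moves to a b_more_io_wait list while its dirtied_when stamp stays untouched:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified models of the kernel structures (hypothetical names). */
struct inode_model {
	unsigned long dirtied_when;	/* time stamp we want to preserve */
	struct inode_model *next;
};

struct wb_model {
	struct inode_model *b_io;		/* inodes queued for writeback */
	struct inode_model *b_more_io_wait;	/* possibly blocked, retry later */
};

/*
 * Move the head of b_io to b_more_io_wait.  Unlike redirty_tail(),
 * inode->dirtied_when is deliberately left unchanged, so the inode
 * remains eligible for the running background work once unblocked.
 */
static void requeue_io_wait(struct wb_model *wb)
{
	struct inode_model *inode = wb->b_io;

	if (!inode)
		return;
	wb->b_io = inode->next;
	inode->next = wb->b_more_io_wait;
	wb->b_more_io_wait = inode;
}
```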

In the below case,

               if (inode->i_sb != sb) {
                        if (work->sb) {
                                /*
                                 * We only want to write back data for this
                                 * superblock, move all inodes not belonging
                                 * to it back onto the dirty list.
                                 */
                                redirty_tail(inode, wb);
                                continue;
                        }

It would be better to _not_ enqueue that inode in the first place inside
move_expired_inodes(). Then we could safely convert that redirty_tail()
to requeue_io() to skip the irrelevant inodes already in s_io/s_more_io.
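A userspace sketch of that filtering idea (hypothetical, with the superblock modeled as a plain integer id and the queue as an index array; the real move_expired_inodes() works on list_heads and expiry times):

```c
#include <assert.h>

enum { MAX_INODES = 16 };

/*
 * Model of move_expired_inodes() filtering by superblock: when the work
 * targets one superblock (work_sb != 0), inodes belonging to other
 * superblocks are simply never enqueued, so writeback_sb_inodes() never
 * has to redirty_tail() them.  work_sb == 0 models work->sb == NULL
 * (write back everything).  Returns the number of inodes enqueued;
 * their indices are stored in 'out'.
 */
static int move_expired_inodes(const int *sb_of_inode, int n,
			       int work_sb, int *out)
{
	int i, moved = 0;

	for (i = 0; i < n; i++) {
		if (work_sb && sb_of_inode[i] != work_sb)
			continue;	/* skip, instead of redirty_tail() */
		out[moved++] = i;
	}
	return moved;
}
```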

> > It also narrows the set of possibly blocked inodes, and hence the need
> > for heuristic wait-and-retry, to the minimum.
> >
> > It does introduce one more list, which IMHO is more tolerable than the
> > problems it fixes. And its maximum delay can be reduced explicitly
> > if necessary: when there are heavy dirtiers, it will quickly be woken
> > up by balance_dirty_pages(); in other cases we can let the kupdate
> > work check s_more_io_wait before aborting to ensure 5s max delay:
> >
> > @@ -863,7 +863,7 @@ static long wb_check_old_data_flush(stru
> >
> >         expired = wb->last_old_flush +
> >                         msecs_to_jiffies(dirty_writeback_interval * 10);
> > -       if (time_before(jiffies, expired))
> > +       if (time_before(jiffies, expired) && list_empty(&wb->b_more_io_wait))
> >                 return 0;
> >
> >         wb->last_old_flush = jiffies;
>   So after seeing complications with my code I was also reconsidering your
> approach. Having an additional list should be OK if we document the logic of
> all the lists in one place in detail.

Actually, without the "requeue on I_SYNC" case, we could do without
s_more_io_wait and still prevent the unnecessary changes to
dirtied_when: if we do requeue_io() for all inodes and still find no
progress after a full iteration of b_io, we can be fairly sure that
the inodes remaining in b_more_io are stuck on some blocking
condition, and we can modify the loop break condition to account for that.

> So the remaining problem I have is
> the uncertainty about when writeback will be retried. Your change above
> guarantees one immediate retry in case we were not doing for_kupdate
> writeback when the filesystem refused to write the inode, with the next retry
> after dirty_writeback_interval, which seems too late. But if we trigger kupdate
> work earlier to retry blocked inodes, your method would be a viable
> alternative - something like (completely untested) second attached patch?

Yeah! I have a similar patch for shortening the retry interval from
dirty_writeback_interval=5s to dirty_writeback_interval/10=500ms:

@@ -967,9 +967,14 @@ int bdi_writeback_thread(void *data)
                        continue;
                }

-               if (wb_has_dirty_io(wb) && dirty_writeback_interval)
-                       schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
-               else {
+               if (wb_has_dirty_io(wb) && dirty_writeback_interval) {
+                       unsigned long t;
+                       if (!list_empty(&wb->b_more_io_wait))
+                               t = msecs_to_jiffies(dirty_writeback_interval);
+                       else
+                               t = msecs_to_jiffies(dirty_writeback_interval * 10);
+                       schedule_timeout(t);
+               } else {
                        /*
                         * We have nothing to do, so can go sleep without any
                         * timeout and save power. When a work is queued or

However your adaptive sleep seems better, and it also covers the
dirty_writeback_interval=0 case.

> With that I find both approaches mostly equivalent, so if yours passes
> testing and you like it more, then I'm fine with that.

OK, thanks!

> > >   I think the patch you suggest in the other email does not fix the above
> > > scenario (although it is useful for reducing latency so I'll include it -
> > > thanks for it!).  Probably you were just lucky enough not to hit it in your
> > > next run. What I'd suggest is to refresh oldest_jif in wb_writeback() when
> > > we do not make any progress with writeback. Thus we allow freshly dirtied
> > > inodes to be queued when we cannot make progress with the current set of
> > > inodes. The resulting patch is attached.
> >
> > Refreshing oldest_jif should be safe for the background work. However
> > there is a risk of livelock for other works, as the oldest_jif test is
> > there precisely to prevent livelocks..
>   Yes, I know, I just figured that refreshing oldest_jif only when we could
> not make progress (and thus we know there is no other work to do because we
> would break out of the loop in that case) is safe. But it is subtle and
> actually I've realized that refreshing oldest_jif for background writeback
> is enough so let's just do that.

OK.
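The agreed rule can be summed up in a tiny userspace sketch (hypothetical function; in the kernel this would live inside the wb_writeback() loop, with 'jiffies' being the global tick counter rather than a parameter):

```c
#include <assert.h>

/*
 * Refresh the older_than_this cutoff only for background writeback that
 * made no progress through b_io, so freshly dirtied inodes become
 * eligible.  All other works keep their original cutoff, preserving
 * the livelock avoidance (e.g. the WB_SYNC_ALL case quoted below).
 */
static unsigned long refresh_oldest_jif(unsigned long oldest_jif,
					int for_background,
					long progress,
					unsigned long jiffies)
{
	if (for_background && !progress)
		return jiffies;		/* allow newly dirtied inodes */
	return oldest_jif;		/* keep livelock protection */
}
```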

> > For one thing, it will break this code:
> >
> >                 /*
> >                  * Sync livelock prevention. Each inode is tagged and synced in
> >                  * one shot. If still dirty, it will be redirty_tail()'ed below.
> >                  * Update the dirty time to prevent enqueue and sync it again.
> >                  */
> >                 if ((inode->i_state & I_DIRTY) &&
> >                     (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
> >                         inode->dirtied_when = jiffies;
>   Hum, right, but for WB_SYNC_ALL writeback the filesystem had better not
> refuse to write back any inode. Other things would break horribly. And it
> certainly is not an issue when we refresh the timestamp only for background
> writeback.

Ah yes!

> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 04cf3b9..6e909a9 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -699,8 +699,11 @@ static long wb_writeback(struct bdi_writeback *wb,
> > >     unsigned long wb_start = jiffies;
> > >     long nr_pages = work->nr_pages;
> > >     unsigned long oldest_jif;
> > > -   struct inode *inode;
> > >     long progress;
> > > +   long pause = 1;
> > > +   long max_pause = dirty_writeback_interval ?
> > > +                      msecs_to_jiffies(dirty_writeback_interval * 10) :
> > > +                      HZ;
> >
> > There seems to be no strong reason to prefer dirty_writeback_interval over
> > HZ. So how about using the simpler form "max_pause = HZ"?
>   Well, dirty_writeback_interval actually makes some sense here. It is the
> interval at which the user wants the flusher thread to recheck the dirtiness
> situation. So it defines the latency the user expects from the flusher thread.
> Thus after this time
> we definitely want to break out of any work which cannot progress. If
> dirty_writeback_interval is shorter than 1s, we would break this by using
> HZ in the above code. Also I wanted to avoid introducing another magic
> value to writeback code when there is tunable which makes sense and can
> be used... I can add a comment explaining this if you want.

Good point.
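For reference, a userspace model of the max_pause computation being discussed (msecs_to_jiffies() approximated as ms * HZ / 1000, with HZ passed in; the rounding of the real kernel helper is ignored in this sketch):

```c
#include <assert.h>

/*
 * dirty_writeback_interval is in centiseconds, so interval * 10 is the
 * interval in milliseconds; a zero interval (periodic writeback
 * disabled) falls back to one second, i.e. HZ jiffies.
 */
static long max_pause_jiffies(unsigned int dirty_writeback_interval, long hz)
{
	if (!dirty_writeback_interval)
		return hz;
	return (long)(dirty_writeback_interval * 10) * hz / 1000;
}
```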

Thanks,
Fengguang
