On Wed 12-10-11 10:44:36, Wu Fengguang wrote: > On Wed, Oct 12, 2011 at 05:53:59AM +0800, Jan Kara wrote: > > On Tue 11-10-11 10:36:38, Wu Fengguang wrote: > > > On Tue, Oct 11, 2011 at 07:30:07AM +0800, Jan Kara wrote: > > > > On Mon 10-10-11 19:31:30, Wu Fengguang wrote: > > > > > On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote: > > > > > > Hi Fengguang, > > > > > > > > > > > > On Sat 08-10-11 12:00:36, Wu Fengguang wrote: > > > > > > > The test results look not good: btrfs is heavily impacted and the > > > > > > > other filesystems are slightly impacted. > > > > > > > > > > > > > > I'll send you the detailed logs in private emails (too large for the > > > > > > > mailing list). Basically I noticed many writeback_wait traces that never > > > > > > > appear w/o this patch. > > > > > > OK, thanks for running these tests. I'll have a look at detailed logs. > > > > > > I guess the difference can be caused by changes in redirty/requeue logic in > > > > > > the second patch (the changes in the first patch could possibly make > > > > > > several writeback_wait events from one event but never could introduce new > > > > > > events). > > > > > > > > > > > > I guess I'll also try to reproduce the problem since it should be pretty > > > > > > easy when you see such a huge regression even with 1 dd process on btrfs > > > > > > filesystem. > > > > > > > > > > > > > In the btrfs cases that see larger regressions, I see large fluctuations > > > > > > > in the writeout bandwidth and long disk idle periods. It's still a bit > > > > > > > puzzling how all these happen.. > > > > > > Yes, I don't understand it yet either... > > > > > > > > > > Jan, it's obviously caused by this chunk, which is not really > > > > > necessary for fixing Christoph's problem. So the easy way is to go > > > > > ahead without this chunk. > > > > Yes, thanks a lot for debugging this! I'd still like to understand why > > > > the hunk below is causing such a big problem to btrfs. I was looking into > > > > the traces and all I could find so far was that for some reason relevant > > > > inode (ino 257) was not getting queued for writeback for a long time (e.g. > > > > 20 seconds) which introduced disk idle times and thus bad throughput. But I > > > > don't understand why the inode was not queue for such a long time yet... > > > > Today it's too late but I'll continue with my investigation tomorrow. > > > > > > Yeah, I have exactly the same observation and puzzle.. > > OK, I dug more into this and I think I found an explanation. The problem > > starts at > > flush-btrfs-1-1336 [005] 20.688011: writeback_start: bdi btrfs-1: > > sb_dev 0:0 nr_pages=23685 sync_mode=0 kupdate=1 range_cyclic=1 background=0 > > reason=periodic > > in the btrfs trace you sent me when we start "kupdate" style writeback > > for bdi "btrfs-1". This work then blocks flusher thread upto moment: > > flush-btrfs-1-1336 [007] 45.707479: writeback_start: bdi btrfs-1: > > sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0 > > reason=periodic > > flush-btrfs-1-1336 [007] 45.707479: writeback_written: bdi btrfs-1: > > sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0 > > reason=periodic > > > > (i.e. for 25 seconds). The reason why this work blocks flusher thread for > > so long is that btrfs has "btree inode" - essentially an inode holding > > filesystem metadata and btrfs ignores any ->writepages() request for this > > inode coming from kupdate style writeback. So we always try to write this > > inode, make no progress, requeue inode (as it has still mapping tagged as > > dirty), see that b_more_io is nonempty so we sleep for a while and then > > retry. We do not include inode 257 with real dirty data into writeback > > because this is kupdate style writeback and inode 257 does not have dirty > > timestamp old enough. This loop would break either after 30s when inode > > with data becomes old enough or - as we see above - at the moment when > > btrfs decided to do transaction commit and cleaned metadata inode by it's > > own methods. In either case this is far too late... > > Yes indeed. Good catch! > > The implication of this case is, never put an inode to b_more_io > unless made some progress on cleaning some pages or the metadata. Well, it is not so simple. The problem really is a lack of information propagation from filesystem to writeback code. It may well happen (similar to XFS case Christoph described), that we failed to write anything because of lock contention or similar rather temporary reason. In that case, retrying via b_more_io would be fine. On the other hand it may be case that we never succeed to write anything from the inode like btrfs shows, and finally, it may even be the case that even though we succeed to write something the inode remains dirty e.g. because it is metadata inode which gets dirty on each filesystem change. In this light, even current code which does requeue_io() when nr_to_write <= 0 is in theory livelockable by a constantly dirty inode which has always enough pages to write. But so far noone observed this condition in practice so I wouldn't be too worried... > Failing to do so will lead to > > - busy looping (which can be fixed by patch 1/2 "writeback: Improve busyloop prevention") > > - block the current work (and in turn the other queued works) for long time, > where the other pending works may tend to work on a different set of > inodes or have different criteria for the FS to make progress. The > existing examples are the for_kupdate test in btrfs and the SYNC vs > ASYNC tests in general. And I'm planning to send writeback works > from the vmscan code to write a specific inode.. > > In this sense, it looks not the right direction to convert the > redirty_tail() calls to requeue_io(). Well, there has to be some balance. Surely giving up on inodes where we have problems with doing writeback easily prevents livelocks but on the other hand it introduces the type of problems Christoph saw with XFS - some inodes can get starved from writeback. > If we change redirty_tail() to the earlier proposed requeue_io_wait(), > all the known problems can be solved nicely. Right, so I see that using requeue_io_wait() will avoid livelocks and the possibility of starvation will be lower than with redirty_tail(). But what I don't like about your implementation is the uncertainty when writeback will be retried (using requeue_io_wait() may effectively mean putting off writeback of the inode for dirty_writeback_interval = 5 seconds) and also more subtle logic of list handling. But you inspired me how we could maybe implement requeue_io_wait() logic without these problems: When we have writeback work which cannot make progress, we could check whether there's other work to do (plus if we are doing for_kupdate work, we'll check whether for_background work would be needed) and if yes, we just finish current work and continue with the new work. Then requeue_io() for inode which cannot make progress will effectively work like requeue_io_wait() but we'll get well defined retry times and no additional list logic. What do you think? I'll send updated patches in a moment... > > So for now I don't see a better alternative than to revert to old > > behavior in writeback_single_inode() as you suggested earlier. That way we > > would redirty_tail() inodes which we cannot clean and thus they won't cause > > livelocking of kupdate work. > > requeue_io_wait() can equally avoid touching inode->dirtied_when :) > > > Longer term we might want to be more clever in > > switching away from kupdate style writeback to pure background writeback > > but it's not yet clear to me what the logic should be so that we give > > enough preference to old inodes... > > We'll need to adequately update older_than_this in the wb_writeback() > loop for background work. Then we can make the switch. > > > @@ -583,10 +597,10 @@ static long writeback_sb_inodes(struct super_block *sb, > > wrote++; > > if (wbc.pages_skipped) { > > /* > > - * writeback is not making progress due to locked > > - * buffers. Skip this inode for now. > > + * Writeback is not making progress due to unavailable > > + * fs locks or similar condition. Retry in next round. > > */ > > - redirty_tail(inode, wb); > > + requeue_io(inode, wb); > > } > > spin_unlock(&inode->i_lock); > > spin_unlock(&wb->list_lock); > > In the case writeback_single_inode() just redirty_tail()ed the inode, > it's not good to requeue_io() it here. So I'd suggest to keep the > original code, or remove the if(pages_skipped){} block totally. Good point. I'll just remove that if. Hmm, which will make pages_skipped unused so we can remove them altogether. I can post that as a separate patch. Honza -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html