> > > IIUC, your problem is that there's another bdi that holds all the
> > > dirty pages, and this throttle loop never flushes pages from that
> > > other bdi and we sleep instead.  It seems to me that the fundamental
> > > problem is that to clean the pages we need to flush both bdi's, not
> > > just the bdi we are directly dirtying.
> >
> > This is what happens:
> >
> >   write fault on upper filesystem
> >     balance_dirty_pages
> >       submit write requests
> >       loop ...
>
> Isn't this loop transferring the dirty state from the upper
> filesystem to the lower filesystem?

What this loop is doing is putting write requests in the request
queue, and in so doing transforming page state from dirty to
writeback.

> What I don't see here is how the pages on this filesystem are not
> getting cleaned if the lower filesystem is being flushed properly.

Because the lower filesystem writes back one request, but then gets
stuck in balance_dirty_pages before returning.  So the write request
is never completed.

The problem is that balance_dirty_pages is waiting for the condition
that the global number of dirty+writeback pages goes below the
threshold.  But this condition can only be satisfied if
balance_dirty_pages() returns.

> I'm probably missing something big and obvious, but I'm not
> familiar with the exact workings of FUSE so please excuse my
> ignorance....
>
> >   ------- fuse IPC ---------------
> >   [fuse loopback fs thread 1]
>
> This is the lower filesystem?  Or a callback thread for
> doing the write requests to the lower filesystem?

This is the fuse daemon.  It's a normal process that reads requests
from /dev/fuse, serves these requests, then writes the reply back onto
/dev/fuse.  It is usually multithreaded, so it can serve many requests
in parallel.  The loopback filesystem serves the requests by issuing
the relevant filesystem syscalls on the underlying fs.

> >     read request
> >       sys_write
> >         mutex_lock(i_mutex)
> >         ...
> >           balance_dirty_pages
> >             submit write requests
> >             loop ...
> >               write requests completed ... dirty still over limit ...
> >               ... loop forever
>
> Hmmm - the situation in balance_dirty_pages() after an attempt
> to writeback_inodes(&wbc) that has written nothing because there
> is nothing to write would be:
>
>       wbc->nr_to_write == write_chunk &&
>       wbc->pages_skipped == 0 &&
>       wbc->encountered_congestion == 0 &&
>       !bdi_congested(wbc->bdi)
>
> What happens if you make that an exit condition to the loop?

That's almost right.  The only problem is that even if there's no
congestion, the device queue can be holding a great amount of yet
unwritten pages.  So exiting on this condition would mean that
dirty+writeback could go way over the threshold.

How much of a problem would this be?  I don't know; I guess it depends
on many things: how many queues there are, how many requests per
queue, and how many bytes per request.

> Or alternatively, adding another bit to the wbc structure to
> say "there was nothing to do" and setting that if we find
> list_empty(&sb->s_dirty) when trying to flush dirty inodes.
>
> [ FWIW, this may also solve another problem of fast block devices
> being throttled incorrectly when a slow block dev is consuming
> all the dirty pages... ]

There may be a patch floating around which I think basically does
this, but only as long as dirty+writeback are over a soft limit yet
under the hard limit.  When over the hard limit, balance_dirty_pages
still loops until dirty+writeback go below the threshold.

Thanks,
Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html