Re: [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()

Wu Fengguang <fengguang.wu@xxxxxxxxx> · Thu, 8 Oct 2009 09:58:22 +0800

On Thu, Oct 08, 2009 at 09:01:59AM +0800, KAMEZAWA Hiroyuki wrote:
> On Wed, 07 Oct 2009 15:38:36 +0800
> Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> 
> > As proposed by Chris, Dave and Jan, let balance_dirty_pages() wait for
> > the per-bdi flusher to writeback enough pages for it, instead of
> > starting foreground writeback by itself. By doing so we harvest two
> > benefits:
> > - avoid concurrent writeback of multiple inodes (Dave Chinner)
> >   If every thread doing writes and being throttled starts foreground
> >   writeback, it leads to N IO submitters from at least N different
> >   inodes at the same time, ending up with N different sets of IO being
> >   issued with potentially zero locality to each other, resulting in
> >   much lower elevator sort/merge efficiency, and hence we seek the disk
> >   all over the place to service the different sets of IO.
> >   OTOH, if there is only one submission thread, it doesn't jump between
> >   inodes in the same way when congestion clears - it keeps writing to
> >   the same inode, resulting in large related chunks of sequential IOs
> >   being issued to the disk. This is more efficient than the above
> >   foreground writeback because the elevator works better and the disk
> >   seeks less.
> > - avoid one constraint towards huge per-file nr_to_write
> >   The write_chunk used by balance_dirty_pages() should be small enough to
> >   prevent user-noticeable one-shot latency, i.e. each sleep/wait inside
> >   balance_dirty_pages() shall be small enough. When it starts its own
> >   writeback, it must specify a small nr_to_write. The throttle wait queue
> >   removes this dependency by the way.
> >
> 
> May I ask a question? (Maybe not directly related to this patch itself, sorry.)

Sure :)

> Recent work such as "writeback: switch to per-bdi threads for flushing data"
> removed congestion_wait() from balance_dirty_pages() and added
> schedule_timeout_interruptible().
> 
> And this one replaces it with wake_up+wait_queue.

Right. 
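
To make the wake_up+wait_queue idea concrete, here is a minimal userspace
sketch of the bookkeeping only. It is a single-threaded model, not the code
from this patch: the names throttle_waiter, throttle_queue, throttle_enqueue()
and throttle_account_written() are hypothetical, and the FIFO crediting order
is an assumption made for illustration. A throttled task queues itself with its
write_chunk, and the flusher credits the pages it has written to the waiters,
"waking" each one once its quota is covered.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8   /* arbitrary capacity for this model */

/* Hypothetical per-task record: one throttled writer waiting on the flusher. */
struct throttle_waiter {
	long pages_left;  /* pages the flusher must still write for this task */
	int  woken;       /* set to 1 when the flusher would wake the task    */
};

/* Hypothetical per-bdi FIFO of throttled tasks (head == tail means empty). */
struct throttle_queue {
	struct throttle_waiter *fifo[MAX_WAITERS];
	size_t head, tail;
};

/* Called by the throttled task instead of starting its own writeback. */
static int throttle_enqueue(struct throttle_queue *q,
			    struct throttle_waiter *w, long write_chunk)
{
	assert(write_chunk >= 0);
	if (q->tail - q->head == MAX_WAITERS)
		return -1;                      /* queue full */
	w->pages_left = write_chunk;
	w->woken = 0;
	q->fifo[q->tail % MAX_WAITERS] = w;
	q->tail++;
	return 0;
}

/*
 * Called by the single flusher after writing `written` pages.
 * Credits the pages to waiters in FIFO order and returns how many
 * waiters had their quota satisfied (i.e. would be woken up).
 */
static int throttle_account_written(struct throttle_queue *q, long written)
{
	int woken = 0;

	while (written > 0 && q->head != q->tail) {
		struct throttle_waiter *w = q->fifo[q->head % MAX_WAITERS];

		if (written < w->pages_left) {
			w->pages_left -= written;   /* partial credit, keep waiting */
			written = 0;
		} else {
			written -= w->pages_left;   /* quota covered, wake the task */
			w->pages_left = 0;
			w->woken = 1;
			q->head++;
			woken++;
		}
	}
	return woken;
}
```

In this model the waiter never issues IO itself, so the size of its
write_chunk only decides how long it sleeps, not how the IO is submitted.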

> IIUC, "iowait" cpustat data is calculated from runqueue->nr_iowait, as in
> == kernel/sched.c
> void account_idle_time(cputime_t cputime)
> {
>         struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
>         cputime64_t cputime64 = cputime_to_cputime64(cputime);
>         struct rq *rq = this_rq();
> 
>         if (atomic_read(&rq->nr_iowait) > 0)
>                 cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
>         else
>                 cpustat->idle = cputime64_add(cpustat->idle, cputime64);
> }
> ==
> Then, to show "cpu is in iowait", runqueue->nr_iowait has to be modified
> somewhere. In older kernels, congestion_wait() et al. did that by calling
> io_schedule_timeout().
> 
> How is this runqueue->nr_iowait handled now?

Good question. io_schedule() has an old comment for throttling IO wait:

         * But don't do that if it is a deliberate, throttling IO wait (this task
         * has set its backing_dev_info: the queue against which it should throttle)
         */
        void __sched io_schedule(void)

So it looks like both Jens' patch and this one behave correctly in skipping
the iowait accounting for balance_dirty_pages() :)
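
To illustrate the difference being discussed, here is a small userspace model
of the accounting, not kernel code. The struct cpu_model and the model_*()
helpers are hypothetical names, and the model assumes the CPU is otherwise
idle for the whole sleep. An io_schedule_timeout()-style sleep bumps
nr_iowait for its duration, so the idle time lands in "iowait"; a plain
schedule_timeout-style or wait-queue sleep leaves nr_iowait alone, so the
same time lands in "idle".

```c
#include <assert.h>

/* Hypothetical stand-in for the per-runqueue and per-cpu counters. */
struct cpu_model {
	int nr_iowait;              /* tasks currently sleeping "on IO" */
	unsigned long long iowait;  /* idle ticks charged as iowait     */
	unsigned long long idle;    /* idle ticks charged as plain idle */
};

/* Same decision as account_idle_time() quoted above. */
static void model_account_idle_time(struct cpu_model *cpu,
				    unsigned long long ticks)
{
	if (cpu->nr_iowait > 0)
		cpu->iowait += ticks;
	else
		cpu->idle += ticks;
}

/* Sleep the way io_schedule_timeout() does: counted in nr_iowait. */
static void model_io_sleep(struct cpu_model *cpu, unsigned long long ticks)
{
	cpu->nr_iowait++;
	model_account_idle_time(cpu, ticks);
	cpu->nr_iowait--;
	assert(cpu->nr_iowait >= 0);
}

/* Sleep without touching nr_iowait, as a plain timeout or wait queue does. */
static void model_plain_sleep(struct cpu_model *cpu, unsigned long long ticks)
{
	model_account_idle_time(cpu, ticks);
}
```

Starting from a zeroed cpu_model, model_io_sleep(&cpu, 5) followed by
model_plain_sleep(&cpu, 3) leaves 5 ticks in iowait and 3 in idle.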

Thanks,
Fengguang
